PL/Java functions before 1.5.1 have been able to access a value of XML type as
a String object. That has been workable, but an extra burden if porting code
that used the JDBC 4.0 java.sql.SQLXML API, and with notable shortcomings.
PostgreSQL stores XML values serialized according to server_encoding, and
depending on that setting, conversion to a Java String can involve
transcoding.
XML has rules to handle characters that may be representable in one encoding
but not another, but the String conversion is unaware of them, and may fail
to produce a transcoding that represents the same XML value.
While a database design using XML may be such that each XML datum is
individually very small, it is also easy to store—or generate in
queries—large XML values. When mapped to a Java String, such an XML value
must have its full, uncompressed, character-serialized size allocated
on the Java heap and be copied there from native memory, before the Java
code even begins to make use of it. Even in cases where the Java processing to
be done could be organized to stream through parse events in constant-bounded
memory, the String representation forces the entire XML value to occupy Java
memory at once. Any tuning of PL/Java's heap size allowance could have to
consider a worst-case estimate of that size, or risk failures at run time.
java.sql.SQLXML APIPL/Java 1.5.1 adds support for this API. A Java parameter or return type in a
a PL/Java function can be declared to be SQLXML, and such objects can be
retrieved from ResultSet and SQLInput objects, and used as
PreparedStatement parameters or in SQLOutput and updatable ResultSet
objects.
SQLXML objectAn SQLXML instance can have the “conceptual states” readable and not
readable, writable and not writable. In PL/Java, an instance passed in as a
parameter to a function, or retrieved from a ResultSet, is readable and not
writable, and can be used as input to Java processing using any of the
following methods:
getBinaryStream()InputStream with the raw, byte-stream-serialized XML, which will
have to be passed to an XML parser. The parser will have to determine the
encoding used from the declaration at the start of the stream, or assume
UTF-8 if there is none, as the standard provides.getCharacterStream()getBinaryStream but as a stream of Java characters, with the underlying
encoding already decoded. May be convenient for use with parsing code that
isn't able to recognize and honor the encoding declaration, but any standard
XML parser would work as well from getBinaryStream, which should be
preferred when possible. A parser working from the binary stream is able to
handle transcoding, if needed, in an XML-aware way. With this method, any
needed transcoding is done without XML-awareness to produce the character
stream.getString()String. Has the
same memory footprint and encoding implications discussed for the legacy
conversion to String, but may be convenient for some purposes or for
values known to be small.getSource(javax.xml.transform.stream.StreamSource.class)Source object, directly usable with Java XML transformation APIs.getSource(javax.xml.transform.sax.SAXSource.class)Source object that presents the XML in parsed form via the SAX API,
where the caller can register callback methods for XML constructs of
interest, and then have Java stream through the XML value, calling those
methods.getSource(javax.xml.transform.sax.StAXSource.class)Source object that presents the XML in parsed form via the StAX
API, where the value can be streamed through by calling StAX pull methods
to get one XML construct at a time. Java code written to this API can more
clearly reflect the expected structure of the XML document, compared to
code written in the callback style for SAX.getSource(javax.xml.transform.sax.DOMSource.class)Source object presenting the XML fully parsed as a navigable,
in-memory DOM tree.getSource(null)Source object of a type chosen by the implementation. Useful when
the Source object will be passed to a standard Java transformation API,
which can handle any of the above forms, letting the SQLXML implementation
choose one that it implements efficiently.Exactly one of these methods can be called exactly once on a readable SQLXML
object, which is thereafter not readable. (The not readable state prevents
a second call to any of the getter methods; it does not, of course, prevent
reading the XML content through the one stream, String, or Source obtained
from the getter method that was just called.)
Except in the String or DOM form, which bring the entire XML value into Java
memory at once, the XML content is streamed directly from native PostgreSQL
memory as Java code reads it, never accumulating in the Java heap unless that
is what the application code does with it. Java heap sizing, therefore, can
be based on just what the application Java code will do with the data.
The most convenient API to use in an application will often be SAX or StAX, in which the code can operate at the level of already-parsed, natural XML constructs. Code designed to work with a navigable DOM tree can easily obtain that form (but it should be understood that DOM will pull the entire content into Java memory at once, in a memory-hungry form that can easily be twenty times the size of the serialized value).
SQLXML objectTo obtain a readable instance, declare java.sql.SQLXML as the type of a
function parameter where PostgreSQL will pass an XML argument, or use the
getSQLXML or getObject(..., SQLXML.class) methods on a ResultSet, or the
readSQLXML or readObject(SQLXML.class) methods on SQLInput. A fully
JDBC-4.0 compliant driver would also return SQLXML instances from the
non-specific getObject and readObject methods, but in PL/Java, those have
historically returned String. Because 1.5.1 is not a major release, their
behavior has not changed, and the more-specific methods must be used to obtain
SQLXML instances.
SQLXML objectPL/Java will supply an empty SQLXML instance that is writable and not
readable via the Connection method createSQLXML(). It can be used as an
output destination for any of several Java XML APIs, through a selection of
set... methods exactly mirroring the available get... methods described
above.
The API is unusual: except for setString, which takes a String parameter
and returns void as a typical “setter” method would, the other setter methods
are used for the object they return—an OutputStream, Writer, or
Result—which the calling code should then use to add content to the XML
value.
Exactly one setter method can be called exactly once on a writable SQLXML
object, which is thereafter not writable. (The not writable state prevents
a second call to any setter method; XML content must still be written via the
stream or Result obtained from the one setter that was just called, except
in the case of setString, which populates the value at once.) Content being
written to the SQLXML object is accumulated in PostgreSQL native memory,
not the Java heap.
A SQLXML object, once it has been fully written and closed, can be
returned from a Java function, passed as a PreparedStatement parameter to a
nested query, or stored into writable ResultSets used for composite function
or trigger results. It can be used exactly once in any of those ways, which
transfer its ownership back to PostgreSQL, leaving it inaccessible from Java.
SQLXML object is considered closedA writable SQLXML object cannot be presented to PostgreSQL before it is
closed to confirm that writing is complete. (One written by setString is
considered written, closed, and ready to use immediately.)
When it is written using a stream obtained from setBinaryStream,
setCharacterStream, or
setResult(javax.xml.transform.stream.StreamResult.class), it
is considered closed when the stream's close method is called.
This will typically not be done by a Java Transformer with the stream
as its result, and so should be explicitly called after such a transformation
completes.
When written using a SAXResult, it is considered closed when the
ContentHandler's endDocument method is called, and when written using a
StAXResult, it is considered closed when the XMLStreamWriter's
writeEndDocument method is called. When one of these flavors of Result is
used with a Java Transformer, these methods will have been called in the
normal course of the transformation, so nothing special needs to be done after
the transformation completes.
What it means to close a DOMResult is murkier. The application code must
call the DOMResult's setNode method, passing what will be the root node of
the result document. This can be done before or after (or while) child nodes and
content are added to that node. However, to avoid undefined behavior,
application code must make no further modification to that DOM tree after the
SQLXML object has been presented to PostgreSQL (whether via a
PreparedStatement set method, ResultSet update method,
SQLOutput write method, or returned as the function result).
Result object as a Transformer resultClasses that extend javax.xml.transform.Transformer will generally accept
any flavor of Result object and select the right API to write the
transformation result to it. There is often no need to care which Result
flavor to provide, so it is common to call setResult(null) to let the
SQLXML implementation itself choose a flavor based on implementation-specific
efficiency considerations.
In the case of a DOMResult, if the Result object is simply passed to a
Transformer without calling setNode first, the Transformer itself will
put an empty Document node there, which is then populated with the
transformation results.
A Document node, however, enforces conformance to the strict rules of
XML(DOCUMENT) form (described below). If the content to be written will
conform only to the looser rules of XML(CONTENT) form, application code should
call setNode supplying an empty DocumentFragment node, before passing the
Result object to a Transformer.
The flavor of Result returned by setResult(null) will never
(in PL/Java) be DOMResult.
SQLXML object as a written oneThe general rule that only a writable instance (that has been written and
closed) can be used as a function result, or passed into a nested query, admits
one exception, allowing a readable instance that Java code has obtained but
not read. That makes it simple for Java code to obtain an SQLXML instance
passed in as a parameter, or from a query, and use it directly as a result or a
nested-query parameter. Any one instance can be used this way no more than once.
XML(DOCUMENT) and XML(CONTENT)An XML value in SQL can have the type XML(DOCUMENT) or XML(CONTENT) (as
those are defined in the ISO SQL standard, 2006 and later), which PostgreSQL
does not currently treat as distinguishable types. The DOCUMENT form must have
exactly one root element, may have a document-type declaration (DTD), and has
strict limits on where other
constructs (other than comments and processing instructions) can occur. A value
in CONTENT form may have no root element, or more than one element at top
level, and other constructs such as character data outside of a root element
where DOCUMENT form would not allow them.
Java code using a readable SQLXML instance as input should be prepared to
encounter either form (unless it has out-of-band knowledge of which form will be
supplied). If it requests a DOMSource, getNode() will return a Document
node, if the value met all the requirements for DOCUMENT, or a
DocumentFragment node, if it was parsable as CONTENT. Java code requesting a
SAXSource or StAXSource should be prepared to handle a sequence of
constructs that might not be encountered when parsing a strictly conforming
DOCUMENT. Java code that requests an InputStream, Reader, String, or
StreamSource will be on its own to parse the data in whichever form appears.
In DOCUMENT form, any whitespace outside of the single root element is
considered markup, not character data. When the value is parsable as DOCUMENT,
and read through PL/Java's SAXSource or StAXSource, no whitespace that
occurs outside of the root element will be reported to the application.
PL/Java's DOMSource will present a Document node with no whitespace
text-node children outside of the root element.
If the value parses as CONTENT, PL/Java's DOMSource will present a
DocumentFragment node with all character data, including whitespace,
preserved. The streaming operation of the SAXSource and StAXSource is more
complicated, and lossy for whitespace (only if it occurs outside of any element)
ahead of the first parse event that would not be possible in DOCUMENT form.
All whitespace beyond that point is preserved.
Java code using a writable SQLXML instance to produce a result may write
either DOCUMENT or CONTENT form. If using DOMResult, it must supply a
DocumentFragment node to produce a CONTENT result, as a Document node will
enforce the DOCUMENT requirements.
SQLXML object has transaction lifetimeThe JDBC spec provides that an SQLXML instance is “valid for the duration of
the transaction in which it was created.” One PL/Java function can hold an
SQLXML instance (in a static or session variable or data structure), and other
PL/Java functions called later in the same transaction can continue reading from
or writing to it. If the transaction has committed or rolled back, those
operations will generate an exception.
Once a writable SQLXML object, or an unread, readable one, has been
presented to PostgreSQL as the result of a PL/Java function or through a
PreparedStatement/ResultSet/SQLOutput setter method, it is no longer
accessible in Java.
During a transaction, resources held by a SQLXML object are reclaimed as soon
as a readable one has been fully read, or a writable one has been presented
to PostgreSQL and PostgreSQL is done with it. If application code holds a
readable SQLXML object that it determines it will not read, or a writable one
it will not present to PostgreSQL, it can call the free method to allow the
resources to be reclaimed sooner than the transaction's end.
PostgreSQL can represent large XML values in “TOASTed” form, which may be in
memory but compressed (XML typically compresses to a small fraction of its
serialized size), or may be a small pointer to a location in storage. A
readable SQLXML instance over a TOASTed value will not be detoasted until
Java code actually begins to read it, so the memory footprint of an instance
being held but not yet read is kept low.
Some of the methods by which a writable instance can be written are not
XML-specific APIs, but allow arbitrary content to be written (as a String,
Writer, or OutputStream). When written by those methods, type safety is
upheld by verifying that the written content can be successfully reparsed,
accepting either DOCUMENT or CONTENT form.
It remains possible to declare the Java type String for function parameters
and returns of XML type, and to retrieve and supply String for ResultSet
columns and PreparedStatement parameters of XML type. This legacy mapping
from String to XML uses PostgreSQL's xml_in function to verify the form of a
String from Java. That function may reject some valid values if the server
configuration variable xmloption is not first set to DOCUMENT or CONTENT
to match the type of the value.
Java's XML APIs support validation using a choice of schema languages; support for XML Schema 1.0 is included in the Java runtime, and implementations of others can be placed on the class path.
A schema method is available through the “Extended API to configure
XML parsers” described below, but will only work on a SAXSource or DOMSource
(or a StreamResult, which uses a SAX parser to validate the stream written).
Other limitations are described under “known limitations” below.
More flexibly, javax.xml.validation.Validator or
javax.xml.validation.ValidatorHandler can be used in more situations and with
fewer limitations.
In symmetry to using Java String for SQL XML types, PL/Java allows the Java
SQLXML type to be used with PostgreSQL data of type text. This allows full
use of the Java XML APIs even in PostgreSQL instances built without XML support.
All of the SQLXML behaviors described above also apply in this usage.
If a readable SQLXML instance obtained from a text value is directly used
to set or return a value of PostgreSQL's XML type, the XML-ness of the content
is verified.
java.sql.SQLXML APIRetrieving or verifying the XML content in an SQLXML object can involve
applying an XML parser. The full XML specification includes features that can
require an XML parser to retrieve external resources or consume unexpected
amounts of memory. The full feature support may be an asset in an environment
where the XML content will always be from a known, trusted source, or a
liability if less is known about the XML content being processed.
The Open Web Application Security Project (OWASP) advocates for the default use of settings that strictly limit the related features of Java XML parsers, as outlined in a “cheat sheet” the organization publishes.
However, the recommended defaults really are severely restrictive (for example,
disabling document-type declarations by default will cause PL/Java's SQLXML
implementation to reject all XML values that contain DTDs). Therefore, there
must be a simple and clear way for code to selectively adjust the settings, or
adopting the strictest settings by default would pose an unacceptable burden to
developers.
The traditional Java way to adjust the XML parser is overwhelmingly fiddly,
involving setFeature or setProperty calls that identify the feature to be
set by passing an arcane URI that might be found in the documentation, or the
cheat sheet, or cargo-culted from some other code base. In some cases,
the streamlined SQLXML API conceals the steps where adjustments would have
to be applied. With no better way to adjust the parser, it would be an
unrealistic developer burden to adopt the restrictive defaults and expect the
developer to relax them.
Therefore, PL/Java has an extension API documented at the
org.postgresql.pljava.Adjusting.XML class. With the API, it is possible
to obtain a Source object from an SQLXML instance sqx in either the
standard or extended way shown in this example for a SAXSource:
SAXSource src = sqx.getSource(SAXSource.class); // OR
SAXSource src = sqx.getSource(Adjusting.XML.SAXSource.class)
.allowDTD(true).get();
The first form would obtain a SAXSource configured with the restrictive,
OWASP-recommended defaults, which would reject any content with a DTD. The
second form would obtain a SAXSource configured to allow a DTD in the
content, with other parser features left at the restrictive defaults.
Additional security-related adjustments have appeared in various Java releases, and are described in the Java API for XML Processing Security Guide. They include a number of configurable limits on maximum sizes and nesting depths, and limits to the set of protocols allowable for fetching external resources. Corresponding methods are provided in PL/Java's API. Also see “known limitations” below.
EntityResolver or SchemaMethods are provided to set an EntityResolver that controls how a SAX or DOM
parser resolves references to external entities, or a Schema by which a SAX
or DOM parser can validate content while parsing. Corresponding methods are
supplied in PL/Java's API, but are implemented only when operating on a
SAXSource or DOMSource (or StreamResult, affecting its validation of
the content written).
For StAX, control of resolution is done with a slightly different class,
XMLResolver, which can be set on a StAX parser as an ordinary property;
this can be done with PL/Java's setFirstMatchingProperty method.
A StAX parser cannot have a Schema directly assigned, but can be used
with a javax.xml.validation.Validator.
Complete details can be found in the API documentation.
When running on Java 9 or later, a local XML Catalog can be set up to
efficiently and securely resolve what would otherwise be external resource
references. The registration of a Catalog on a Java 9 or later parser involves
only existing methods for setting features/properties, as described
in the Catalog API documentation, and can be done with the
setFirstSupportedFeature and setFirstSupportedProperty methods
in PL/Java's Adjusting API.
When running on Java 22 or later, there is also a fallback catalog that can
satisfy requests for a small number of DTDs that are defined by the Java
platform. The behavior when this fallback resolver cannot satisfy a request
can be configured by setting the jdk.xml.jdkcatalog.resolve property, for
which, again, the setFirstSupportedProperty method can be used.
SQLXML instanceWhen a SQLXML instance is returned from a PL/Java function, or passed in to
a PL/Java ResultSet or PreparedStatement, it is used directly if it is an
instance of PL/Java's internal implementation.
However, a PL/Java function might reasonably use another JDBC driver and obtain
a SQLXML instance from a connection to some other database. If such a
‘foreign’ SQLXML object is returned from a function, or passed to a PL/Java
ResultSet or PreparedStatement, its content must first be copied to a new
instance created by PL/Java's driver. This happens transparently (but implies
that the ‘foreign’ instance must be in readable state at the time, and
afterward will not be).
The transparent copy is made by passing null as sourceClass to the foreign
object's getSource method, so the foreign object is in control of the type of
Source it will return. PL/Java will copy from a StreamSource, SAXSource,
StAXSource, or DOMSource. In the case of a StreamSource, an XML parser
will be involved, either to verify that the stream is XML, or to parse and
reserialize it if necessary to adapt its encoding to the server's. The parser
used by default will have the default, restrictive settings.
To allow adjustment of those settings, the copying operation can be invoked
explicitly through the Adjusting.XML.SourceResult class. For example, when
sx is a ‘foreign’ SQLXML object, the transparent operation
return sx;
is equivalent to
return conn.createSQLXML().setResult(Adjusting.XML.SourceResult.class)
.set(sx.getSource(null)).get().getSQLXML();
where conn is the PL/Java JDBC connection named by
jdbc:default:connection. To adjust the parser settings, as usual, adjusting
methods can be chained after the set and before the get. The explicit form
also allows passing a sourceClass other than null to the foreign object's
getSource method, if there is a reason not to let the foreign object choose
the type of Source to return.
SQLXML views of non-XML dataThere are the beginnings of a feature supporting XML views of non-XML data, so that some data types that are not XML, but are similarly tree-structured, can be manipulated in Java using Java's extensive support for XML.
StAX supportPL/Java's StAXSource supplies an XMLStreamReader that only supports the
expected usage pattern:
while ( streamReader.hasNext() )
{
streamReader.next();
/* methods that query state of the current parse event */
}
It would be unexpected to reorder that pattern so that queries of the current
event occur after hasNext but before next, and may produce
IllegalStateExceptions or incorrect results from a StAXSource supplied
by PL/Java.
StAX with TrAX (Java's transformation API)The javax.xml.transform APIs are required to accept any of a specified
four types of Source and Result: StreamSource, DOMSource, SAXSource,
or StAXSource (and their Result counterparts). However, StAX was a later
addition to the family. While TrAX is a mature and reliable transformation
API, and StAX is well suited for direct use in new code that will parse or
generate XML, the handful of internal bridge classes that were added
to the Java runtime for StAX and TrAX interoperation are not dependable,
especially when handling XML(CONTENT). When supplying a Source or Result
to a Transformer, a variant other than StAX should be chosen whenever
possible, whether PL/Java's or any other implementation.
For convenience, the SQLXML API allows passing a null value to getSource
or setResult, allowing the implementation to choose the type of Source
or Result to supply. PL/Java's implementation will never supply a StAX
variant when not explicitly requested.
The processing done “behind the curtain” to be able to handle XML(CONTENT)
and XML(DOCUMENT) form, when the form is not known in advance, can have
some visible effects when combined with the newer security limit
adjustments, or schema set on a SAX or DOM parser. For example, a very tight
setting of maxElementDepth may reveal that elements in the input are
nested one level deeper than expected, or a very tight maxXMLNameLimit may
reject a document whose expected names are all shorter. Schema validation for
some schemas and schema languages may likewise report an unexpected element
at the root of the document.
Issues with maxElementDepth or maxXMLNameLimit can be avoided by using
generous settings chosen to limit extreme resource consumption rather than
trying to set them as tightly as possible.
Problems with schema validation when assigning a Schema directly to the
SAX or DOM parser can be alleviated by using a javax.xml.validation.Validator
or ValidatorHandler instead, layered over PL/Java's parser, where it will
see the expected view of the content.