PL/Java functions before 1.5.1 have been able to access a value of XML type as a String
object. That has been workable, but an extra burden if porting code that used the JDBC 4.0 java.sql.SQLXML
API, and with notable shortcomings.
PostgreSQL stores XML values serialized according to server_encoding
, and depending on that setting, conversion to a Java String
can involve transcoding.
XML has rules to handle characters that may be representable in one encoding but not another, but the String
conversion is unaware of them, and may fail to produce a transcoding that represents the same XML value.
While a database design using XML may be such that each XML datum is individually very small, it is also easy to store—or generate in queries—large XML values. When mapped to a Java String
, such an XML value must have its full, uncompressed, character-serialized size allocated on the Java heap and be copied there from native memory, before the Java code even begins to make use of it. Even in cases where the Java processing to be done could be organized to stream through parse events in constant-bounded memory, the String
representation forces the entire XML value to occupy Java memory at once. Any tuning of PL/Java’s heap size allowance could have to consider a worst-case estimate of that size, or risk failures at run time.
java.sql.SQLXML
APIPL/Java 1.5.1 adds support for this API. A Java parameter or return type in a a PL/Java function can be declared to be SQLXML
, and such objects can be retrieved from ResultSet
and SQLInput
objects, and used as PreparedStatement
parameters or in SQLOutput
and updatable ResultSet
objects.
SQLXML
objectAn SQLXML
instance can have the “conceptual states” readable and not readable, writable and not writable. In PL/Java, an instance passed in as a parameter to a function, or retrieved from a ResultSet
, is readable and not writable, and can be used as input to Java processing using any of the following methods:
getBinaryStream()
InputStream
with the raw, byte-stream-serialized XML, which will have to be passed to an XML parser. The parser will have to determine the encoding used from the declaration at the start of the stream, or assume UTF-8 if there is none, as the standard provides.getCharacterStream()
getBinaryStream
but as a stream of Java characters, with the underlying encoding already decoded. May be convenient for use with parsing code that isn’t able to recognize and honor the encoding declaration, but any standard XML parser would work as well from getBinaryStream
, which should be preferred when possible. A parser working from the binary stream is able to handle transcoding, if needed, in an XML-aware way. With this method, any needed transcoding is done without XML-awareness to produce the character stream.getString()
String
. Has the same memory footprint and encoding implications discussed for the legacy conversion to String
, but may be convenient for some purposes or for values known to be small.getSource(javax.xml.transform.stream.StreamSource.class)
Source
object, directly usable with Java XML transformation APIs.getSource(javax.xml.transform.sax.SAXSource.class)
Source
object that presents the XML in parsed form via the SAX API, where the caller can register callback methods for XML constructs of interest, and then have Java stream through the XML value, calling those methods.getSource(javax.xml.transform.sax.StAXSource.class)
Source
object that presents the XML in parsed form via the StAX API, where the value can be streamed through by calling StAX pull methods to get one XML construct at a time. Java code written to this API can more clearly reflect the expected structure of the XML document, compared to code written in the callback style for SAX.getSource(javax.xml.transform.sax.DOMSource.class)
Source
object presenting the XML fully parsed as a navigable, in-memory DOM tree.getSource(null)
Source
object of a type chosen by the implementation. Useful when the Source
object will be passed to a standard Java transformation API, which can handle any of the above forms, letting the SQLXML
implementation choose one that it implements efficiently.Exactly one of these methods can be called exactly once on a readable SQLXML
object, which is thereafter not readable. (The not readable state prevents a second call to any of the getter methods; it does not, of course, prevent reading the XML content through the one stream, String
, or Source
obtained from the getter method that was just called.)
Except in the String
or DOM form, which bring the entire XML value into Java memory at once, the XML content is streamed directly from native PostgreSQL memory as Java code reads it, never accumulating in the Java heap unless that is what the application code does with it. Java heap sizing, therefore, can be based on just what the application Java code will do with the data.
The most convenient API to use in an application will often be SAX or StAX, in which the code can operate at the level of already-parsed, natural XML constructs. Code designed to work with a navigable DOM tree can easily obtain that form (but it should be understood that DOM will pull the entire content into Java memory at once, in a memory-hungry form that can easily be twenty times the size of the serialized value).
SQLXML
objectTo obtain a readable instance, declare java.sql.SQLXML
as the type of a function parameter where PostgreSQL will pass an XML argument, or use the getSQLXML
or getObject(..., SQLXML.class)
methods on a ResultSet
, or the readSQLXML
or readObject(SQLXML.class)
methods on SQLInput
. A fully JDBC-4.0 compliant driver would also return SQLXML
instances from the non-specific getObject
and readObject
methods, but in PL/Java, those have historically returned String
. Because 1.5.1 is not a major release, their behavior has not changed, and the more-specific methods must be used to obtain SQLXML
instances.
SQLXML
objectPL/Java will supply an empty SQLXML
instance that is writable and not readable via the Connection
method createSQLXML()
. It can be used as an output destination for any of several Java XML APIs, through a selection of set...
methods exactly mirroring the available get...
methods described above.
The API is unusual: except for setString
, which takes a String
parameter and returns void
as a typical “setter” method would, the other setter methods are used for the object they return—an OutputStream
, Writer
, or Result
—which the calling code should then use to add content to the XML value.
Exactly one setter method can be called exactly once on a writable SQLXML
object, which is thereafter not writable. (The not writable state prevents a second call to any setter method; XML content must still be written via the stream or Result
obtained from the one setter that was just called, except in the case of setString
, which populates the value at once.) Content being written to the SQLXML
object is accumulated in PostgreSQL native memory, not the Java heap.
A SQLXML
object, once it has been fully written and closed, can be returned from a Java function, passed as a PreparedStatement
parameter to a nested query, or stored into writable ResultSet
s used for composite function or trigger results. It can be used exactly once in any of those ways, which transfer its ownership back to PostgreSQL, leaving it inaccessible from Java.
SQLXML
object is considered closedA writable SQLXML
object cannot be presented to PostgreSQL before it is closed to confirm that writing is complete. (One written by setString
is considered written, closed, and ready to use immediately.)
When it is written using a stream obtained from setBinaryStream
, setCharacterStream
, or setResult(javax.xml.transform.stream.StreamResult.class)
, it is considered closed when the stream’s close
method is called. This will typically not be done by a Java Transformer
with the stream as its result, and so should be explicitly called after such a transformation completes.
When written using a SAXResult
, it is considered closed when the ContentHandler
’s endDocument
method is called, and when written using a StAXResult
, it is considered closed when the XMLStreamWriter
’s writeEndDocument
method is called. When one of these flavors of Result
is used with a Java Transformer
, these methods will have been called in the normal course of the transformation, so nothing special needs to be done after the transformation completes.
What it means to close
a DOMResult
is murkier. The application code must call the DOMResult
’s setNode
method, passing what will be the root node of the result document. This can be done before or after (or while) child nodes and content are added to that node. However, to avoid undefined behavior, application code must make no further modification to that DOM tree after the SQLXML
object has been presented to PostgreSQL (whether via a PreparedStatement
set
method, ResultSet
update
method, SQLOutput
write
method, or returned as the function result).
Result
object as a Transformer
resultClasses that extend javax.xml.transform.Transformer
will generally accept any flavor of Result
object and select the right API to write the transformation result to it. There is often no need to care which Result
flavor to provide, so it is common to call setResult(null)
to let the SQLXML
implementation itself choose a flavor based on implementation-specific efficiency considerations.
In the case of a DOMResult
, if the Result
object is simply passed to a Transformer
without calling setNode
first, the Transformer
itself will put an empty Document
node there, which is then populated with the transformation results.
A Document
node, however, enforces conformance to the strict rules of XML(DOCUMENT)
form (described below). If the content to be written will conform only to the looser rules of XML(CONTENT)
form, application code should call setNode
supplying an empty DocumentFragment
node, before passing the Result
object to a Transformer
.
The flavor of Result
returned by setResult(null)
will never (in PL/Java) be DOMResult
.
SQLXML
object as a written oneThe general rule that only a writable instance (that has been written and closed) can be used as a function result, or passed into a nested query, admits one exception, allowing a readable instance that Java code has obtained but not read. That makes it simple for Java code to obtain an SQLXML
instance passed in as a parameter, or from a query, and use it directly as a result or a nested-query parameter. Any one instance can be used this way no more than once.
XML(DOCUMENT)
and XML(CONTENT)
An XML value in SQL can have the type XML(DOCUMENT)
or XML(CONTENT)
(as those are defined in the ISO SQL standard, 2006 and later), which PostgreSQL does not currently treat as distinguishable types. The DOCUMENT
form must have exactly one root element, may have a document-type declaration (DTD), and has strict limits on where other constructs (other than comments and processing instructions) can occur. A value in CONTENT
form may have no root element, or more than one element at top level, and other constructs such as character data outside of a root element where DOCUMENT
form would not allow them.
Java code using a readable SQLXML
instance as input should be prepared to encounter either form (unless it has out-of-band knowledge of which form will be supplied). If it requests a DOMSource
, getNode()
will return a Document
node, if the value met all the requirements for DOCUMENT
, or a DocumentFragment
node, if it was parsable as CONTENT
. Java code requesting a SAXSource
or StAXSource
should be prepared to handle a sequence of constructs that might not be encountered when parsing a strictly conforming DOCUMENT
. Java code that requests an InputStream
, Reader
, String
, or StreamSource
will be on its own to parse the data in whichever form appears.
In DOCUMENT
form, any whitespace outside of the single root element is considered markup, not character data. When the value is parsable as DOCUMENT
, and read through PL/Java’s SAXSource
or StAXSource
, no whitespace that occurs outside of the root element will be reported to the application. PL/Java’s DOMSource
will present a Document
node with no whitespace text-node children outside of the root element.
If the value parses as CONTENT
, PL/Java’s DOMSource
will present a DocumentFragment
node with all character data, including whitespace, preserved. The streaming operation of the SAXSource
and StAXSource
is more complicated, and lossy for whitespace (only if it occurs outside of any element) ahead of the first parse event that would not be possible in DOCUMENT
form. All whitespace beyond that point is preserved.
Java code using a writable SQLXML instance to produce a result may write either DOCUMENT
or CONTENT
form. If using DOMResult
, it must supply a DocumentFragment
node to produce a CONTENT
result, as a Document
node will enforce the DOCUMENT
requirements.
SQLXML
object has transaction lifetimeThe JDBC spec provides that an SQLXML
instance is “valid for the duration of the transaction in which it was created.” One PL/Java function can hold an SQLXML
instance (in a static or session variable or data structure), and other PL/Java functions called later in the same transaction can continue reading from or writing to it. If the transaction has committed or rolled back, those operations will generate an exception.
Once a writable SQLXML
object, or an unread, readable one, has been presented to PostgreSQL as the result of a PL/Java function or through a PreparedStatement
/ResultSet
/SQLOutput
setter method, it is no longer accessible in Java.
During a transaction, resources held by a SQLXML
object are reclaimed as soon as a readable one has been fully read, or a writable one has been presented to PostgreSQL and PostgreSQL is done with it. If application code holds a readable SQLXML
object that it determines it will not read, or a writable one it will not present to PostgreSQL, it can call the free
method to allow the resources to be reclaimed sooner than the transaction’s end.
PostgreSQL can represent large XML values in “TOASTed” form, which may be in memory but compressed (XML typically compresses to a small fraction of its serialized size), or may be a small pointer to a location in storage. A readable SQLXML
instance over a TOASTed value will not be detoasted until Java code actually begins to read it, so the memory footprint of an instance being held but not yet read is kept low.
Some of the methods by which a writable instance can be written are not XML-specific APIs, but allow arbitrary content to be written (as a String
, Writer
, or OutputStream
). When written by those methods, type safety is upheld by verifying that the written content can be successfully reparsed, accepting either DOCUMENT
or CONTENT
form.
It remains possible to declare the Java type String
for function parameters and returns of XML type, and to retrieve and supply String
for ResultSet
columns and PreparedStatement
parameters of XML type. This legacy mapping from String
to XML uses PostgreSQL’s xml_in
function to verify the form of a String
from Java. That function may reject some valid values if the server configuration variable xmloption
is not first set to DOCUMENT
or CONTENT
to match the type of the value.
Java’s XML APIs support validation using a choice of schema languages; support for XML Schema 1.0 is included in the Java runtime, and implementations of others can be placed on the class path.
A schema
method is available through the “Extended API to configure XML parsers” described below, but will only work on a SAXSource
or DOMSource
(or a StreamResult
, which uses a SAX parser to validate the stream written). Other limitations are described under “known limitations” below.
More flexibly, javax.xml.validation.Validator
or javax.xml.validation.ValidatorHandler
can be used in more situations and with fewer limitations.
In symmetry to using Java String
for SQL XML types, PL/Java allows the Java SQLXML
type to be used with PostgreSQL data of type text
. This allows full use of the Java XML APIs even in PostgreSQL instances built without XML support. All of the SQLXML
behaviors described above also apply in this usage.
If a readable SQLXML
instance obtained from a text
value is directly used to set or return a value of PostgreSQL’s XML type, the XML-ness of the content is verified.
java.sql.SQLXML
APIRetrieving or verifying the XML content in an SQLXML
object can involve applying an XML parser. The full XML specification includes features that can require an XML parser to retrieve external resources or consume unexpected amounts of memory. The full feature support may be an asset in an environment where the XML content will always be from a known, trusted source, or a liability if less is known about the XML content being processed.
The Open Web Application Security Project (OWASP) advocates for the default use of settings that strictly limit the related features of Java XML parsers, as outlined in a “cheat sheet” the organization publishes.
However, the recommended defaults really are severely restrictive (for example, disabling document-type declarations by default will cause PL/Java’s SQLXML
implementation to reject all XML values that contain DTDs). Therefore, there must be a simple and clear way for code to selectively adjust the settings, or adopting the strictest settings by default would pose an unacceptable burden to developers.
The traditional Java way to adjust the XML parser is overwhelmingly fiddly, involving setFeature
or setProperty
calls that identify the feature to be set by passing an arcane URI that might be found in the documentation, or the cheat sheet, or cargo-culted from some other code base. In some cases, the streamlined SQLXML
API conceals the steps where adjustments would have to be applied. With no better way to adjust the parser, it would be an unrealistic developer burden to adopt the restrictive defaults and expect the developer to relax them.
Therefore, PL/Java has an extension API documented at the org.postgresql.pljava.Adjusting.XML class. With the API, it is possible to obtain a Source
object from an SQLXML
instance sqx
in either the standard or extended way shown in this example for a SAXSource
:
SAXSource src = sqx.getSource(SAXSource.class); // OR SAXSource src = sqx.getSource(Adjusting.XML.SAXSource.class) .allowDTD(true).get();
The first form would obtain a SAXSource
configured with the restrictive, OWASP-recommended defaults, which would reject any content with a DTD. The second form would obtain a SAXSource
configured to allow a DTD in the content, with other parser features left at the restrictive defaults.
Additional security-related adjustments have appeared in various Java releases, and are described in the Java API for XML Processing Security Guide. They include a number of configurable limits on maximum sizes and nesting depths, and limits to the set of protocols allowable for fetching external resources. Corresponding methods are provided in PL/Java’s API. Also see “known limitations” below.
EntityResolver
or Schema
Methods are provided to set an EntityResolver
that controls how a SAX or DOM parser resolves references to external entities, or a Schema
by which a SAX or DOM parser can validate content while parsing. Corresponding methods are supplied in PL/Java’s API, but are implemented only when operating on a SAXSource
or DOMSource
(or StreamResult
, affecting its validation of the content written).
For StAX, control of resolution is done with a slightly different class, XMLResolver
, which can be set on a StAX parser as an ordinary property; this can be done with PL/Java’s setFirstMatchingProperty
method.
A StAX parser cannot have a Schema
directly assigned, but can be used with a javax.xml.validation.Validator
.
Complete details can be found in the API documentation.
When running on Java 9 or later, a local XML Catalog can be set up to efficiently and securely resolve what would otherwise be external resource references. The registration of a Catalog on a Java 9 or later parser involves only existing methods for setting features/properties, as described in the Catalog API documentation, and can be done with the setFirstSupportedFeature
and setFirstSupportedProperty
methods in PL/Java’s Adjusting
API.
SQLXML
instanceWhen a SQLXML
instance is returned from a PL/Java function, or passed in to a PL/Java ResultSet
or PreparedStatement
, it is used directly if it is an instance of PL/Java’s internal implementation.
However, a PL/Java function might reasonably use another JDBC driver and obtain a SQLXML
instance from a connection to some other database. If such a ‘foreign’ SQLXML
object is returned from a function, or passed to a PL/Java ResultSet
or PreparedStatement
, its content must first be copied to a new instance created by PL/Java’s driver. This happens transparently (but implies that the ‘foreign’ instance must be in readable state at the time, and afterward will not be).
The transparent copy is made by passing null
as sourceClass
to the foreign object’s getSource
method, so the foreign object is in control of the type of Source
it will return. PL/Java will copy from a StreamSource
, SAXSource
, StAXSource
, or DOMSource
. In the case of a StreamSource
, an XML parser will be involved, either to verify that the stream is XML, or to parse and reserialize it if necessary to adapt its encoding to the server’s. The parser used by default will have the default, restrictive settings.
To allow adjustment of those settings, the copying operation can be invoked explicitly through the Adjusting.XML.SourceResult
class. For example, when sx is a ‘foreign’ SQLXML
object, the transparent operation
return sx;
is equivalent to
return conn.createSQLXML().setResult(Adjusting.XML.SourceResult.class) .set(sx.getSource(null)).get().getSQLXML();
where conn is the PL/Java JDBC connection named by jdbc:default:connection
. To adjust the parser settings, as usual, adjusting methods can be chained after the set
and before the get
. The explicit form also allows passing a sourceClass
other than null
to the foreign object’s getSource
method, if there is a reason not to let the foreign object choose the type of Source
to return.
SQLXML
views of non-XML dataThere are the beginnings of a feature supporting XML views of non-XML data, so that some data types that are not XML, but are similarly tree-structured, can be manipulated in Java using Java’s extensive support for XML.
StAX
supportPL/Java’s StAXSource
supplies an XMLStreamReader
that only supports the expected usage pattern:
while ( streamReader.hasNext() ) { streamReader.next(); /* methods that query state of the current parse event */ }
It would be unexpected to reorder that pattern so that queries of the current event occur after hasNext
but before next
, and may produce IllegalStateException
s or incorrect results from a StAXSource
supplied by PL/Java.
StAX
with TrAX
(Java’s transformation API)The javax.xml.transform
APIs are required to accept any of a specified four types of Source
and Result
: StreamSource
, DOMSource
, SAXSource
, or StAXSource
(and their Result
counterparts). However, StAX
was a later addition to the family. While TrAX
is a mature and reliable transformation API, and StAX
is well suited for direct use in new code that will parse or generate XML, the handful of internal bridge classes that were added to the Java runtime for StAX
and TrAX
interoperation are not dependable, especially when handling XML(CONTENT)
. When supplying a Source
or Result
to a Transformer
, a variant other than StAX
should be chosen whenever possible, whether PL/Java’s or any other implementation.
For convenience, the SQLXML
API allows passing a null value to getSource
or setResult
, allowing the implementation to choose the type of Source
or Result
to supply. PL/Java’s implementation will never supply a StAX
variant when not explicitly requested.
The processing done “behind the curtain” to be able to handle XML(CONTENT)
and XML(DOCUMENT)
form, when the form is not known in advance, can have some visible effects when combined with the newer security limit adjustments, or schema
set on a SAX or DOM parser. For example, a very tight setting of maxElementDepth
may reveal that elements in the input are nested one level deeper than expected, or a very tight maxXMLNameLimit
may reject a document whose expected names are all shorter. Schema validation for some schemas and schema languages may likewise report an unexpected element at the root of the document.
Issues with maxElementDepth
or maxXMLNameLimit
can be avoided by using generous settings chosen to limit extreme resource consumption rather than trying to set them as tightly as possible.
Problems with schema validation when assigning a Schema
directly to the SAX or DOM parser can be alleviated by using a javax.xml.validation.Validator
or ValidatorHandler
instead, layered over PL/Java’s parser, where it will see the expected view of the content.