In the source directory org/postgresql/pljava/example/saxon
is example code
for XML processing functions similar to XMLCAST
, XMLEXISTS
, XMLQUERY
, and
XMLTABLE
, but using the XQuery language as the SQL/XML standard actually
specifies (in contrast to similar functions built into PostgreSQL, which support
only XPath, and XPath 1.0, at that).
The example also implements the four new string functions and one predicate
added in SQL:2006 for regular expression processing using the standardized
XQuery regular expression syntax: LIKE_REGEX
, OCCURRENCES_REGEX
, POSITION_REGEX
,
SUBSTRING_REGEX
, and TRANSLATE_REGEX
.
There is also, for completeness, an implementation of XMLTEXT
, which is
trivial and does not require an XQuery library at all, but is missing from
core PostgreSQL and easy to implement here.
This code is not built by default, because it pulls in the sizeable Saxon-HE library from Saxonica.
To include these optional functions when building the examples,
add -Psaxon-examples
to the mvn
command line.
The functions are presented as examples, not as a full implementation; for one thing, there is no test suite included to verify their conformance. Nevertheless, they are intended to be substantially usable subject to the limits described here, and testing and reports of shortcomings are welcome.
In addition to the open-source and freely-licensed Saxon-HE, the Saxon library is available in two paid editions, which implement more of the features of XQuery 3.1 than Saxon-HE does. It should be possible to drop either of those jar files in place of Saxon-HE (with a working license key) if features are needed beyond what Saxon-HE provides. Its developers publish a matrix identifying the features provided in each edition.
Wherever ISO SQL/XML requires one of these functions to accept an XQuery expression, in fact an XQuery main module will be accepted. Therefore, a query can be preceded by a prolog that declares namespaces, options, local variables and functions, etc. This may simplify porting queries from Oracle, which permits the same extension.
The simplest installation method is to use sqlj.install_jar
twice, once to
install (perhaps with the name saxon
) the Saxon-HE jar that Maven will have
downloaded during the build, and once to install the PL/Java examples jar in the
usual way (perhaps with the name examples
and with deploy => true
). The
Saxon jar will be found in your Maven repository (likely ~/.m2/repository/
unless you have directed it elsewhere) below the path net/sf/saxon
.
The function sqlj.set_classpath
is used to make installed jars available.
After installing the Saxon jar, if you installed it with the name saxon
,
add it to the class path:
SELECT sqlj.set_classpath('public', 'saxon');
This must be done before installing the examples
jar, so that its dependencies
on Saxon can be resolved.
After both jars are installed, make sure they are both on the classpath. If
the examples jar was installed with the name examples
:
SELECT sqlj.set_classpath('public', 'examples:saxon');
Note: an alternative, shorter procedure is to use
SET check_function_bodies TO off;
before loading the examples jar.
With the checking turned off, the jar can be installed even if the Saxon jar
has not been installed yet, or has not been added to the class path, so the
order of steps is less critical. Naturally, the example functions that use Saxon
will not work until it has been installed and added to the class path.
SET check_function_bodies TO off;
simply arranges that missing dependency
errors will be reported later when the functions are used, rather than when
they are created.
The XML querying and XMLTABLE
functions built into PostgreSQL get special
treatment from the SQL parser to give them syntax that is more SQLish than
an ordinary function call.
The functions provided here have to work as ordinary SQL user-defined functions, so calls to them can look a bit more verbose when written out in SQL, but in a way that can be recognized as a straightforward rewriting of the SQLish standard syntax.
For example, suppose there is a table catalog_as_xml
with a single row
whose x
column is a (respectably sized) XML document recording the stuff
in pg_catalog
. It could be created like this:
CREATE TABLE catalog_as_xml(x) AS
SELECT schema_to_xml('pg_catalog', false, true, '');
XMLQUERY
-like functionIn the syntax of the SQL/XML standard, here is a query that would return an XML
element representing the declaration of the function with the name
numeric_avg
(if PostgreSQL really had the standard XMLQUERY
function built
in):
SELECT XMLQUERY('/pg_catalog/pg_proc[proname eq $FUNCNAME]'
PASSING BY VALUE x, 'numeric_avg' AS FUNCNAME
RETURNING CONTENT EMPTY ON EMPTY)
FROM catalog_as_xml;
It binds the ‘context item’ of the query to x
, and the FUNCNAME
parameter to the given value, then evaluates the query and returns XML
“CONTENT” (a tree structure with a document node at the root, but not
necessarily meeting all the requirements of an XML “DOCUMENT”). It can be
rewritten as this call to the xq_ret_content
method provided here:
SELECT javatest.xq_ret_content('/pg_catalog/pg_proc[proname eq $FUNCNAME]',
PASSING => p, nullOnEmpty => false)
FROM catalog_as_xml,
LATERAL (SELECT x AS ".", 'numeric_avg' AS "FUNCNAME") AS p;
In the rewritten form, the type of value returned is determined by which
function is called, and the parameters to pass to the query are moved out to
a separate SELECT
that supplies their values, types, and names (with
the context item now given the name “.”) and is passed by its alias into the
query function.
An alert reader may notice that the example above includes a named parameter,
FUNCNAME
, and it is spelled in uppercase in the XQuery expression that uses
it, and is spelled in uppercase and quoted in the sub-SELECT
that supplies
it. The reason is an unconditional toUppercase()
in PL/Java's internal JDBC
driver, which is not anything the JDBC standard requires, but has been there
in PL/Java since 2005. For now, therefore, no matter how a parameter name is
spelled in the sub-SELECT
, it must appear in uppercase in the XQuery
expression using it, or it will not be recognized. A future PL/Java release
is highly likely to stop forcibly uppercasing the names. At that time, any code
relying on the uppercasing will break. Therefore, it is wisest, until then, to
call this function with all parameter names spelled in uppercase both in the
SQL and in the XQuery text, and on the SQL side that requires quoting the name
to avoid the conventional lowercasing done by PostgreSQL.
In the standard, parameters and results (of XML types) can be passed
BY VALUE
or BY REF
, where the latter means that the same
nodes will retain their XQuery node identities over calls (note that this is
a meaning unrelated to what “by value” and “by reference” usually mean in
PostgreSQL's documentation). PostgreSQL's implementation of the XML type
provides no way for BY REF
semantics to be implemented, so everything
happening here happens BY VALUE
implicitly, and does not need to be
specified.
XMLEXISTS
-like predicateIn the syntax of the SQL/XML standard, here is a query that would return a
boolean result indicating whether an SQL function named numeric_avg
is declared (if PostgreSQL really had the standard XMLEXISTS
function built
in):
SELECT XMLEXISTS('/pg_catalog/pg_proc[proname eq $FUNCNAME]'
PASSING BY VALUE x, 'numeric_avg' AS FUNCNAME)
FROM catalog_as_xml;
It can be rewritten as this call to the xmlexists
method provided here:
SELECT "xmlexists"('/pg_catalog/pg_proc[proname eq $FUNCNAME]',
PASSING => p)
FROM catalog_as_xml,
LATERAL (SELECT x AS ".", 'numeric_avg' AS "FUNCNAME") AS p;
As for the XMLQUERY
-like function above, , the context item and a parameter
are supplied by a separate query producing the row p
that is given as the
PASSING
argument to "xmlexists"
. The parameter name is capitalized for the
reasons explained above for the XMLQUERY
-like function.
XMLTABLE
-like functionThe function xmltable
here implements (much of) the
standard function of the same name. Because it is the same name, it has to
be either schema-qualified or double-quoted in a call to avoid confusion
with the reserved word. A rewritten form of the
first example in the PostgreSQL manual could be:
SELECT xmltable.*
FROM
xmldata,
LATERAL (SELECT data AS ".", 'not specified'::text AS "DPREMIER") AS p,
"xmltable"('//ROWS/ROW', PASSING => p, COLUMNS => ARRAY[
'data(@id)', null, 'COUNTRY_NAME',
'COUNTRY_ID', 'SIZE[@unit eq "sq_km"]',
'concat(SIZE[@unit ne "sq_km"], " ", SIZE[@unit ne "sq_km"]/@unit)',
'let $e := PREMIER_NAME
return if ( empty($e) ) then $DPREMIER else $e'
]) AS (
id int, ordinality int, "COUNTRY_NAME" text, country_id text,
size_sq_km float, size_other text, premier_name text
);
Again, the context item and a parameter (here the desired default value for
PREMIER
, passed in as the parameter DPREMIER
) are supplied by a separate
query producing the row p
that is given as "xmltable"
's PASSING
argument.
The result column names and types are now specified in the AS
list following
the function call, and the column XML Query expressions are supplied as the
COLUMNS
array. The array must have length equal to the result column AS
list (there is no defaulting an omitted column expression to an element test
using the column's name, as there is in the standard function). The array is
allowed to have one null element, marking that column FOR ORDINALITY
.
The parameter being passed into the XQuery expressions here, DPREMIER
, is
spelled in uppercase (and, on the SQL side, quoted), for the reasons explained
above for the XMLQUERY
-like function.
In the first column expression, @id
is wrapped in data()
to return the value
of the attribute, as @id
by itself would be a bare XML attribute node, outside
of any XML element. Many implementations (including the XPath-based
pseudo-XMLTABLE built in to PostgreSQL) will allow a bare attribute node in a
column expression result, and assume the attribute's value is wanted, but a
strict interpretation of the spec appears to require raising err:XPTY0004
in
that case. So, just use data()
to wrap any attribute node being returned in
a column expression.
More on that issue and the spec can be found at “About bare attribute nodes” in the code comments.
XMLCAST
-like functionThe ISO SQL XMLCAST
is used to convert XML content into a value of an SQL
data type, or an SQL value to an XML value, following the same
precisely-specified conversion rules that are used for the parameters and
results of the XMLQUERY
and XMLTABLE
functions. It can also convert from
one XML type to another, though in PostgreSQL, which has just one XML type, the
conversion is trivial. In a DBMS with support for the full set of XML types
such as XML(CONTENT)
, XML(DOCUMENT)
, and XML(SEQUENCE)
, the rules for
casting one to another are more interesting.
This ordinary-function implementation of XMLCAST
is used by rewriting an
SQL standard form like
SELECT XMLCAST(value AS wantedtype)
into a form like
SELECT result FROM (select value) as v, "xmlcast"(v) AS (result wantedtype)
where either: value is of xml
type, wantedtype is xml
, or both; in
other words, the only case XMLCAST
does not handle is where neither the input
nor result is of xml
type. Because casting XML to XML is not exciting in
PostgreSQL, the most useful cases are XML to another SQL type, or the reverse.
The SQL standard specifies a string predicate, LIKE_REGEX
, for testing a
string against an XQuery regular expression (an extension of
XML Schema regular expression syntax), and four string functions also
based on XQuery regular expressions: OCCURRENCES_REGEX
, POSITION_REGEX
,
SUBSTRING_REGEX
, and TRANSLATE_REGEX
.
The “flags” parameter to any of these can include any of the
XQuery regular expression flags s
, m
, i
, x
, and q
.
As with the XMLQUERY
and XMLTABLE
functions, some straightforward rewriting
is needed from the SQL-standard syntax into calls of these ordinary functions.
In the current implementation, all of these functions recognize newlines in the
way specified by XQuery, not the modified way specified for ISO SQL, as further
explained below after the function descriptions. To leave a clear path to a
full implementation, these versions all accept an additional parameter
w3cNewlines
, which must always be present, for now, as w3cNewlines => true
.
Specifying false
, or omitting this parameter, will mean the ISO SQL newline
treatment is wanted, and will be rejected as an unsupported feature
in this implementation.
To avoid clutter, the w3cNewlines => true
is not shown in the examples below.
LIKE_REGEX
A predicate that is true
if a string matches the regular expression.
The standard syntaxes
value LIKE_REGEX pattern
value LIKE_REGEX pattern FLAG flags
can be rewritten to
like_regex(value, pattern)
like_regex(value, pattern, flags)
like_regex(value, pattern, flag => flags)
OCCURRENCES_REGEX
A function to count the occurrences of a pattern in a string. The count can
start from a specific position in the string (the first character has
position 1), and the position can be counted using Unicode characters, or using
octets of the string's encoded form. For now, only USING CHARACTERS
is
implemented, which can be indicated by passing usingOctets => false
or
simply omitting it, as false
is the default. Standard syntax examples like
OCCURRENCES_REGEX(pattern IN str)
OCCURRENCES_REGEX(pattern FLAG flags IN str)
OCCURRENCES_REGEX(pattern IN str FROM position USING CHARACTERS)
can be rewritten to
occurrences_regex(pattern, str)
occurrences_regex(pattern, flag => flags, "in" => str)
occurrences_regex(pattern, str, "from" => position)
POSITION_REGEX
A function to return the position of a regular expression match in a string,
which can optionally return the position of a specific occurrence of the match
(the first, if not specified), or of a particular capturing group within the
desired match. The position reported can be of the first character of the match
of interest (START
), or of the first character following the match (AFTER
).
As for OCCURRENCES_REGEX
, all positions can be expressed USING CHARACTERS
or
USING OCTETS
, but only the default USING CHARACTERS
is implemented here.
Standard syntax examples like
POSITION_REGEX(START pattern IN str)
POSITION_REGEX(AFTER pattern IN str)
POSITION_REGEX(START pattern IN str OCCURRENCE n)
POSITION_REGEX(START pattern IN str OCCURRENCE n GROUP m)
POSITION_REGEX(START pattern IN str FROM pos OCCURRENCE n GROUP m)
can be rewritten to
position_regex(pattern, str)
position_regex(pattern, str, after => true)
position_regex(pattern, str, occurrence => n)
position_regex(pattern, str, occurrence => n, "group" => m)
position_regex(pattern, str, "from" => pos, occurrence => n, "group" => m)
The result is always relative to the start of the string, not the starting
position. That is, POSITION_REGEX('d' IN 'abcdef' FROM 3)
is 4, not 2.
SUBSTRING_REGEX
Returns the substring that matched the regular expression, or a specific occurrence of the expression, or a specific capturing group within the desired occurrence. Standard syntax examples like
SUBSTRING_REGEX(pattern IN str)
SUBSTRING_REGEX(pattern FLAG flags IN str)
SUBSTRING_REGEX(pattern IN str FROM position)
SUBSTRING_REGEX(pattern IN str OCCURRENCE n GROUP m)
can be rewritten to
substring_regex(pattern, str)
substring_regex(pattern, flag => flags, "in" => str)
substring_regex(pattern, str, "from" => position)
substring_regex(pattern, str, occurrence => n, "group" => m)
TRANSLATE_REGEX
Returns a string built from the input string by replacing one specified
occurrence, or all occurrences, of a matching pattern. The
replacement text can include $0
to include the entire substring
that matched, or $
n for n a digit 1 through 9,
to include what matched a capturing group in the pattern.
The default behavior of replacing all occurrences applies when
occurrence
is not specified.
Standard syntax examples like
TRANSLATE_REGEX(pattern IN str WITH repl)
TRANSLATE_REGEX(pattern IN str WITH repl OCCURRENCE n)
TRANSLATE_REGEX(pattern FLAG flags IN str WITH repl)
TRANSLATE_REGEX(pattern IN str WITH repl FROM position)
can be rewritten to
translate_regex(pattern, str, "with" => repl)
translate_regex(pattern, str, "with" => repl, occurrence => n)
translate_regex(pattern, flag => flags, "in" => str, "with" => repl)
translate_regex(pattern, str, "with" => repl, "from" => position)
A standard XQuery library provides regular expressions that follow the W3C
XQuery rules for newline recognition, in which the ^
and $
anchors
recognize only the LINE FEED
character, U&'\000a'
, the .
metacharacter
in non-dotall
mode matches anything other than a LINE FEED
or
CARRIAGE RETURN
U&'\000d'
, the \s
multicharacter escape matches only
those two characters plus space and horizontal tab, and \S
is the exact
complement of \s
.
The ISO SQL specification for these XQuery regular expression features
contains a modification of those rules to conform instead to
Unicode Technical Standard 18 rule 1.6, in which several more
Unicode characters are recognized as line boundaries, plus the two-character
sequence CARRIAGE RETURN
LINE FEED
(which counts only as one line boundary).
The modified meaning of \S
becomes “any single character that is not matched
by a single character that matches” \s
(emphasis added), leaving it no
longer the exact complement of \s
.
It is difficult to implement the ISO SQL behavior over a standard XQuery
library, so this implementation, for now, does not do so. All of these
functions implement the standard W3C XQuery behavior, which can be “requested”
by passing w3cNewlines => true
. Without w3cNewlines => true
,
the call will be interpreted as intending the ISO SQL behavior, and an
SQLFeatureNotSupportedException
(SQLSTATE 0A000
) will be raised.
The Saxon XQuery library, implemented in Java, offers the ability to use Java
regular expressions rather than XQuery ones, by passing a flag argument
that ends with ;j
(an invalid flag string per the XQuery spec). This should
not be used in code that intends to be standards-conformant or to run on another
DBMS or XQuery library, but can be useful in some cases for features that Java
regular expressions offer (such as lookahead and lookbehind predicates) that
XQuery regular expressions do not.
This example implementation of TRANSLATE_REGEX
will detect when a Java
expression rather than an XQuery one is being used, and will then permit
replacement of a zero-length match, rather than raising error 2201U
as the
standard requires. As Java regular expressions include zero-width lookahead and
lookbehind operators, a Java regex can usefully locate zero-width sites for
replacements to be applied.
There are still subtleties involved. A site that is identified by
negative lookahead or lookbehind operators ((?!)
and (?<!)
) will be
replaced as expected, but if the positive forms were used ((?=)
and (?<=)
),
the replacement will not occur. This example might be expected to insert !
for the empty string between o
and b
, but does not:
SELECT translate_regex('(?<=o)(?=b)', 'foobar', "with" => '!',
flag => ';j', w3cNewlines => true);
translate_regex
-----------------
foobar
The reason is that the specification of TRANSLATE_REGEX
is as if the
matched substring, here an empty string, is matched again in isolation
against the original regex to do the replacement, and that empty string no
longer has the o
and b
that the original lookbehind and lookahead matched.
It can be made to work by adding an alternative that matches a truly empty
string (\A\z
in Java syntax):
SELECT translate_regex('(?<=o)(?=b)|\A\z', 'foobar', "with" => '!',
flag => ';j', w3cNewlines => true);
translate_regex
-----------------
foo!bar
That workaround would also cause the replacement to happen if the input string is completely empty to start with, which might not be what's wanted.
The desugared syntax shown above can be used in PostgreSQL versions as old
as 9.5. In 9.4 and 9.3, the same syntax, but with =>
replaced by :=
for
the named parameters, can be used. The functions remain usable in still
earlier PostgreSQL versions, but with increasingly convoluted SQL syntax
needed to call them; before 9.3, for example, there was no LATERAL
in a
SELECT
, and a function could not refer to earlier FROM
items. Before 9.0,
named-parameter notation can't be used in function calls. Before 8.4, the
functions would have to be declared without their DEFAULT
clauses and the
IntervalStyle
settings, and would not work with PostgreSQL interval values.
Saxon is a large library, and benefits greatly from precompilation into a memory-mappable persistent cache, using the application class data sharing feature in Oracle Java or in OpenJDK with Hotspot, or the class sharing feature in OpenJDK with OpenJ9.
The OpenJ9 feature is simpler to set up. Because it can cache classes straight from PL/Java installed jars, the setup can be done exactly as described above, and the OpenJ9 class sharing, if enabled, will just work. OpenJ9 class-sharing setup instructions are here.
The Hotspot AppCDS
feature is more work to set up, and can only cache classes
on the JVM system classpath, so the Saxon jar would have to be installed on
the filesystem and named in pljava.classpath
instead of simply installing it
in PL/Java. It also needs to be stripped of its jarsigner
metadata, which the
Hotspot AppCDS
can't handle. Hotspot AppCDS
setup
general instructions are here, and specific details for setting up
this example for AppCDS
can be found on the
performance-tuning wiki page in the section devoted to it.
A comparison shown on that performance-tuning page appears to give Hotspot a significant advantage for a Saxon-heavy workload, so the more complex Hotspot setup may remain worthwhile as long as that comparison holds.
The AppCDS
feature in Oracle Java is still (when last checked) a commercial
feature, not to be used in production without a specific license from Oracle.
OpenJDK, as of Java 10, ships Hotspot with the same feature included, without
the encumbrance.