Class Lexicals


public abstract class Lexicals
extends Object
A few useful SQL lexical definitions supplied as Pattern objects. The idea is not to go overboard and reimplement an SQL lexer, but to capture in one place the rules for those bits of SQL snippets that are likely to be human-supplied in annotations and need to be checked for correctness when emitted into deployment descriptors. Identifiers, for a start. Supplied in the API module so they are available to javac to compile and generate DDR when the rest of PL/Java is not necessarily present. Of course backend code such as SQLDeploymentDescriptor can also refer to these.
  • Field Details


      public static final Pattern ISO_REGULAR_IDENTIFIER_START
      Allowed as the first character of a regular identifier by ISO.

      public static final Pattern ISO_REGULAR_IDENTIFIER_PART
      Allowed as any non-first character of a regular identifier by ISO.

      public static final Pattern ISO_REGULAR_IDENTIFIER
      A complete regular identifier as allowed by ISO.

      public static final Pattern ISO_REGULAR_IDENTIFIER_CAPTURING
      A complete ISO regular identifier in a single capturing group.

      public static final Pattern ISO_DELIMITED_IDENTIFIER
      A complete delimited identifier as allowed by ISO. As it happens, this is also the form PostgreSQL uses for elements of a LIST_QUOTE-typed GUC.

      public static final Pattern ISO_DELIMITED_IDENTIFIER_CAPTURING
      An ISO delimited identifier with a single capturing group that captures the content (which still needs to have "" replaced with " throughout). The capturing group is named xd.

      public static final Pattern ISO_UNICODE_ESCAPE_SPECIFIER
      The escape-specifier part of a Unicode delimited identifier or string. The escape character itself is in the capturing group named uec. The group can be absent, in which case \ should be used as the uec.

      What makes this implementable as a regular expression is that what precedes/follows UESCAPE is restricted to simple white space, not the more general separator (which can include nesting comments and therefore isn't a regular language). PostgreSQL enforces the same restriction, and a bit of language lawyering does confirm it's what ISO entails. ISO says "any <token> may be followed by a <separator>", and enumerates the expansions of <token>. While an entire <Unicode character string literal> or <Unicode delimited identifier> is a <token>, the constituent pieces of one, like UESCAPE here, are not.


      public static final Pattern ISO_UNICODE_IDENTIFIER
      A Unicode delimited identifier. The body is in capturing group xui and the escape character in group uec. The body still needs to have "" replaced with ", and Unicode escape values decoded and replaced, and then it has to be verified to be no longer than 128 codepoints.
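The final length check counts code points rather than Java chars, so a supplementary character stored as a surrogate pair counts once. A minimal sketch, assuming the body has already been unquoted and decoded; the class and method names are hypothetical and not part of Lexicals:

```java
class IdentifierLengthSketch {
    /** The limit stated above for a Unicode delimited identifier. */
    static final int MAX_CODEPOINTS = 128;

    /** Returns the argument unchanged, or throws if it has too many code points. */
    static String requireLength(String decoded) {
        // codePointCount treats each surrogate pair as one code point.
        int n = decoded.codePointCount(0, decoded.length());
        if (n > MAX_CODEPOINTS)
            throw new IllegalArgumentException(
                "identifier has " + n + " codepoints; the limit is " + MAX_CODEPOINTS);
        return decoded;
    }
}
```

Comparing String.length() against the limit instead would wrongly reject an identifier of 128 supplementary characters, whose length in chars is 256.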

      public static final String ISO_UNICODE_REPLACER
      A compilable pattern to match a Unicode escape value. A match should have one of three named capturing groups. If cev, substitute the uec itself. If u4d or u6d, substitute the codepoint represented by the hex digits. A match with none of those capturing groups indicates an ill-formed string.

      Make a Pattern from this by supplying the right uec, so: Pattern.compile(String.format(ISO_UNICODE_REPLACER, Pattern.quote(uec)));

      See Also:
      Constant Field Values
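The decoding loop implied by the ISO_UNICODE_REPLACER description might look like the sketch below. To stay self-contained it does not use the real constant: REPLACER_STAND_IN is a simplified, hypothetical format string that only imitates the documented group names (cev, u4d, u6d), and the class and method names are likewise assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeSketch {
    // Hypothetical stand-in for ISO_UNICODE_REPLACER (NOT the real constant):
    // the escape character, optionally followed by one well-formed continuation.
    static final String REPLACER_STAND_IN =
        "%1$s(?:(?<cev>%1$s)|(?<u4d>[0-9A-Fa-f]{4})|\\+(?<u6d>[0-9A-Fa-f]{6}))?";

    static String decode(String body, String uec) {
        Pattern p = Pattern.compile(
            String.format(REPLACER_STAND_IN, Pattern.quote(uec)));
        Matcher m = p.matcher(body);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group("cev") != null)
                replacement = uec;                       // doubled escape character
            else if (m.group("u4d") != null)
                replacement = fromHex(m.group("u4d"));   // four hex digits
            else if (m.group("u6d") != null)
                replacement = fromHex(m.group("u6d"));   // '+' and six hex digits
            else                                         // no group set: ill-formed
                throw new IllegalArgumentException(
                    "ill-formed Unicode escape at offset " + m.start());
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Character.toChars rejects values that are not valid code points.
    private static String fromHex(String digits) {
        return new String(Character.toChars(Integer.parseInt(digits, 16)));
    }
}
```

With this stand-in, decode("d\\0061t\\+000061", "\\") yields data. Matcher.quoteReplacement matters here because a replacement that is itself a backslash or dollar sign would otherwise be interpreted by appendReplacement.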

      public static final Pattern PG_REGULAR_IDENTIFIER_START
      Allowed as the first character of a regular identifier by PostgreSQL (PG 7.4 -).

      public static final Pattern PG_REGULAR_IDENTIFIER_PART
      Allowed as any non-first character of a regular identifier by PostgreSQL (PG 7.4 -).

      public static final Pattern PG_REGULAR_IDENTIFIER
      A complete regular identifier as allowed by PostgreSQL (PG 7.4 -).

      public static final Pattern PG_REGULAR_IDENTIFIER_CAPTURING
      A complete PostgreSQL regular identifier in a single capturing group.

      public static final Pattern ISO_AND_PG_REGULAR_IDENTIFIER
      A regular identifier that satisfies both ISO and PostgreSQL rules.

      public static final Pattern ISO_AND_PG_REGULAR_IDENTIFIER_CAPTURING
      A regular identifier that satisfies both ISO and PostgreSQL rules, in a single capturing group named i.

      public static final Pattern ISO_AND_PG_IDENTIFIER_CAPTURING
      Pattern that matches any identifier valid by both ISO and PG rules, with the presence of named capturing groups indicating which kind it is: i for a regular identifier, xd for a delimited identifier (still needing "" replaced with "), or xui (with or without an explicit uec) for a Unicode identifier (still needing "" replaced with " and Unicode escape values decoded).
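A caller can dispatch on whichever named group is present. The sketch below illustrates that with a simplified, hypothetical stand-in pattern (ASCII-only regular identifiers, and none of the finer points of the real ISO and PG rules); only the group names i, xd, xui, and uec come from the description above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdentifierKindSketch {
    // Hypothetical stand-in, NOT the real ISO_AND_PG_IDENTIFIER_CAPTURING.
    static final Pattern IDENT = Pattern.compile(
        "[Uu]&\"(?<xui>(?:[^\"]|\"\")++)\"" +
        "(?:\\s*+[Uu][Ee][Ss][Cc][Aa][Pp][Ee]\\s*+'(?<uec>[^\\s\\p{XDigit}+'\"])')?" +
        "|\"(?<xd>(?:[^\"]|\"\")++)\"" +
        "|(?<i>[A-Za-z_][A-Za-z0-9_]*+)");

    /** Returns {kind, body} or, for a Unicode identifier, {kind, body, uec}. */
    static String[] classify(String token) {
        Matcher m = IDENT.matcher(token);
        if (!m.matches())
            throw new IllegalArgumentException("not an identifier: " + token);
        if (m.group("i") != null)
            return new String[] { "i", m.group("i") };
        if (m.group("xd") != null)
            return new String[] { "xd", m.group("xd").replace("\"\"", "\"") };
        String uec = m.group("uec");
        if (uec == null)
            uec = "\\";            // absent group: backslash is the default uec
        // The body still contains undecoded Unicode escape values at this point.
        return new String[] { "xui", m.group("xui").replace("\"\"", "\""), uec };
    }
}
```

The Unicode case deliberately stops short of decoding; that step needs a pattern built from the uec, as described for ISO_UNICODE_REPLACER.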

      public static final Pattern ISO_PG_JAVA_IDENTIFIER
      An identifier by ISO SQL, PostgreSQL, and Java (not SQL at all) rules. (Not called REGULAR because Java allows no other form of identifier.) This restrictive form is the safest for identifiers being generated into a deployment descriptor file that an old version of PL/Java might load, because through 1.4.3 PL/Java used the Java identifier rules to recognize identifiers in deployment descriptors.

      public static final Pattern PG_OPERATOR
      An operator by PostgreSQL rules. The length limit (NAMEDATALEN - 1) is not applied here. The match will not include a - followed by - or a / followed by *, and a multicharacter match will not end with + or - unless it also contains one of ~ ! @ # % ^ & | ` ?.
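The rules above can be restated as a plain predicate. This sketch is not how PG_OPERATOR is implemented: it validates a whole candidate string rather than matching the longest valid prefix, it also ignores the length limit, and its class and method names are hypothetical. The set of operator characters is the one listed in the PostgreSQL documentation.

```java
class PgOperatorSketch {
    // Characters PostgreSQL allows in an operator name.
    static final String OP_CHARS = "+-*/<>=~!@#%^&|`?";
    // Characters whose presence permits a trailing + or -.
    static final String SPECIAL = "~!@#%^&|`?";

    static boolean isOperatorName(String s) {
        if (s.isEmpty())
            return false;
        for (int k = 0; k < s.length(); k++)
            if (OP_CHARS.indexOf(s.charAt(k)) < 0)
                return false;               // not an operator character
        if (s.contains("--") || s.contains("/*"))
            return false;                   // would start a comment
        char last = s.charAt(s.length() - 1);
        if (s.length() > 1 && (last == '+' || last == '-')) {
            for (int k = 0; k < s.length(); k++)
                if (SPECIAL.indexOf(s.charAt(k)) >= 0)
                    return true;            // trailing + or - is allowed
            return false;
        }
        return true;
    }
}
```

Under these rules =- is rejected (so that a = -b style expression lexes as two operators), while @- is accepted because it contains @.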

      public static final Pattern NEWLINE
      A newline, in any of the various forms recognized by the Java regex engine, letting it handle the details.

      public static final Pattern WHITESPACE_NO_NEWLINE
      White space except newline, for any Java-recognized newline.

      public static final Pattern SIMPLE_COMMENT
      The kind of comment that extends from -- to the end of the line. This pattern does not eat the newline (though the ISO production does).

      public static final Pattern BRACKETED_COMMENT_INSIDE
      Most of the inside of a bracketed comment, defined in an odd way. It expects both characters of the /* introducer to have been consumed already. This pattern will then eat the whole comment including both closing characters if it encounters no nested comment; otherwise it will consume everything including the / of the nested introducer, but leaving the *, and the <nest> capturing group will be present in the result. That signals the caller to increment the nesting level, consume one * and invoke this pattern again. If the nested match succeeds (without again setting the <nest> group), the caller should then decrement the nest level and match this pattern again to consume the rest of the comment at the original level.

      This pattern leaves the * unconsumed upon finding a nested comment introducer as a way to end the repetition in the SEPARATOR pattern, as nothing the SEPARATOR pattern can match can begin with a *.
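The nesting protocol described above can be driven by a small loop. The sketch below uses a hypothetical stand-in pattern that imitates the documented behavior (consume through the closing characters, or stop just after the / of a nested introducer with the nest group set); it is not the real BRACKETED_COMMENT_INSIDE, and the class and method names are assumptions.

```java
import java.util.InputMismatchException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedCommentSketch {
    // Hypothetical stand-in, NOT the real BRACKETED_COMMENT_INSIDE.
    static final Pattern INSIDE = Pattern.compile(
        "(?:[^*/]++|\\*(?!/)|/(?!\\*))*+(?:\\*/|(?<nest>/(?=\\*)))");

    /**
     * Given the offset just past an already-consumed comment introducer,
     * returns the offset just past the matching end of that comment.
     */
    static int skipComment(CharSequence s, int afterIntroducer) {
        Matcher m = INSIDE.matcher(s);
        int pos = afterIntroducer;
        int level = 1;
        while (level > 0) {
            m.region(pos, s.length());
            if (!m.lookingAt())
                throw new InputMismatchException("unclosed /*-style comment");
            if (m.group("nest") != null) {
                level++;
                pos = m.end() + 1;   // also consume the '*' that was left behind
            } else {
                level--;
                pos = m.end();       // both closing characters were consumed
            }
        }
        return pos;
    }
}
```

Keeping the nesting depth in a counter outside the regular expression is what lets a regular pattern participate in recognizing a non-regular construct.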


      public static final Pattern SEPARATOR
      SQL's SEPARATOR, which can include any amount of whitespace, simple comments, or bracketed comments. This pattern will consume as much of all that as it can in one match. There are two capturing groups that might be set in a match result: <nl> if there was at least one newline matched among the whitespace (which needs to be known to get the continuation of string literals right), and <nest> if the start of a bracketed comment was encountered.

      In the <nest> case, the / of the comment introducer will have been consumed but the * will remain to consume (as described above for BRACKETED_COMMENT_INSIDE); the caller will need to increment a nest level, consume the *, and match BRACKETED_COMMENT_INSIDE to handle the nesting comment. Assuming that completes without another <nest> found, the level should be decremented and BRACKETED_COMMENT_INSIDE matched again to match the rest of the outer comment. When that completes (without a <nest>) at the outermost level, this pattern should be matched again to mop up any remaining SEPARATOR content.

  • Method Details

    • separator

      public static boolean separator(Matcher m, boolean significant)
      Consume any SQL SEPARATOR at the beginning of Matcher m's current region.

      The region start is advanced to the character following any separator (or not at all, if no separator is found).

      The meaning of the return value is altered by the significant parameter: when significant is true (meaning the very presence or absence of a separator is significant at that point in the grammar), the result will be true if any separator was found, false otherwise. When significant is false, the result does not reveal whether any separator was found, but will be true only if a separator was found that includes at least one newline. That information is needed for the grammar of string and binary-string literals.

      Parameters:
      m - a Matcher whose current region should have any separator at the beginning consumed. The region start is advanced past any separator found. The Pattern associated with the Matcher may be changed.
      significant - when true, the result should report whether any separator was found or not; when false, the result should report only whether a separator containing at least one newline was found, or not.
      Returns:
      whether any separator was found, or whether any separator containing a newline was found, as selected by significant.
      Throws:
      InputMismatchException - if an unclosed /*-style comment is found.
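The contract of this method (advance the region start, swap the Matcher's pattern, and choose the meaning of the result by significant) can be illustrated with a reduced sketch. It is an assumption-laden imitation, not the real implementation: the stand-in TOKEN pattern handles only white space and -- comments, bracketed comments are left out entirely, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorSketch {
    // Hypothetical stand-in: one piece of separator per match; no /* */ support.
    static final Pattern TOKEN = Pattern.compile(
        "[ \\t\\f]++|(?<nl>\\R)|--[^\\n\\r]*+");

    static boolean separator(Matcher m, boolean significant) {
        m.usePattern(TOKEN);          // the Matcher's pattern is changed
        boolean any = false;
        boolean newline = false;
        while (m.lookingAt()) {
            any = true;
            if (m.group("nl") != null)
                newline = true;
            // Advance the region start past what was just matched.
            m.region(m.end(), m.regionEnd());
        }
        return significant ? any : newline;
    }
}
```

A caller would typically follow this with another usePattern call for whatever token is expected next, then lookingAt, since the region now starts at the first non-separator character.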
    • identifierFrom

      public static Lexicals.Identifier.Simple identifierFrom(Matcher m)
      Return an Identifier.Simple, given a Matcher that has matched an ISO_AND_PG_IDENTIFIER_CAPTURING. Will determine from the matching named groups which type of identifier it was, process the matched sequence appropriately, and return it.
      Parameters:
      m - a Matcher known to have matched an identifier.
      Returns:
      an Identifier.Simple made from the recovered string.