Package org.gjt.cuspy

Class JarX

  • Direct Known Subclasses:
    JarX.Build, Node

    public class JarX
    extends Object
    Distribute your work as a self-extracting jar file by including one file, JarX.class, that also safely converts text files to the receiver's encoding and newline conventions, and adds less than 7 kB to your jar.

    A self-extracting file is handy if your recipient might have a Java runtime environment but not the jar tool. The text conversion offered by JarX is useful if your distribution includes text files (source, documentation, scripts, etc.) and your recipients have platforms with different newline conventions.

    Text conversion background

    There are two issues in the cross-platform delivery of text files.
    1. Different platforms indicate the end of a line differently. The UNIX convention uses the single character LINE FEED; the (old) Macintosh used only the CARRIAGE RETURN character, and DOS/Windows systems require every line to end with a CARRIAGE RETURN followed by a LINE FEED. If some conversion isn't done, a Windows file appears to have garbage characters at the ends of lines if moved to UNIX, or the beginnings of lines if moved to a Mac; UNIX and Mac files moved to Windows, or Mac files moved to UNIX, appear to be squished into one insanely long line. These effects can complicate viewing and editing the files, and interfere with automated processes like diff or version control.
    2. Different platforms may use different default character encodings. Ideally, text files within a jar should be extracted into the local encoding.

    It's important to apply such transformations only to the files within the archive that are actually known to contain text. Passing binary data or class files through character and newline transformations will corrupt them.

    The ZIP approach and why it loses

    The popular zip format on which jar is based already has a provision for newline (but not character set) conversion. Each entry includes a text/binary bit, and the unzip program applies newline conversion while extracting, but only to the files flagged as text.

    One problem, though not the fatal one, with this scheme is that there is no single convention for newlines inside the zip file. Instead, files are stored just as they are found on the source system, and a code indicating the source operating system is stored in the archive. The receiving unzip program must interpret the code and know what newline convention that operating system used, in order to convert newlines appropriately.

    The fatal flaw, however, has to do with the way the text/binary bit gets set in the first place. While building the archive, the common zip programs look at statistical properties of byte frequencies in the input, and set the text bit for any entry that looks like it might be text! If a binary file happens to contain an unlucky sequence of bytes, it will be flagged as text and then silently corrupted by any unzip program that honors the text bit. That can happen, and has happened, to class files in zip archives if the recipient uses unzip -a, and causes significant misery if the package is widely distributed.

    A better way

    Even though the jar format is based on zip, it would be a mistake to make jar tools that rely on the zip text/binary bit, because common practice has made that bit unreliable. What's needed is a standard way for the developer to explicitly indicate the processing needed for each entry in the jar. Also, a single representation should be adopted for newlines in text files inside a jar, so an extracting program only needs to convert from that representation to the local one, and does not need to concern itself with details of the system where the jar was created.

    As of JDK 1.3, Sun has extended the Jar File Specification to allow a Content-Type in the Manifest for each jar entry. The value of Content-Type is a MIME type, and with this a developer can specify exactly which entries in a jar should be treated as text. The question of a standard representation for newlines inside the jar is settled, because [RFC2046 section 4.1.1] establishes a canonical line break representation for all subtypes of the text MIME type. Therefore, correct translation of line breaks from any platform to any platform can be achieved if a jar-building program just converts from its local convention to the canonical CRLF form, and a jar extraction program just converts the canonical to its own local form. Neither program needs to know anything about the other environment. Finally, the charset parameter of the text type allows explicit specification of the character encoding used in a jar entry, and the extracting program can automatically convert into the encoding used on the local system. (But see Call to action below.)
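The two-step conversion described above can be sketched as follows. This is only an illustration working on in-memory strings; the class and method names are hypothetical and are not part of JarX, and accepting every newline convention on the building side is an assumption made for the sketch.

```java
class NewlineSketch {
    /** Building side: turn CRLF, a lone CR, or a lone LF into canonical CRLF. */
    static String toCanonical(String text) {
        // The alternation tries CRLF first, so an existing CRLF is not doubled.
        return text.replaceAll("\r\n|\r|\n", "\r\n");
    }

    /** Extracting side: turn canonical CRLF into the given local separator. */
    static String toLocal(String canonical, String lineSeparator) {
        return canonical.replace("\r\n", lineSeparator);
    }

    public static void main(String[] args) {
        String canonical = toCanonical("one\ntwo\n");
        System.out.print(toLocal(canonical, System.lineSeparator()));
    }
}
```

Neither method needs to know which platform produced or will receive the text, which is the point of agreeing on one in-archive form.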

    What JarX Does

    Content-Type entries in a Manifest were introduced in Java 1.3 but are compatible with earlier jar specifications; a jar file containing such entries can be processed without any trouble by any jar tool compliant with the old or new standard. However, there is not yet a full jar tool available that will honor the content types and do automatic transformation of text entries. To fill the need until that functionality is added to the widely-available jar tools, JarX is available now.

    JarX.Build produces a jar, working from a manifest file prepared by the developer. Entries with any text type will be translated from the local encoding into the specified charset if given, and entries with the specific type text/plain will have their line endings converted to the CRLF canonical form. Line endings are left alone for all other subtypes of text, but this decision is open to comment.

    The file produced by JarX.Build is a fully compliant jar and can be unpacked by any jar or unzip tool, but current tools will not automatically convert the text files to the local conventions. By including the single class file JarX.class in the jar, a developer produces a self-extracting archive that can be executed to unpack itself on any Java 1.6 or later virtual machine, performing all automatic conversions and requiring no jar tool at all.

    Building a Jar

    To build a jar file, first prepare the manifest, using any text editor or, more likely, a script. Include a Name: entry for every file to be included in the jar. JarX.Build archives only the files named in the manifest. Be sure to include Manifest-Version: 1.0 as the first line of the manifest; JarX.Build does not do it for you. To make the jar self-extracting, make the next line
    Main-Class: org.gjt.cuspy.JarX
    and be sure to include a Name: entry for org/gjt/cuspy/JarX.class.

    Add an appropriate Content-Type: line after the Name: line for every entry that needs one. JarX itself only distinguishes the text types from nontext (everything else), and treats a missing Content-Type: as nontext, so for purposes of JarX you only need to add content types for text files. For other purposes you may wish to include the types of other entries as well. In the simplest case, just omit content types for your non-text files, and add Content-Type: text/plain; charset=UTF-8 for files that you want auto-converted. Then give the command
    java org.gjt.cuspy.JarX$Build foo.jar manifest
    if manifest is the name of your prepared manifest file and foo.jar names the jar you want to create. The order of files in the jar will be the order of their names in the manifest.
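Since the manifest is likely to come from a script, here is one rough way such a generator might look. The class name, the method name, and the rule that only names ending in .txt are text are all hypothetical; the sketch also does not wrap long lines, which a real manifest may need.

```java
import java.util.Arrays;
import java.util.List;

class ManifestSketch {
    /** Build manifest text for the given entry names, in the given order. */
    static String manifestFor(List<String> names) {
        StringBuilder sb = new StringBuilder();
        sb.append("Manifest-Version: 1.0\n");            // must be the first line
        sb.append("Main-Class: org.gjt.cuspy.JarX\n");   // makes the jar self-extracting
        for (String name : names) {
            sb.append('\n');                             // blank line starts a per-entry section
            sb.append("Name: ").append(name).append('\n');
            if (name.endsWith(".txt")) {                 // assumed rule for "this is text"
                sb.append("Content-Type: text/plain; charset=UTF-8\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(manifestFor(
            Arrays.asList("org/gjt/cuspy/JarX.class", "README.txt")));
    }
}
```

The output would be saved as the manifest file named on the JarX$Build command line.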

    Special manifest attributes

    As of 2016, JarX recognizes some special manifest attributes:
    _JarX_CharsetInArchive
    As a per-entry attribute, identifies the character set of the associated text member as stored in the archive. This is entirely equivalent to the earlier method of putting ;charset= on the Content-Type attribute, which JarX still supports but which has not been widely adopted. As a main attribute, sets a default for any text members without a per-entry value.
    _JarX_CharsetWhenUnpacked
    As a per-entry attribute, identifies the character set of the associated text member when not in the archive. At Build time, the member will be transcoded from this charset (instead of the platform's default) to the specified InArchive charset, and, on extraction, will be transcoded back to this charset regardless of the platform's default encoding. This attribute can be used for files conforming to specifications that define a fixed encoding. In other cases, omitting this attribute allows the member to be extracted into the receiving platform's default charset. As a main attribute, sets a default for text members without a per-entry value.
    _JarX_Permissions
    As a per-entry attribute, declares permissions to apply to the extracted file. (At present, not applied to directories.) Only the Java SE 6 File permissions are supported, a small subset of what most platforms support. A comma-separated list of usage=bywhom, where usage can be read, write, or execute and bywhom can be none, owner, or all. As a main attribute, sets a default for members without a per-entry attribute. For any usage that is left unspecified, no File method will be called to change that permission, so the system's defaults will apply.
    _JarX_PathResolver
    Only recognized as a main attribute, this specifies a script that JarX will invoke for every archive member, with the following bindings in scope:
    properties
    The Java system Properties object.
    storedPath
    The full pathname of the member, exactly as stored in the archive.
    platformPath
    The full pathname after only replacing the / separator character with the platform's file.separator if different.
    computedPath
    Initially the same as platformPath. If the script stores a new value in computedPath, the member will be extracted to that full path.
    The script is given as the value of this attribute, using the same RFC822-ish lexical conventions the jar spec says it was "inspired by". The value must begin with a MIME type (two atoms separated by a slash, as in application/javascript), followed by at least one QUOTEDSTRING. RFC822 uses the double-quote for this purpose, and backslash to escape it when needed, which also means you must double any backslash intended for the script. Additional QUOTEDSTRINGs simply append to the script. The RFC822 line-continuation rule can be exploited by supplying the script as multiple quoted strings, one per line, each indented by a space. The strings are appended with nothing in between (so, the continuation newlines do not become newlines in the script), but a / can appear between any two quoted strings to insert an explicit newline in the script. In addition to whatever comment syntax is allowed in the scripting language, RFC822 comments (marked by parentheses, and nestable) are allowed outside of the quoted strings.
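To make the _JarX_Permissions value format concrete, the following sketch parses a usage=bywhom list and applies it with the java.io.File setters. It is not the JarX implementation; the class name, the method names, and the exact order of File calls are assumptions.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

class PermissionsSketch {
    /** Parse e.g. "read=all, write=owner" into a usage-to-bywhom map. */
    static Map<String, String> parse(String spec) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String item : spec.split(",")) {
            String[] kv = item.trim().split("=");
            if (kv.length != 2
                    || !kv[0].trim().matches("read|write|execute")
                    || !kv[1].trim().matches("none|owner|all")) {
                throw new IllegalArgumentException("bad permission item: " + item);
            }
            result.put(kv[0].trim(), kv[1].trim());
        }
        return result;
    }

    /** Apply parsed permissions; a usage missing from the map is left alone. */
    static void apply(File f, Map<String, String> perms) {
        for (Map.Entry<String, String> e : perms.entrySet()) {
            boolean all = e.getValue().equals("all");
            boolean owner = e.getValue().equals("owner");
            // First set (for "all") or clear (otherwise) the bit for everyone,
            // then grant it back to the owner alone when asked for "owner".
            // The boolean results of the setters are ignored in this sketch.
            switch (e.getKey()) {
                case "read":
                    f.setReadable(all, false);
                    if (owner) f.setReadable(true, true);
                    break;
                case "write":
                    f.setWritable(all, false);
                    if (owner) f.setWritable(true, true);
                    break;
                default: // "execute"; parse admits nothing else
                    f.setExecutable(all, false);
                    if (owner) f.setExecutable(true, true);
                    break;
            }
        }
    }
}
```

Because only usages present in the map trigger a File call, the system defaults stay in effect for anything the attribute leaves unspecified.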

    Alternative to ScriptEngine for a path resolver

    With the removal of Nashorn in Java 15, leaving no scripting language that can be assumed present in the Java runtime, a script in the manifest may no longer be the simplest way to customize the resolution of path names when extracting. This class has been refactored now to expose two methods, prepareResolver and resolve, easily overridden in a subclass. The value of the _JarX_PathResolver main attribute is passed to prepareResolver as a string (so it can be parsed in any way useful to the subclass, not necessarily as described above, or ignored), and resolve is passed the stored path and platform path, and returns the platform path unchanged or a replacement. A self-extracting jar with resolution can be made without depending on any script engine, by placing two classes in the jar, JarX and the subclass, and naming the subclass as the jar's Main-Class. It needs a main method that simply instantiates the class and calls extract().
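As an illustration of the kind of logic such a subclass might carry, the sketch below strips a leading directory prefix from every stored path. It is written as a free-standing class so that it compiles without JarX on the class path; in a real subclass the two methods would override prepareResolver and resolve with the signatures documented below. The class name and the choice to treat the attribute value as a plain prefix are hypothetical.

```java
import java.io.File;

class PrefixResolverSketch {
    private String prefix = "";

    /** Here the attribute value is taken to be a bare path prefix, nothing more. */
    void prepareResolver(String v) {
        prefix = v.trim();
    }

    /** Return plat unchanged, a replacement path, or null to skip the entry. */
    String resolve(String orig, String plat) {
        if (prefix.isEmpty() || !orig.startsWith(prefix)) {
            return plat;                       // not under the prefix: leave as is
        }
        String rest = orig.substring(prefix.length());
        if (rest.isEmpty()) {
            return null;                       // the prefix directory itself: suppress
        }
        return rest.replace('/', File.separatorChar);
    }

    public static void main(String[] args) {
        PrefixResolverSketch r = new PrefixResolverSketch();
        r.prepareResolver("pkg-1.0/");
        System.out.println(r.resolve("pkg-1.0/doc/a.txt", "ignored"));
    }
}
```

With a manifest line such as _JarX_PathResolver: pkg-1.0/ this would extract pkg-1.0/doc/a.txt as doc/a.txt under the current directory.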

    Extracting a jar

    The command java -jar foo.jar is all it takes to extract a jar. The Main-Class entry in the manifest identifies the entry point of JarX so it does not need to be specified.

    Call to action

    At the moment, Sun's Jar File Specification contains a mistake in the description of a content type that could lead to implementations that reject valid content types. Squash this bug before it bites: log on to the Java Developer Connection (it's free) and cast one, two, or all three of your Bug Votes for Bug #4310708.

    Miscellany

    This class is a little sloppy and relatively slow, especially the Build side when converting plain text files. The idea for JarX is a natural outgrowth of the Java 1.3 manifest standard and I have suggested that the functionality of JarX be added into the widely available jar tools. If Sun takes the suggestion then the functionality of JarX will soon be provided by nice fast optimized tools and it won't be necessary to spend a lot of time polishing JarX.

    Error handling is roughly nonexistent. JarX is careful to avoid silent corruption of data, even verifying that all character encoding calls are successful, but makes no attempt to be graceful about errors or surprises. If something doesn't work the likely result is a one line message and abrupt exit, or an uncaught exception and stack trace.

    The coding style is a little contrived just to arrange it so JarX.class is the only file needed in the jar to make it self-extracting. In particular the JarX class is also written to serve as the class of tokens returned by the structured-field-body lexer, to avoid introducing a second class. Weird, perhaps, but harmless weird.

    Version:
    $Id$
    Author:
    Chapman Flack
    • Field Detail

      • treatment

        protected String treatment
        How to treat the entry being processed: bytes, characters, lines. Used only in the JarX instance created by main(). Set by classify(). Only the exact String instances BYTES, CHARACTERS, LINES are to be used.
      • archiveCharset

        protected Charset archiveCharset
        Charset (in archive) of the entry being processed. Used only in the JarX instance created by main(). Set by classify().
      • unpackedCharset

        protected Charset unpackedCharset
        Charset when unpacked of the entry being processed. Used only in the JarX instance created by main(). Set by classify().
      • readPermission

        protected String readPermission
        Read permission to be set on the file. Only the final Strings NONE, OWNER, or ALL are to be used, or null, in which case no explicit setting is made and the OS defaults apply.
      • writePermission

        protected String writePermission
        Write permission to be set on the file. Only the final Strings NONE, OWNER, or ALL are to be used, or null, in which case no explicit setting is made and the OS defaults apply.
      • executePermission

        protected String executePermission
        Execute permission to be set on the file. Only the final Strings NONE, OWNER, or ALL are to be used, or null, in which case no explicit setting is made and the OS defaults apply.
      • defaultTreatment

        protected String defaultTreatment
        As for treatment, but set from main attributes (or BYTES if not present).
      • defaultArchiveCharset

        protected Charset defaultArchiveCharset
        As for archiveCharset, but set from main attributes (default UTF-8).
      • defaultUnpackedCharset

        protected Charset defaultUnpackedCharset
        As for unpackedCharset, but set from main attributes or platform default.
      • defaultReadPermission

        protected String defaultReadPermission
        As for readPermission but set from main attributes, null if not present.
      • defaultWritePermission

        protected String defaultWritePermission
        As for writePermission but set from main attributes, null if not present.
      • defaultExecutePermission

        protected String defaultExecutePermission
        As for executePermission but set from main attributes, null if not present.
      • resolverEngine

        protected ScriptEngine resolverEngine
        Script engine to run the name resolver script, if any.
      • resolverScript

        protected String resolverScript
        The name resolver script, if any.
      • ARCHIVE_CHARSET

        public final Attributes.Name ARCHIVE_CHARSET
        Attribute name for specifying the in-archive charset. The Java powers that be didn't go for Bug #4310708 so there needs to be a dedicated manifest key for this (though JarX will still honor ;charset= on the Content-Type too).
      • UNPACKED_CHARSET

        public final Attributes.Name UNPACKED_CHARSET
        Attribute name for specifying the when-unpacked charset. This was not in the original JarX; the platform default was always used, and still is if this attribute is not present.
      • PERMISSIONS

        public final Attributes.Name PERMISSIONS
        Permissions (only as supported in java.io.File for SE 6) spec *(, spec) where spec is action=whom, action is read, write, or execute, and whom is none, owner, or all.
      • PATHRESOLVER

        public final Attributes.Name PATHRESOLVER
        Main attribute to specify a JSR223 script to control extracted names.
      • mainAttributes

        protected Attributes mainAttributes
        Main attributes saved from the manifest (which must be seen early).
      • type

        public short type
        Token type, when JarX objects are used to return content type tokens.
      • value

        public String value
        Token text, when JarX objects are used to return content type tokens.
    • Constructor Detail

      • JarX

        public JarX()
        Public constructor for an application using JarX to unpack jars.
      • JarX

        protected JarX​(short t,
                       String v)
        Constructor for JarX objects used as tokens returned by the lexer.
        Parameters:
        t - the type of this token
        v - the corresponding text (with delimiters removed and backslashes resolved for quoted strings, domain text, and comments)
    • Method Detail

      • is

        public boolean is​(short... type)
        True if this JarX object represents a token of one of the given types.
        Parameters:
        type - allowable types
        Returns:
        as titled
      • holds

        public boolean holds​(String value,
                             short... type)
        True if this JarX object represents a token of one of the given types and its value equals the given string.
        Parameters:
        value - string value for comparison
        type - allowable types
        Returns:
        as titled
      • holdsIgnoreCase

        public boolean holdsIgnoreCase​(String value,
                                       short... type)
        True if this JarX object represents a token of one of the given types and its value equals the given string, case-insensitively.
        Parameters:
        value - string value for comparison
        type - allowable types
        Returns:
        as titled
      • main

        public static void main​(String[] args)
                         throws Exception
        The entry point for extracting.
        Parameters:
        args - argument list
        Throws:
        Exception - if anything doesn't work, punt
      • extract

        public void extract()
                     throws Exception
        Find the jar I was loaded from and extract all entries except my own class file.
        Throws:
        Exception - if anything doesn't work, punt
      • notMe

        public boolean notMe​(String name)
        True if the passed name is not the in-jar name of this class or related classes that should not be extracted.

        If not overridden, this method returns false only for names matching the class of this or any ancestral superclass. Interfaces are not considered. A subclass could apply a different policy.
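The default policy could be approximated as below. This is a guess at the behavior from the description, not the actual source; the class name and the extra Class parameter are hypothetical and exist only to make the sketch testable on its own.

```java
class NotMeSketch {
    /** False if name is the in-jar name of start or of any of its superclasses. */
    static boolean notMe(Class<?> start, String name) {
        for (Class<?> c = start; c != null; c = c.getSuperclass()) {
            // In-jar names use '/' between package components and end in .class
            String inJar = c.getName().replace('.', '/') + ".class";
            if (inJar.equals(name)) {
                return false;
            }
        }
        return true; // interfaces are never visited, so they are extracted
    }

    public static void main(String[] args) {
        System.out.println(
            notMe(java.util.ArrayList.class, "java/util/AbstractList.class"));
    }
}
```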

      • setDefaults

        public void setDefaults​(Attributes mainAttributes)
        Examine the main attributes to set any defaults. Includes loading the required script engine if a name resolver script is given.
        Parameters:
        mainAttributes - as obtained from the manifest
      • prepareResolver

        public void prepareResolver​(String v)
                             throws Exception
        Prepare a resolver of pathnames, given the value of the PATHRESOLVER main attribute.

        If not overridden in a subclass, this method parses it as a MIME type and script as described in the class comments, loads a ScriptEngine for the MIME type, and saves references to the engine in resolverEngine and the script in resolverScript.

        Parameters:
        v - value of the _JarX_PathResolver main attribute
        Throws:
        Exception - this implementation throws no checked exceptions, but an overriding implementation may
      • resolve

        public String resolve​(String orig,
                              String plat)
                       throws Exception
        Called with every path to be extracted; returns a possibly-corrected path.

        If not overridden in a subclass, this method returns plat unchanged if no resolverScript has been set, and otherwise invokes the script with storedPath bound to orig, platformPath and computedPath both bound to plat, then returns the value bound to computedPath when the script has returned.

        Parameters:
        orig - The path as stored in the archive, always /-separated
        plat - The path after only replacing / with the platform separator
        Returns:
        plat unchanged, or a corrected location for extracting the entry, or null to suppress extracting the entry
        Throws:
        Exception - this implementation may throw ScriptException, an overriding implementation may throw others
      • classify

        public void classify​(Attributes atts,
                             boolean lazy)
        Set instance variables for text/binary and permissions treatment according to the passed Attributes.
        Parameters:
        atts - Usually a per-entry attribute set, but classify is also called by setDefaults to parse the main attributes.
        lazy - In the usual case, as soon as an entry is classified as non-text, classify can return without looking for charset information. When called by setDefaults, however, laziness is not appropriate.
      • archiveCharsetFromType

        protected boolean archiveCharsetFromType​(JarX[] type)
        Parse a Content-Type for any charset parameter.
        Parameters:
        type - tokenized Content-Type value
        Returns:
        true if the Content-Type specified a charset
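For readers who want a feel for what classification and charset lookup involve, here is a much simplified sketch. It splits on semicolons rather than using the RFC822 lexer (so comments and quoted semicolons are not handled), and the class and method names are hypothetical. The returned strings only mirror the bytes, characters, lines treatments described under the treatment field.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class ContentTypeSketch {
    /** "lines" for text/plain, "characters" for other text types, else "bytes". */
    static String treatmentOf(String contentType) {
        if (contentType == null) {
            return "bytes";                    // a missing Content-Type is nontext
        }
        int semi = contentType.indexOf(';');
        String type = (semi < 0 ? contentType : contentType.substring(0, semi))
            .trim().toLowerCase(Locale.ROOT);
        if (type.equals("text/plain")) return "lines";
        if (type.startsWith("text/")) return "characters";
        return "bytes";
    }

    /** The charset parameter if one is present, else null. */
    static Charset charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            int eq = p.indexOf('=');
            if (eq < 0 || !p.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                continue;
            }
            String val = p.substring(eq + 1).trim();
            if (val.length() >= 2 && val.startsWith("\"") && val.endsWith("\"")) {
                val = val.substring(1, val.length() - 1); // drop surrounding quotes
            }
            return Charset.forName(val);       // throws if the name is unknown
        }
        return null;
    }
}
```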
      • extract

        public void extract​(JarEntry je,
                            InputStream is)
                     throws Exception
        Extract a single entry, performing any appropriate conversion.
        Parameters:
        je - JarEntry for the current entry
        is - InputStream with the current entry content
        Throws:
        Exception - IOException for any problem involving I/O, ScriptException possible from the non-overridden path resolver, others possible in an overridden implementation
      • shovel

        public void shovel​(InputStream is,
                           OutputStream os)
                    throws IOException
        Copy content from an input to an output stream until end. Whether the content is shoveled as bytes, characters, or lines will be determined by instance variables that have been set by calling classify before calling this method.
        Parameters:
        is - source of input
        os - destination for output
        Throws:
        IOException - for any problem involving I/O
      • shovelBytes

        public static void shovelBytes​(InputStream is,
                                       OutputStream os)
                                throws IOException
        Copy bytes from an input to an output stream until end. No character encoding or newline conversion applies.
        Parameters:
        is - source of input
        os - destination for output
        Throws:
        IOException - for any problem involving I/O
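A byte-for-byte copy of this kind is usually a short read/write loop. The sketch below shows the common idiom; the class name, buffer size, and returned byte count are choices made for the example and say nothing about how JarX itself is written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class ShovelBytesSketch {
    /** Copy everything from is to os untouched; returns the number of bytes copied. */
    static long copy(InputStream is, OutputStream os) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = is.read(buf)) != -1) {     // -1 signals end of stream
            os.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long n = copy(new ByteArrayInputStream(new byte[] {1, 2, 3}), out);
        System.out.println(n + " bytes copied");
    }
}
```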
      • shovelText

        public void shovelText​(InputStream is,
                               OutputStream os)
                        throws IOException
        Copy text from an input to an output stream until end. Determines the encoding transformation to use (based on the charset content-type parameter) and whether to copy as lines (with newline conversion) or unmolested characters. text/plain is copied as lines, all other text subtypes as characters.
        Parameters:
        is - source of input
        os - destination of output
        Throws:
        IOException - for any problem involving I/O
      • shovelLines

        public void shovelLines​(InputStream is,
                                OutputStream os)
                         throws IOException
        Copy lines of text from an input to an output stream, applying the specified character encoding and translating newlines. This method handles the extracting case, where the named encoding is associated with the input stream (jar) and the platform default encoding with the output (local file), and the local line.separator is used to separate lines on the output. Overridden in JarX.Build to do the reverse when building a jar. To avoid silent corruption of data, this method verifies that all characters from the jar are successfully converted to the local platform's encoding.
        Parameters:
        is - the source of input
        os - destination for output
        Throws:
        IOException - for any problem involving I/O
      • shovelChars

        public void shovelChars​(InputStream is,
                                OutputStream os)
                         throws IOException
        Copy characters of text from an input to an output stream, applying the specified character encoding but not translating newlines. This method handles the extracting case, where the named encoding is associated with the input stream (jar) and the platform default encoding with the output (local file). Overridden in JarX.Build to do the reverse when building a jar. To avoid silent corruption of data, this method verifies that all characters from the jar are successfully converted to the local platform's encoding.
        Parameters:
        is - the source of input
        os - destination for output
        Throws:
        IOException - for any problem involving I/O
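One way to get the "no silent corruption" guarantee with the standard library is to configure the decoder and encoder to report problems instead of substituting replacement characters. The sketch below works on a whole byte array rather than on streams, and its class and method names are hypothetical; it is meant only to show the checking technique, not the JarX code.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    /** Re-encode in from one charset to another, failing on any lossy step. */
    static byte[] transcode(byte[] in, Charset from, Charset to)
            throws CharacterCodingException {
        CharBuffer chars = from.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(in));
        ByteBuffer out = to.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(chars);
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = transcode(utf8, StandardCharsets.UTF_8,
                                  StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length + " bytes after transcoding");
    }
}
```

With CodingErrorAction.REPORT, a character that the target charset cannot represent raises a CharacterCodingException rather than quietly becoming a question mark.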
      • structuredFieldBody

        public static JarX[] structuredFieldBody​(String field,
                                                 int off)
        Lexical analyzer for structured field bodies as described in RFC822 and modified in RFC2045. Comments are processed and stored in tokens that are, at the last minute, excluded from the returned token list; only two lines would need to be changed to use this lexer in an application that wanted comments returned.
        Parameters:
        field - a header field
        off - offset to the start of the structured field body (skip the field name and colon)
        Returns:
        An array of tokens with any COMMENT tokens (for JarX purposes) excluded