Package org.gjt.cuspy

Class JarX

java.lang.Object
org.gjt.cuspy.JarX
Direct Known Subclasses:
JarX.Build, Node

public class JarX
extends Object
Distribute your work as a self-extracting jar file by including one file, JarX.class, that also safely converts text files to the receiver's encoding and newline conventions, and adds less than 7 kB to your jar.

A self-extracting file is handy if your recipient might have a Java runtime environment but not the jar tool. The text conversion offered by JarX is useful if your distribution will include text files, source, documentation, scripts, etc., and your recipients have platforms with different newline conventions.

Text conversion background

There are two issues in the cross-platform delivery of text files.
  1. Different platforms indicate the end of a line differently. The UNIX convention uses the single character LINE FEED; the (old) Macintosh used only the CARRIAGE RETURN character, and DOS/Windows systems require every line to end with a CARRIAGE RETURN followed by a LINE FEED. If some conversion isn't done, a Windows file appears to have garbage characters at the ends of lines if moved to UNIX, or the beginnings of lines if moved to a Mac; UNIX and Mac files moved to Windows, or Mac files moved to UNIX, appear to be squished into one insanely long line. These effects can complicate viewing and editing the files, and interfere with automated processes like diff or version control.
  2. Different platforms may use different default character encodings. Ideally, text files within a jar should be extracted into the local encoding.

It's important to apply such transformations only to the files within the archive that are actually known to contain text. Passing binary data or class files through character and newline transformations will corrupt them.

The ZIP approach and why it loses

The popular zip format on which jar is based already has a provision for newline (but not character set) conversion. Each entry includes a text/binary bit, and the unzip program applies newline conversion while extracting, but only to the files flagged as text.

One problem, though not the fatal one, with this scheme is that there is no single convention for newlines inside the zip file. Instead, files are stored just as they are found on the source system, and a code indicating the source operating system is stored in the archive. The receiving unzip program must interpret the code and know what newline convention that operating system used, in order to convert newlines appropriately.

The fatal flaw, however, has to do with the way the text/binary bit gets set in the first place. While building the archive, the common zip programs look at statistical properties of byte frequencies in the input, and set the text bit for any entry that looks like it might be text! If a binary file happens to contain an unlucky sequence of bytes, it will be flagged as text and then silently corrupted by any unzip program that honors the text bit. That can happen, and has happened, to class files in zip archives if the recipient uses unzip -a, and causes significant misery if the package is widely distributed.

A better way

Even though the jar format is based on zip, it would be a mistake to make jar tools that rely on the zip text/binary bit, because common practice has made that bit unreliable. What's needed is a standard way for the developer to explicitly indicate the processing needed for each entry in the jar. Also, a single representation should be adopted for newlines in text files inside a jar, so an extracting program only needs to convert from that representation to the local one, and does not need to concern itself with details of the system where the jar was created.

As of JDK 1.3, Sun has extended the Jar File Specification to allow a Content-Type in the Manifest for each jar entry. The value of Content-Type is a MIME type, and with this a developer can specify exactly which entries in a jar should be treated as text. The question of a standard representation for newlines inside the jar is settled, because [RFC2046 section 4.1.1] establishes a canonical line break representation for all subtypes of the text MIME type. Therefore, correct translation of line breaks from any platform to any platform can be achieved if a jar-building program just converts from its local convention to the canonical CRLF form, and a jar extraction program just converts the canonical to its own local form. Neither program needs to know anything about the other environment. Finally, the charset parameter of the text type allows explicit specification of the character encoding used in a jar entry, and the extracting program can automatically convert into the encoding used on the local system. (But see Call to action below.)
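
The two-step newline translation described above can be sketched as follows. This is only an illustration, not JarX's own code: the class and method names are hypothetical, and the builder side here leniently accepts any of the three conventions rather than only the local one.

```java
public class NewlineSketch {

    /** Builder side: turn CR, LF, or CRLF line breaks into canonical CRLF. */
    public static String toCanonical(String local) {
        // CRLF is listed first so it is replaced as one unit, not as CR then LF.
        return local.replaceAll("\r\n|\r|\n", "\r\n");
    }

    /** Extractor side: turn canonical CRLF into the receiver's separator. */
    public static String toLocal(String canonical, String separator) {
        return canonical.replace("\r\n", separator);
    }

    public static void main(String[] args) {
        String unixText = "one\ntwo\n";
        String inJar = toCanonical(unixText);
        // Neither side needs to know the other's convention.
        String onReceiver = toLocal(inJar, System.lineSeparator());
        System.out.println(inJar.equals("one\r\ntwo\r\n"));
        System.out.println(onReceiver.equals("one" + System.lineSeparator()
                + "two" + System.lineSeparator()));
    }
}
```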

What JarX Does

Content-Type entries in a Manifest were introduced in Java 1.3 but are compatible with earlier jar specifications; a jar file containing such entries can be processed without any trouble by any jar tool compliant with the old or new standard. However, there is not yet a full jar tool available that will honor the content types and do automatic transformation of text entries. To fill the need until that functionality is added to the widely-available jar tools, JarX is available now.

JarX.Build produces a jar, working from a manifest file prepared by the developer. Entries with any text type will be translated from the local encoding into the specified charset if given, and entries with the specific type text/plain will have their line endings converted to the CRLF canonical form. Line endings are left alone for all other subtypes of text, but this decision is open to comment.

The file produced by JarX.Build is a fully compliant jar and can be unpacked by any jar or unzip tool, but current tools will not automatically convert the text files to the local conventions. By including the single class file JarX.class in the jar, a developer produces a self-extracting archive that can be executed to unpack itself on any Java 1.6 or later virtual machine, performing all automatic conversions and requiring no jar tool at all.

Building a Jar

To build a jar file, first prepare the manifest, using any text editor or, more likely, a script. Include a Name: entry for every file to be included in the jar. JarX.Build archives only the files named in the manifest. Be sure to include Manifest-Version: 1.0 as the first line of the manifest; JarX.Build does not do it for you. To make the jar self-extracting, make the next line
Main-Class: org.gjt.cuspy.JarX
and be sure to include a Name: entry for org/gjt/cuspy/JarX.class.

Add an appropriate Content-Type: line after the Name: line for every entry that needs one. JarX itself only distinguishes the text types from nontext (everything else), and treats a missing Content-Type: as nontext, so for purposes of JarX you only need to add content types for text files. For other purposes you may wish to include the types of other entries as well. In the simplest case, just omit content types for your non-text files, and add Content-Type: text/plain; charset=UTF-8 for files that you want auto-converted. Then give the command
java org.gjt.cuspy.JarX$Build foo.jar manifest
if manifest is the name of your prepared manifest file and foo.jar names the jar you want to create. The order of files in the jar will be the order of their names in the manifest.
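
Since the manifest is likely to come from a script, here is a minimal sketch of generating one in Java. The class name, the method name, and the rule "anything ending in .txt is text/plain" are assumptions made for illustration. It also does not wrap long lines, so it assumes every generated line fits within the 72-byte limit of the jar specification.

```java
import java.util.Arrays;
import java.util.List;

public class ManifestSketch {

    /** Build manifest text for the given members, in the given order. */
    public static String manifestFor(List<String> paths) {
        StringBuilder sb = new StringBuilder();
        sb.append("Manifest-Version: 1.0\n");          // must be the first line
        sb.append("Main-Class: org.gjt.cuspy.JarX\n"); // makes the jar self-extracting
        for (String path : paths) {
            sb.append('\n');                           // blank line starts a new entry
            sb.append("Name: ").append(path).append('\n');
            if (path.endsWith(".txt")) {               // hypothetical rule for "is text"
                sb.append("Content-Type: text/plain; charset=UTF-8\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(manifestFor(Arrays.asList(
                "org/gjt/cuspy/JarX.class", "README.txt", "lib/tool.class")));
    }
}
```

The output can be saved as the manifest file and passed to JarX.Build as shown above. When typing that command in a POSIX shell, the $ in JarX$Build generally needs to be quoted or escaped so the shell does not treat $Build as a variable.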

Special manifest attributes

For 2016, JarX now recognizes some special manifest attributes:
_JarX_CharsetInArchive
As a per-entry attribute, identifies the character set of the associated text member as stored in the archive. This is entirely equivalent to the earlier method of putting ;charset= on the Content-Type attribute; JarX still supports that method, but it has not been widely adopted. As a main attribute, sets a default for any text members without a per-entry value.
_JarX_CharsetWhenUnpacked
As a per-entry attribute, identifies the character set of the associated text member when not in the archive. At Build time, the member will be transcoded from this charset (instead of the platform's default) to the specified InArchive charset, and, on extraction, will be transcoded back to this charset regardless of the platform's default encoding. This attribute can be used for files conforming to specifications that define a fixed encoding. In other cases, omitting this attribute allows the member to be extracted into the receiving platform's default charset. As a main attribute, sets a default for text members without a per-entry value.
_JarX_Permissions
As a per-entry attribute, declares permissions to apply to the extracted file. (At present, not applied to directories.) Only the Java SE 6 File permissions are supported, a small subset of what most platforms support. The value is a comma-separated list of usage=bywhom items, where usage can be read, write, or execute and bywhom can be none, owner, or all. As a main attribute, sets a default for members without a per-entry attribute. For any usage that is left unspecified, no File method will be called to change that permission, so the system's defaults will apply.
_JarX_PathResolver
Only recognized as a main attribute, this specifies a script that JarX will invoke for every archive member, with the following bindings in scope:
properties
The Java system Properties object.
storedPath
The full pathname of the member, exactly as stored in the archive.
platformPath
The full pathname after only replacing the / separator character with the platform's file.separator if different.
computedPath
Initially the same as platformPath. If the script stores a new value in computedPath, the member will be extracted to that full path.
The script is given as the value of this attribute, using the same RFC822-ish lexical conventions the jar spec says it was "inspired by". The value must begin with a MIME type (two atoms separated by a slash, as in application/javascript), followed by at least one QUOTEDSTRING. RFC822 uses the double-quote for this purpose, and backslash to escape it when needed, which also means you must double any backslash intended for the script. Additional QUOTEDSTRINGs simply append to the script. The RFC822 line-continuation rule can be exploited by supplying the script as multiple quoted strings, one per line, each indented by a space. The strings are appended with nothing in between (so, the continuation newlines do not become newlines in the script), but a / can appear between any two quoted strings to insert an explicit newline in the script. In addition to whatever comment syntax is allowed in the scripting language, RFC822 comments (marked by parentheses, and nestable) are allowed outside of the quoted strings.
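
The bindings listed above can be pictured with a short javax.script sketch. This is not JarX's implementation: the class and method names are hypothetical, the script is assumed to be already extracted from the quoted strings, and falling back to platformPath when no engine is found for the MIME type is an assumption made here so the sketch runs on any runtime.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class ScriptResolverSketch {

    /** Run a resolver script for one member and return the path it computed. */
    public static String resolve(String mimeType, String script,
                                 String storedPath, String platformPath) {
        ScriptEngine engine =
                new ScriptEngineManager().getEngineByMimeType(mimeType);
        if (engine == null) {
            // Assumption for this sketch only: no engine means no change.
            return platformPath;
        }
        // The four bindings named in the attribute description.
        engine.put("properties", System.getProperties());
        engine.put("storedPath", storedPath);
        engine.put("platformPath", platformPath);
        engine.put("computedPath", platformPath); // initially the same
        try {
            engine.eval(script);
        } catch (ScriptException e) {
            throw new IllegalStateException("path resolver script failed", e);
        }
        // Whatever the script left in computedPath is where the member goes.
        Object computed = engine.get("computedPath");
        return computed == null ? null : computed.toString();
    }
}
```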

Alternative to ScriptEngine for a path resolver

With the removal of Nashorn in Java 15, leaving no scripting language that can be assumed present in the Java runtime, a script in the manifest may no longer be the simplest way to customize the resolution of path names when extracting. This class has been refactored now to expose two methods, prepareResolver and resolve, easily overridden in a subclass. The value of the _JarX_PathResolver main attribute is passed to prepareResolver as a string (so it can be parsed in any way useful to the subclass, not necessarily as described above, or ignored), and resolve is passed the stored path and platform path, and returns the platform path unchanged or a replacement. A self-extracting jar with resolution can be made without depending on any script engine, by placing two classes in the jar, JarX and the subclass, and naming the subclass as the jar's Main-Class. It needs a main method that simply instantiates the class and calls extract().
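
A subclass along those lines might look like the sketch below. So that it compiles without JarX on the class path, it extends a small stand-in class that only mirrors the documented signatures; in a real jar the subclass would extend org.gjt.cuspy.JarX instead. The names PrefixResolver and JarXStandIn, and the policy of prefixing a directory and skipping .bak files, are hypothetical.

```java
import java.io.File;

public class PathResolverSubclassSketch {

    /** Stand-in mirroring the JarX methods used here; not the real class. */
    public static class JarXStandIn {
        public void prepareResolver(String v) throws Exception { }
        public String resolve(String orig, String plat) throws Exception {
            return plat;
        }
        public void extract() throws Exception { } // the real one unpacks the jar
    }

    /** Extracts everything under a directory named by _JarX_PathResolver. */
    public static class PrefixResolver extends JarXStandIn {
        private String prefix = "";

        @Override
        public void prepareResolver(String v) {
            // The attribute value is parsed however the subclass likes;
            // here it is simply taken as a directory name.
            prefix = (v == null) ? "" : v.trim();
        }

        @Override
        public String resolve(String orig, String plat) {
            if (orig.endsWith(".bak")) {
                return null; // null suppresses extraction of the entry
            }
            return prefix.isEmpty() ? plat : prefix + File.separator + plat;
        }

        // Named as Main-Class in the manifest of the self-extracting jar.
        public static void main(String[] args) throws Exception {
            new PrefixResolver().extract();
        }
    }
}
```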

Extracting a jar

The command java -jar foo.jar is all it takes to extract a jar. The Main-Class entry in the manifest identifies the entry point of JarX so it does not need to be specified.

Call to action

At the moment, Sun's Jar File Specification contains a mistake in the description of a content type that could lead to implementations that reject valid content types. Squash this bug before it bites: log on to the Java Developer Connection (it's free) and cast one, two, or all three of your Bug Votes for Bug #4310708.

Miscellany

This class is a little sloppy and relatively slow, especially the Build side when converting plain text files. The idea for JarX is a natural outgrowth of the Java 1.3 manifest standard and I have suggested that the functionality of JarX be added into the widely available jar tools. If Sun takes the suggestion then the functionality of JarX will soon be provided by nice fast optimized tools and it won't be necessary to spend a lot of time polishing JarX.

Error handling is roughly nonexistent. JarX is careful to avoid silent corruption of data, even verifying that all character encoding calls are successful, but makes no attempt to be graceful about errors or surprises. If something doesn't work the likely result is a one line message and abrupt exit, or an uncaught exception and stack trace.

The coding style is a little contrived just to arrange it so JarX.class is the only file needed in the jar to make it self-extracting. In particular the JarX class is also written to serve as the class of tokens returned by the structured-field-body lexer, to avoid introducing a second class. Weird, perhaps, but harmless weird.

Version:
$Id$
Author:
Chapman Flack
  • Field Details

    • treatment

      protected String treatment
      How to treat the entry being processed: bytes, characters, lines. Used only in the JarX instance created by main(). Set by classify(). Only the exact String instances BYTES, CHARACTERS, LINES are to be used.
    • BYTES

      protected static final String BYTES
      See Also:
      Constant Field Values
    • CHARACTERS

      protected static final String CHARACTERS
      See Also:
      Constant Field Values
    • LINES

      protected static final String LINES
      See Also:
      Constant Field Values
    • archiveCharset

      protected Charset archiveCharset
      Charset (in archive) of the entry being processed. Used only in the JarX instance created by main(). Set by classify().
    • unpackedCharset

      protected Charset unpackedCharset
      Charset when unpacked of the entry being processed. Used only in the JarX instance created by main(). Set by classify().
    • readPermission

      protected String readPermission
      Read permission to be set on the file. Only the final Strings NONE, OWNER, or ALL are to be used, or null, in which case no explicit setting is made and the OS defaults apply.
    • writePermission

      protected String writePermission
      Write permission to be set on the file. Only the final Strings NONE, OWNER, or ALL are to be used, or null, in which case no explicit setting is made and the OS defaults apply.
    • executePermission

      protected String executePermission
      Execute permission to be set on the file. Only the final Strings NONE, OWNER, or ALL are to be used, or null, in which case no explicit setting is made and the OS defaults apply.
    • NONE

      protected static final String NONE
      See Also:
      Constant Field Values
    • OWNER

      protected static final String OWNER
      See Also:
      Constant Field Values
    • ALL

      protected static final String ALL
      See Also:
      Constant Field Values
    • defaultTreatment

      protected String defaultTreatment
      As for treatment, but set from main attributes (or BYTES if not present).
    • defaultArchiveCharset

      protected Charset defaultArchiveCharset
      As for archiveCharset, but set from main attributes (default UTF-8).
    • defaultUnpackedCharset

      protected Charset defaultUnpackedCharset
      As for unpackedCharset, but set from main attributes or platform default.
    • defaultReadPermission

      protected String defaultReadPermission
      As for readPermission but set from main attributes, null if not present.
    • defaultWritePermission

      protected String defaultWritePermission
      As for writePermission but set from main attributes, null if not present.
    • defaultExecutePermission

      protected String defaultExecutePermission
      As for executePermission but set from main attributes, null if not present.
    • resolverEngine

      protected ScriptEngine resolverEngine
      Script engine to run the name resolver script, if any.
    • resolverScript

      protected String resolverScript
      The name resolver script, if any.
    • ARCHIVE_CHARSET

      public final Attributes.Name ARCHIVE_CHARSET
      Attribute name for specifying the in-archive charset. The Java powers that be didn't go for Bug #4310708 so there needs to be a dedicated manifest key for this (though JarX will still honor ;charset= on the Content-Type too).
    • UNPACKED_CHARSET

      public final Attributes.Name UNPACKED_CHARSET
      Attribute name for specifying the when-unpacked charset. This was not in the original JarX; the platform default was always used, and still is if this attribute is not present.
    • PERMISSIONS

      public final Attributes.Name PERMISSIONS
      Permissions (only as supported in java.io.File for SE 6) spec *(, spec) where spec is action=whom, action is read, write, or execute, and whom is none, owner, or all.
    • PATHRESOLVER

      public final Attributes.Name PATHRESOLVER
      Main attribute to specify a JSR223 script to control extracted names.
    • mainAttributes

      protected Attributes mainAttributes
      Main attributes saved from the manifest (which must be seen early).
    • type

      public short type
      Token type, when JarX objects are used to return content type tokens
    • value

      public String value
      Token text when JarX objects are used to return content type tokens
    • ATOM

      public static final short ATOM
      Token types from the structured field body lexer defined in RFC822 as modified in RFC2045. Also state numbers for the automaton in structuredFieldBody.
      See Also:
      Constant Field Values
    • COMMENT

      public static final short COMMENT
      See Also:
      Constant Field Values
    • DOMAINLITERAL

      public static final short DOMAINLITERAL
      See Also:
      Constant Field Values
    • QUOTEDSTRING

      public static final short QUOTEDSTRING
      See Also:
      Constant Field Values
    • TSPECIAL

      public static final short TSPECIAL
      See Also:
      Constant Field Values
    • manifestName

      public static final String manifestName
      Name of the manifest file as stored in the jar
      See Also:
      Constant Field Values
    • manifestCode

      public static final String manifestCode
      The (fixed) encoding used for manifest content
      See Also:
      Constant Field Values
  • Constructor Details

    • JarX

      public JarX()
      Public constructor for an application using JarX to unpack jars.
    • JarX

      protected JarX​(short t, String v)
      Constructor for JarX objects used as tokens returned by the lexer.
      Parameters:
      t - the type of this token
      v - the corresponding text (with delimiters removed and backslashes resolved for quoted strings, domain text, and comments)
  • Method Details

    • is

      public boolean is​(short... type)
      True if this JarX object represents a token of one of the given types.
      Parameters:
      type - allowable types
      Returns:
      as titled
    • holds

      public boolean holds​(String value, short... type)
      True if this JarX object represents a token of one of the given types and its value equals the given string.
      Parameters:
      value - string value for comparison
      type - allowable types
      Returns:
      as titled
    • holdsIgnoreCase

      public boolean holdsIgnoreCase​(String value, short... type)
      True if this JarX object represents a token of one of the given types and its value equals the given string, case-insensitively.
      Parameters:
      value - string value for comparison
      type - allowable types
      Returns:
      as titled
    • main

      public static void main​(String[] args) throws Exception
      The entry point for extracting.
      Parameters:
      args - argument list
      Throws:
      Exception - if anything doesn't work, punt
    • extract

      public void extract() throws Exception
      Find the jar I was loaded from and extract all entries except my own class file.
      Throws:
      Exception - if anything doesn't work, punt
    • notMe

      public boolean notMe​(String name)
      True if the passed name is not the in-jar name of this class or related classes that should not be extracted.

      If not overridden, this method returns false only for names matching the class of this or any ancestral superclass. Interfaces are not considered. A subclass could apply a different policy.

    • setDefaults

      public void setDefaults​(Attributes mainAttributes)
      Examine the main attributes to set any defaults. Includes loading the required script engine if a name resolver script is given.
      Parameters:
      mainAttributes - as obtained from the manifest
    • prepareResolver

      public void prepareResolver​(String v) throws Exception
      Prepare a resolver of pathnames, given the value of the PATHRESOLVER main attribute.

      If not overridden in a subclass, this method parses it as a MIME type and script as described in the class comments, loads a ScriptEngine for the MIME type, and saves references to the engine in resolverEngine and the script in resolverScript.

      Parameters:
      v - value of the _JarX_PathResolver main attribute
      Throws:
      Exception - this implementation throws no checked exceptions, but an overriding implementation may
    • resolve

      public String resolve​(String orig, String plat) throws Exception
      Called with every path to be extracted; returns a possibly-corrected path.

      If not overridden in a subclass, this method returns plat unchanged if no resolverScript has been set, and otherwise invokes the script with storedPath bound to orig, platformPath and computedPath both bound to plat, then returns the value bound to computedPath when the script has returned.

      Parameters:
      orig - The path as stored in the archive, always /-separated
      plat - The path after only replacing / with the platform separator
      Returns:
      plat unchanged, or a corrected location for extracting the entry, or null to suppress extracting the entry
      Throws:
      Exception - this implementation may throw ScriptException, an overriding implementation may throw others
    • classify

      public void classify​(Attributes atts, boolean lazy)
      Set instance variables for text/binary and permissions treatment according to the passed Attributes.
      Parameters:
      atts - Usually a per-entry attribute set, but classify is also called by setDefaults to parse the main attributes.
      lazy - In the usual case, as soon as an entry is classified as non-text, classify can return without looking for charset information. When called by setDefaults, however, laziness is not appropriate.
    • archiveCharsetFromType

      protected boolean archiveCharsetFromType​(JarX[] type)
      Parse a Content-Type for any charset parameter.
      Parameters:
      type - tokenized Content-Type value
      Returns:
      true if the Content-Type specified a charset
    • extract

      public void extract​(JarEntry je, InputStream is) throws Exception
      Extract a single entry, performing any appropriate conversion
      Parameters:
      je - JarEntry for the current entry
      is - InputStream with the current entry content
      Throws:
      Exception - IOException for any problem involving I/O, ScriptException possible from the non-overridden path resolver, others possible in an overridden implementation
    • shovel

      public void shovel​(InputStream is, OutputStream os) throws IOException
      Copy content from an input to an output stream until end. Whether the content is shoveled as bytes, characters, or lines will be determined by instance variables that have been set by calling classify before calling this method.
      Parameters:
      is - source of input
      os - destination for output
      Throws:
      IOException - for any problem involving I/O
    • shovelBytes

      public static void shovelBytes​(InputStream is, OutputStream os) throws IOException
      Copy bytes from an input to an output stream until end. No character encoding or newline conversion applies.
      Parameters:
      is - source of input
      os - destination for output
      Throws:
      IOException - for any problem involving I/O
    • shovelText

      public void shovelText​(InputStream is, OutputStream os) throws IOException
      Copy text from an input to an output stream until end. Determines the encoding transformation to use (based on the charset content-type parameter) and whether to copy as lines (with newline conversion) or unmolested characters. text/plain is copied as lines, all other text subtypes as characters.
      Parameters:
      is - source of input
      os - destination of output
      Throws:
      IOException - for any problem involving I/O
    • shovelLines

      public void shovelLines​(InputStream is, OutputStream os) throws IOException
      Copy lines of text from an input to an output stream, applying the specified character encoding and translating newlines. This method handles the extracting case, where the named encoding is associated with the input stream (jar) and the platform default encoding with the output (local file), and the local line.separator is used to separate lines on the output. Overridden in JarX.Build to do the reverse when building a jar. To avoid silent corruption of data, this method verifies that all characters from the jar are successfully converted to the local platform's encoding.
      Parameters:
      is - the source of input
      os - destination for output
      Throws:
      IOException - for any problem involving I/O
    • shovelChars

      public void shovelChars​(InputStream is, OutputStream os) throws IOException
      Copy characters of text from an input to an output stream, applying the specified character encoding but not translating newlines. This method handles the extracting case, where the named encoding is associated with the input stream (jar) and the platform default encoding with the output (local file). Overridden in JarX.Build to do the reverse when building a jar. To avoid silent corruption of data, this method verifies that all characters from the jar are successfully converted to the local platform's encoding.
      Parameters:
      is - the source of input
      os - destination for output
      Throws:
      IOException - for any problem involving I/O
    • describeTranscoding

      public String describeTranscoding​(InputStreamReader isr, OutputStreamWriter osw)
    • structuredFieldBody

      public static JarX[] structuredFieldBody​(String field, int off)
      Lexical analyzer for structured field bodies as described in RFC822 and modified in RFC2045. Comments are processed and stored in tokens that are, at the last minute, excluded from the returned token list; only two lines would need to be changed to use this lexer in an application that wanted comments returned.
      Parameters:
      field - a header field
      off - offset to the start of the structured field body (skip the field name and colon)
      Returns:
      An array of tokens with any COMMENT tokens (for JarX purposes) excluded
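
The shovelLines and shovelChars descriptions above mention verifying that every character converts successfully. One way to get that guarantee with java.nio.charset is sketched below; it is not necessarily how JarX does it, and the class and method names are hypothetical. The key is to ask the decoder and encoder to report problems instead of silently substituting a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictTranscodeSketch {

    /** Re-encode bytes from one charset to another, failing on any loss. */
    public static byte[] transcode(byte[] input, Charset from, Charset to)
            throws CharacterCodingException {
        // REPORT makes bad input an exception rather than a '?' in the output.
        CharBuffer chars = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(input));
        ByteBuffer bytes = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        byte[] out = new byte[bytes.remaining()];
        bytes.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] euro = "\u20ac".getBytes(StandardCharsets.UTF_8);
        try {
            transcode(euro, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
            System.out.println("converted");
        } catch (CharacterCodingException e) {
            // The euro sign has no ISO-8859-1 encoding, so this branch runs.
            System.out.println("refused: " + e);
        }
    }
}
```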