Class JarX
- Direct Known Subclasses:
JarX.Build, Node
A self-extracting file is handy if your recipient might have a Java runtime environment but not the jar tool. The text conversion offered by JarX is useful if your distribution will include text files, source, documentation, scripts, etc., and your recipients have platforms with different newline conventions.
Text conversion background
There are two issues in the cross-platform delivery of text files.
- Different platforms indicate the end of a line differently. The UNIX convention uses the single character LINE FEED; the (old) Macintosh used only the CARRIAGE RETURN character, and DOS/Windows systems require every line to end with a CARRIAGE RETURN followed by a LINE FEED. If some conversion isn't done, a Windows file appears to have garbage characters at the ends of lines if moved to UNIX, or the beginnings of lines if moved to a Mac; UNIX and Mac files moved to Windows, or Mac files moved to UNIX, appear to be squished into one insanely long line. These effects can complicate viewing and editing the files, and interfere with automated processes like diff or version control.
- Different platforms may use different default character encodings. Ideally, text files within a jar should be extracted into the local encoding.
It's important to apply such transformations only to the files within the archive that are actually known to contain text. Passing binary data or class files through character and newline transformations will corrupt them.
The ZIP approach and why it loses
The popular zip format on which jar is based already has a provision for newline (but not character set) conversion. Each entry includes a text/binary bit, and the unzip program applies newline conversion while extracting, but only to the files flagged as text.
One problem, though not the fatal one, with this scheme is that there is no single convention for newlines inside the zip file. Instead, files are stored just as they are found on the source system, and a code indicating the source operating system is stored in the archive. The receiving unzip program must interpret the code and know what newline convention that operating system used, in order to convert newlines appropriately.
The fatal flaw, however, has to do with the way the text/binary bit gets set in the first place. While building the archive, the common zip programs look at statistical properties of byte frequencies in the input, and set the text bit for any entry that looks like it might be text! If a binary file happens to contain an unlucky sequence of bytes, it will be flagged as text and then silently corrupted by any unzip program that honors the text bit. That can happen, and has happened, to class files in zip archives if the recipient uses unzip -a, and causes significant misery if the package is widely distributed.
A better way
Even though the jar format is based on zip, it would be a mistake to make jar tools that rely on the zip text/binary bit, because common practice has made that bit unreliable. What's needed is a standard way for the developer to explicitly indicate the processing needed for each entry in the jar. Also, a single representation should be adopted for newlines in text files inside a jar, so an extracting program only needs to convert from that representation to the local one, and does not need to concern itself with details of the system where the jar was created.
As of JDK 1.3, Sun has extended the Jar File Specification to allow a Content-Type in the Manifest for each jar entry. The value of Content-Type is a MIME type, and with this a developer can specify exactly which entries in a jar should be treated as text. The question of a standard representation for newlines inside the jar is settled, because [RFC2046 section 4.1.1] establishes a canonical line break representation for all subtypes of the text MIME type. Therefore, correct translation of line breaks from any platform to any platform can be achieved if a jar-building program just converts from its local convention to the canonical CRLF form, and a jar extraction program just converts the canonical to its own local form. Neither program needs to know anything about the other environment.
Finally, the charset parameter of the text type allows explicit specification of the character encoding used in a jar entry, and the extracting program can automatically convert into the encoding used on the local system. (But see Call to action below.)
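The two-step newline translation described above can be sketched in a few lines of Java. This is only an illustration of the idea, not the code JarX uses; the class name CanonicalNewlines and its two methods are invented for the example.

```java
class CanonicalNewlines {
    /** Builder side: convert LF, CR, or CRLF line endings to the canonical CRLF. */
    static String toCanonical(String text) {
        // CRLF is listed first so an existing CRLF is not treated as two breaks.
        return text.replaceAll("\r\n|\r|\n", "\r\n");
    }

    /** Extractor side: convert canonical CRLF to whatever the local platform uses. */
    static String toLocal(String canonical, String localSeparator) {
        return canonical.replace("\r\n", localSeparator);
    }

    public static void main(String[] args) {
        String unixText = "first\nsecond\n";
        String inJar = toCanonical(unixText);          // what would be stored in the jar
        String local = toLocal(inJar, System.lineSeparator());
        System.out.println(inJar.equals("first\r\nsecond\r\n"));
        System.out.println(local.length());
    }
}
```

Neither method needs to know which platform the other side runs on, which is the point of agreeing on one canonical form inside the jar.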
What JarX Does
Content-Type entries in a Manifest were introduced in Java 1.3 but are compatible with earlier jar specifications; a jar file containing such entries can be processed without any trouble by any jar tool compliant with the old or new standard. However, there is not yet a full jar tool available that will honor the content types and do automatic transformation of text entries. To fill the need until that functionality is added to the widely-available jar tools, JarX is available now.
JarX.Build produces a jar, working from a manifest file prepared by the developer. Entries with any text type will be translated from the local encoding into the specified charset if given, and entries with the specific type text/plain will have their line endings converted to the CRLF canonical form. Line endings are left alone for all other subtypes of text, but this decision is open to comment.
The file produced by JarX.Build is a fully compliant jar and can be unpacked by any jar or unzip tool, but current tools will not automatically convert the text files to the local conventions. By including the single class file JarX.class in the jar, a developer produces a self-extracting archive that can be executed to unpack itself on any Java 1.6 or later virtual machine, performing all automatic conversions and requiring no jar tool at all.
Building a Jar
To build a jar file, first prepare the manifest, using any text editor or, more likely, a script. Include a Name: entry for every file to be included in the jar. JarX.Build archives only the files named in the manifest. Be sure to include Manifest-Version: 1.0 as the first line of the manifest; JarX.Build does not do it for you. To make the jar self-extracting, make the next line Main-Class: org.gjt.cuspy.JarX and be sure to include a Name: entry for org/gjt/cuspy/JarX.class.
Add an appropriate Content-Type: line after the Name: line for every entry that needs one. JarX itself only distinguishes the text types from nontext (everything else), and treats a missing Content-Type: as nontext, so for purposes of JarX you only need to add content types for text files. For other purposes you may wish to include the types of other entries as well. In the simplest case, just omit content types for your non-text files, and add Content-Type: text/plain; charset=UTF-8 for files that you want auto-converted. Then give the command
java org.gjt.cuspy.JarX$Build foo.jar manifest
if manifest is the name of your prepared manifest file and foo.jar names the jar you want to create.
The order of files in the jar will be the order of their names in the manifest.
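Since the manifest is more likely to come from a script than from an editor, here is a minimal sketch of generating one in Java. It assumes short entry names (it does not wrap lines at the 72-byte limit of the jar specification), and the class name ManifestSketch and the file names README.txt and lib/example.bin are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ManifestSketch {
    /**
     * Build manifest text for the given entries, in iteration order.
     * A null map value means "no Content-Type line", i.e. a non-text entry.
     */
    static String build(Map<String, String> entries, boolean selfExtracting) {
        StringBuilder sb = new StringBuilder();
        sb.append("Manifest-Version: 1.0\n");       // must be the first line
        if (selfExtracting) {
            sb.append("Main-Class: org.gjt.cuspy.JarX\n");
        }
        for (Map.Entry<String, String> e : entries.entrySet()) {
            sb.append('\n');                        // blank line starts a per-entry section
            sb.append("Name: ").append(e.getKey()).append('\n');
            if (e.getValue() != null) {
                sb.append("Content-Type: ").append(e.getValue()).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, which becomes the order in the jar.
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("org/gjt/cuspy/JarX.class", null);
        entries.put("README.txt", "text/plain; charset=UTF-8");
        entries.put("lib/example.bin", null);
        System.out.print(build(entries, true));
    }
}
```

The text is assembled by hand rather than with java.util.jar.Manifest because that class keeps per-entry sections in a map with no guaranteed order, and the order of names in the manifest matters here.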
Special manifest attributes
As of 2016, JarX recognizes some special manifest attributes:
- _JarX_CharsetInArchive
- As a per-entry attribute, identifies the character set of the associated text member as stored in the archive. This is entirely equivalent to the earlier method using ;charset= on the Content-Type attribute, which JarX still supports, but has not been widely adopted. As a main attribute, sets a default for any text members without a per-entry value.
- _JarX_CharsetWhenUnpacked
- As a per-entry attribute, identifies the character set of the associated text member when not in the archive. At Build time, the member will be transcoded from this charset (instead of the platform's default) to the specified InArchive charset, and, on extraction, will be transcoded back to this charset regardless of the platform's default encoding. This attribute can be used for files conforming to specifications that define a fixed encoding. In other cases, omitting this attribute allows the member to be extracted into the receiving platform's default charset. As a main attribute, sets a default for text members without a per-entry value.
- _JarX_Permissions
- As a per-entry attribute, declares permissions to apply to the extracted file. (At present, not applied to directories.) Only the Java SE 6 File permissions are supported, a small subset of what most platforms support. The value is a comma-separated list of usage=bywhom items, where usage can be read, write, or execute and bywhom can be none, owner, or all. As a main attribute, sets a default for members without a per-entry attribute. For any usage that is left unspecified, no File method will be called to change that permission, so the system's defaults will apply.
- _JarX_PathResolver
- Only recognized as a main attribute, this specifies a script that JarX will invoke for every archive member, with the following bindings in scope:
  - properties
  - The Java system Properties object.
  - storedPath
  - The full pathname of the member, exactly as stored in the archive.
  - platformPath
  - The full pathname after only replacing the / separator character with the platform's file.separator if different.
  - computedPath
  - Initially the same as platformPath. If the script stores a new value in computedPath, the member will be extracted to that full path.
The value of this attribute is the MIME type of the scripting language, such as application/javascript, followed by at least one QUOTEDSTRING. RFC822 uses the double-quote for this purpose, and backslash to escape it when needed, which also means you must double any backslash intended for the script. Additional QUOTEDSTRINGs simply append to the script. The RFC822 line-continuation rule can be exploited by supplying the script as multiple quoted strings, one per line, each indented by a space. The strings are appended with nothing in between (so, the continuation newlines do not become newlines in the script), but a / can appear between any two quoted strings to insert an explicit newline in the script. In addition to whatever comment syntax is allowed in the scripting language, RFC822 comments (marked by parentheses, and nestable) are allowed outside of the quoted strings.
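To make the _JarX_Permissions value format concrete, the sketch below parses a usage=bywhom list and applies it with the java.io.File setters. It is an illustration under stated assumptions rather than JarX's own code: the class name PermissionsSketch is invented, keywords are assumed to be lowercase, and the boolean results of the File setters (which can report failure on some platforms) are ignored.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

class PermissionsSketch {
    /** Parse "usage=bywhom, usage=bywhom" into a usage -> bywhom map. */
    static Map<String, String> parse(String spec) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String item : spec.split(",")) {
            String[] parts = item.trim().split("=");
            if (parts.length != 2
                    || !parts[0].trim().matches("read|write|execute")
                    || !parts[1].trim().matches("none|owner|all")) {
                throw new IllegalArgumentException("bad permission spec: " + item);
            }
            result.put(parts[0].trim(), parts[1].trim());
        }
        return result;
    }

    /** Apply the parsed permissions; usages absent from the map are never touched. */
    static void apply(File f, Map<String, String> perms) {
        for (Map.Entry<String, String> e : perms.entrySet()) {
            boolean allow = !e.getValue().equals("none");
            boolean ownerOnly = e.getValue().equals("owner");
            switch (e.getKey()) {
                case "read":
                    f.setReadable(false, false);              // clear for everyone first
                    if (allow) f.setReadable(true, ownerOnly);
                    break;
                case "write":
                    f.setWritable(false, false);
                    if (allow) f.setWritable(true, ownerOnly);
                    break;
                case "execute":
                    f.setExecutable(false, false);
                    if (allow) f.setExecutable(true, ownerOnly);
                    break;
                default:
                    break;                                    // unreachable after parse()
            }
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("jarx-demo", ".txt");
        f.deleteOnExit();
        apply(f, parse("read=all, write=owner"));             // execute left at OS default
        System.out.println("readable=" + f.canRead() + " writable=" + f.canWrite());
    }
}
```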
Alternative to ScriptEngine for a path resolver
With the removal of Nashorn in Java 15, leaving no scripting language that can be assumed present in the Java runtime, a script in the manifest may no longer be the simplest way to customize the resolution of path names when extracting. This class has been refactored now to expose two methods, prepareResolver and resolve, easily overridden in a subclass. The value of the _JarX_PathResolver main attribute is passed to prepareResolver as a string (so it can be parsed in any way useful to the subclass, not necessarily as described above, or ignored), and resolve is passed the stored path and platform path, and returns the platform path unchanged or a replacement. A self-extracting jar with resolution can be made without depending on any script engine, by placing two classes in the jar, JarX and the subclass, and naming the subclass as the jar's Main-Class. It needs a main method that simply instantiates the class and calls extract().
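The shape of such a subclass might look like the sketch below. So that the example compiles on its own, it extends a hypothetical stand-in class, ResolverHost, that has only the two hooks; a real resolver would extend JarX instead. The policy shown (treat the attribute value as a directory prefix, skip everything under META-INF/) is also invented for illustration.

```java
import java.io.File;

/** Hypothetical stand-in for JarX, reduced to the two overridable hooks. */
class ResolverHost {
    public void prepareResolver(String v) throws Exception { }
    public String resolve(String orig, String plat) throws Exception { return plat; }
}

class PrefixingResolver extends ResolverHost {
    private String prefix = "";

    /** Treat the attribute value as a plain directory prefix rather than a script. */
    @Override
    public void prepareResolver(String v) {
        prefix = v.trim();
    }

    /** Return a replacement path, or null to suppress extracting the entry. */
    @Override
    public String resolve(String orig, String plat) {
        if (orig.startsWith("META-INF/")) {
            return null;                               // stored paths are always /-separated
        }
        return prefix.isEmpty() ? plat : prefix + File.separator + plat;
    }

    public static void main(String[] args) {
        PrefixingResolver r = new PrefixingResolver();
        r.prepareResolver("unpacked");
        System.out.println(r.resolve("docs/a.txt", "docs" + File.separator + "a.txt"));
        System.out.println(r.resolve("META-INF/MANIFEST.MF",
                "META-INF" + File.separator + "MANIFEST.MF"));
    }
}
```

In a subclass of the real JarX, main would instead instantiate the subclass and call extract(), as described above.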
Extracting a jar
The command java -jar foo.jar is all it takes to extract a jar. The Main-Class entry in the manifest identifies the entry point of JarX so it does not need to be specified.
Call to action
At the moment, Sun's Jar File Specification contains a mistake in the description of a content type that could lead to implementations that reject valid content types. Squash this bug before it bites: log on to the Java Developer Connection (it's free) and cast one, two, or all three of your Bug Votes for Bug #4310708.
Miscellany
This class is a little sloppy and relatively slow, especially the Build side when converting plain text files. The idea for JarX is a natural outgrowth of the Java 1.3 manifest standard and I have suggested that the functionality of JarX be added into the widely available jar tools. If Sun takes the suggestion then the functionality of JarX will soon be provided by nice fast optimized tools and it won't be necessary to spend a lot of time polishing JarX.
Error handling is roughly nonexistent. JarX is careful to avoid silent corruption of data, even verifying that all character encoding calls are successful, but makes no attempt to be graceful about errors or surprises. If something doesn't work the likely result is a one line message and abrupt exit, or an uncaught exception and stack trace.
The coding style is a little contrived just to arrange it so JarX.class is the only file needed in the jar to make it self-extracting. In particular the JarX class is also written to serve as the class of tokens returned by the structured-field-body lexer, to avoid introducing a second class. Weird, perhaps, but harmless weird.
- Version:
- $Id$
- Author:
- Chapman Flack
-
Nested Class Summary
Nested Classes (modifier and type, class, description):
- static class JarX.Build: Subclass of JarX containing the code needed to build jars.
-
Field Summary
Fields (modifier and type, field, description):
- protected static final String ALL
- final Attributes.Name ARCHIVE_CHARSET: Attribute name for specifying the in-archive charset.
- protected Charset archiveCharset: Charset (in archive) of the entry being processed.
- static final short ATOM
- protected static final String BYTES
- protected static final String CHARACTERS
- static final short COMMENT
- protected Charset defaultArchiveCharset: As for archiveCharset, but set from main attributes (default UTF-8).
- protected String defaultExecutePermission: As for executePermission but set from main attributes, null if not present.
- protected String defaultReadPermission: As for readPermission but set from main attributes, null if not present.
- protected String defaultTreatment: As for treatment, but set from main attributes (or BYTES if not present).
- protected Charset defaultUnpackedCharset: As for unpackedCharset, but set from main attributes or platform default.
- protected String defaultWritePermission: As for writePermission but set from main attributes, null if not present.
- static final short DOMAINLITERAL
- protected String executePermission: Execute permission to be set on the file.
- protected static final String LINES
- protected Attributes mainAttributes: Main attributes saved from the manifest (which must be seen early).
- static final String manifestCode: The (fixed) encoding used for manifest content
- static final String manifestName: Name of the manifest file as stored in the jar
- protected static final String NONE
- protected static final String OWNER
- final Attributes.Name PATHRESOLVER: Main attribute to specify a JSR223 script to control extracted names.
- final Attributes.Name PERMISSIONS: Permissions (only as supported in java.io.File for SE 6) spec *(, spec) where spec is action=whom, action is read, write, or execute, and whom is none, owner, or all.
- static final short QUOTEDSTRING
- protected String readPermission: Read permission to be set on the file.
- protected ScriptEngine resolverEngine: Script engine to run the name resolver script, if any.
- protected String resolverScript: The name resolver script, if any.
- protected String treatment: How to treat the entry being processed: bytes, characters, lines.
- static final short TSPECIAL
- short type: Token type, when JarX objects are used to return content type tokens
- final Attributes.Name UNPACKED_CHARSET: Attribute name for specifying the when-unpacked charset.
- protected Charset unpackedCharset: Charset when unpacked of the entry being processed.
- value: Token text when JarX objects are used to return content type tokens
- protected String writePermission: Write permission to be set on the file.
-
Constructor Summary
Constructors -
Method Summary
Methods (modifier and type, method, description):
- protected boolean archiveCharsetFromType(JarX[] type): Parse a Content-Type for any charset parameter.
- void classify(Attributes atts, boolean lazy): Set instance variables for text/binary and permissions treatment according to the passed Attributes.
- void extract(): Find the jar I was loaded from and extract all entries except my own class file.
- void extract(JarEntry je, InputStream is): Extract a single entry, performing any appropriate conversion
- boolean holds: True if this JarX object represents a token of one of the given types and its value equals the given string.
- boolean holdsIgnoreCase(String value, short... type): True if this JarX object represents a token of one of the given types and its value equals the given string, case-insensitively.
- boolean is(short... type): True if this JarX object represents a token of one of the given types.
- static void main: The entry point for extracting.
- boolean notMe: True if the passed name is not the in-jar name of this class or related classes that should not be extracted.
- void prepareResolver: Prepare a resolver of pathnames, given the value of the PATHRESOLVER main attribute.
- resolve: Called with every path to be extracted; returns a possibly-corrected path.
- void setDefaults(Attributes mainAttributes): Examine the main attributes to set any defaults.
- void shovel(InputStream is, OutputStream os): Copy content from an input to an output stream until end.
- static void shovelBytes(InputStream is, OutputStream os): Copy bytes from an input to an output stream until end.
- void shovelChars(InputStream is, OutputStream os): Copy characters of text from an input to an output stream, applying the specified character encoding but not translating newlines.
- void shovelLines(InputStream is, OutputStream os): Copy lines of text from an input to an output stream, applying the specified character encoding and translating newlines.
- void shovelText(InputStream is, OutputStream os): Copy text from an input to an output stream until end.
- static JarX[] structuredFieldBody(String field, int off)
-
Field Details
-
treatment
How to treat the entry being processed: bytes, characters, lines. Used only in the JarX instance created by main(). Set by classify(). Only the exact String instances BYTES, CHARACTERS, LINES are to be used. -
BYTES
- See Also:
-
CHARACTERS
- See Also:
-
LINES
- See Also:
-
archiveCharset
Charset (in archive) of the entry being processed. Used only in the JarX instance created by main(). Set by classify(). -
unpackedCharset
Charset when unpacked of the entry being processed. Used only in the JarX instance created by main(). Set by classify(). -
readPermission
Read permission to be set on the file. Only the final Strings NONE, OWNER, or ALL are to be used, or null, in which case no explicit setting is made and the OS defaults apply. -
writePermission
Write permission to be set on the file. Only the final Strings NONE, OWNER, or ALL are to be used, or null, in which case no explicit setting is made and the OS defaults apply. -
executePermission
Execute permission to be set on the file. Only the final Strings NONE, OWNER, or ALL are to be used, or null, in which case no explicit setting is made and the OS defaults apply. -
NONE
- See Also:
-
OWNER
- See Also:
-
ALL
- See Also:
-
defaultTreatment
As for treatment, but set from main attributes (or BYTES if not present). -
defaultArchiveCharset
As for archiveCharset, but set from main attributes (default UTF-8). -
defaultUnpackedCharset
As for unpackedCharset, but set from main attributes or platform default. -
defaultReadPermission
As for readPermission but set from main attributes, null if not present. -
defaultWritePermission
As for writePermission but set from main attributes, null if not present. -
defaultExecutePermission
As for executePermission but set from main attributes, null if not present. -
resolverEngine
Script engine to run the name resolver script, if any. -
resolverScript
The name resolver script, if any. -
ARCHIVE_CHARSET
Attribute name for specifying the in-archive charset. The Java powers that be didn't go for Bug #4310708 so there needs to be a dedicated manifest key for this (though JarX will still honor ;charset= on the Content-Type too). -
UNPACKED_CHARSET
Attribute name for specifying the when-unpacked charset. This was not in the original JarX; the platform default was always used, and still is if this attribute is not present. -
PERMISSIONS
Permissions (only as supported in java.io.File for SE 6) spec *(, spec) where spec is action=whom, action is read, write, or execute, and whom is none, owner, or all. -
PATHRESOLVER
Main attribute to specify a JSR223 script to control extracted names. -
mainAttributes
Main attributes saved from the manifest (which must be seen early). -
type
public short type
Token type, when JarX objects are used to return content type tokens
-
value
Token text when JarX objects are used to return content type tokens -
ATOM
public static final short ATOM
Token types from the structured field body lexer defined in RFC822 as modified in RFC2045. Also state numbers for the automaton in structuredFieldBody.
- See Also:
-
COMMENT
public static final short COMMENT
- See Also:
-
DOMAINLITERAL
public static final short DOMAINLITERAL
- See Also:
-
QUOTEDSTRING
public static final short QUOTEDSTRING
- See Also:
-
TSPECIAL
public static final short TSPECIAL
- See Also:
-
manifestName
Name of the manifest file as stored in the jar
- See Also:
-
manifestCode
The (fixed) encoding used for manifest content
- See Also:
-
-
Constructor Details
-
JarX
public JarX()
Public constructor for an application using JarX to unpack jars.
-
JarX
Constructor for JarX objects used as tokens returned by the lexer.
- Parameters:
t - the type of this token
v - the corresponding text (with delimiters removed and backslashes resolved for quoted strings, domain text, and comments)
-
-
Method Details
-
is
public boolean is(short... type)
True if this JarX object represents a token of one of the given types.
- Parameters:
type - allowable types
- Returns:
- as titled
-
holds
True if this JarX object represents a token of one of the given types and its value equals the given string.
- Parameters:
value - string value for comparison
type - allowable types
- Returns:
- as titled
-
holdsIgnoreCase
True if this JarX object represents a token of one of the given types and its value equals the given string, case-insensitively.
- Parameters:
value - string value for comparison
type - allowable types
- Returns:
- as titled
-
main
The entry point for extracting.
- Parameters:
args - argument list
- Throws:
Exception - if anything doesn't work, punt
-
extract
Find the jar I was loaded from and extract all entries except my own class file.
- Throws:
Exception - if anything doesn't work, punt
-
notMe
True if the passed name is not the in-jar name of this class or related classes that should not be extracted.
If not overridden, this method returns false only for names matching the class of this or any ancestral superclass. Interfaces are not considered. A subclass could apply a different policy.
-
setDefaults
Examine the main attributes to set any defaults. Includes loading the required script engine if a name resolver script is given.
- Parameters:
mainAttributes - as obtained from the manifest
-
prepareResolver
Prepare a resolver of pathnames, given the value of the PATHRESOLVER main attribute.
If not overridden in a subclass, this method parses it as a MIME type and script as described in the class comments, loads a ScriptEngine for the MIME type, and saves references to the engine in resolverEngine and the script in resolverScript.
- Parameters:
v - value of the _JarX_PathResolver main attribute
- Throws:
Exception - this implementation throws no checked exceptions, but an overriding implementation may
-
resolve
Called with every path to be extracted; returns a possibly-corrected path.
If not overridden in a subclass, this method returns plat unchanged if no resolverScript has been set, and otherwise invokes the script with storedPath bound to orig, platformPath and computedPath both bound to plat, then returns the value bound to computedPath when the script has returned.
- Parameters:
orig - The path as stored in the archive, always /-separated
plat - The path after only replacing / with the platform separator
- Returns:
- plat unchanged, or a corrected location for extracting the entry, or null to suppress extracting the entry
- Throws:
Exception - this implementation may throw ScriptException, an overriding implementation may throw others
-
classify
Set instance variables for text/binary and permissions treatment according to the passed Attributes.
- Parameters:
atts - Usually a per-entry attribute set, but classify is also called by setDefaults to parse the main attributes.
lazy - In the usual case, as soon as an entry is classified as non-text, classify can return without looking for charset information. When called by setDefaults, however, laziness is not appropriate.
-
archiveCharsetFromType
Parse a Content-Type for any charset parameter.
- Parameters:
type - tokenized Content-Type value
- Returns:
- true if the Content-Type specified a charset
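For readers who want a feel for what finding a charset parameter involves, here is a deliberately simplified sketch that works on the raw Content-Type string. It is not the method documented here, which operates on tokens from the structuredFieldBody lexer; the class and method names are invented, and the sketch ignores RFC822 comments, backslash escapes, and semicolons inside quoted strings.

```java
import java.util.Locale;

class ContentTypeCharset {
    /** Return the charset parameter of a Content-Type value, or null if absent. */
    static String charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {       // parts[0] is type/subtype
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                continue;                              // not a name=value parameter
            }
            String name = parts[i].substring(0, eq).trim();
            String value = parts[i].substring(eq + 1).trim();
            if (name.equalsIgnoreCase("charset")) {
                // Strip one pair of surrounding double quotes, if present.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    /** True if the top-level MIME type is text, whatever the subtype. */
    static boolean isText(String contentType) {
        String type = contentType.split(";")[0].trim();
        return type.toLowerCase(Locale.ROOT).startsWith("text/");
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=UTF-8"));
        System.out.println(isText("application/octet-stream"));
    }
}
```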
-
extract
Extract a single entry, performing any appropriate conversion
- Parameters:
je - JarEntry for the current entry
is - InputStream with the current entry content
- Throws:
Exception - IOException for any problem involving I/O, ScriptException possible from the non-overridden path resolver, others possible in an overridden implementation
-
shovel
Copy content from an input to an output stream until end. Whether the content is shoveled as bytes, characters, or lines will be determined by instance variables that have been set by calling classify before calling this method.
- Parameters:
is - source of input
os - destination for output
- Throws:
IOException - for any problem involving I/O
-
shovelBytes
Copy bytes from an input to an output stream until end. No character encoding or newline conversion applies.
- Parameters:
is - source of input
os - destination for output
- Throws:
IOException - for any problem involving I/O
-
shovelText
Copy text from an input to an output stream until end. Determines the encoding transformation to use (based on the charset content-type parameter) and whether to copy as lines (with newline conversion) or unmolested characters. text/plain is copied as lines, all other text subtypes as characters.
- Parameters:
is - source of input
os - destination of output
- Throws:
IOException - for any problem involving I/O
-
shovelLines
Copy lines of text from an input to an output stream, applying the specified character encoding and translating newlines. This method handles the extracting case, where the named encoding is associated with the input stream (jar) and the platform default encoding with the output (local file), and the local line.separator is used to separate lines on the output. Overridden in Build to do the reverse when building a jar. To avoid silent corruption of data, this method verifies that all characters from the jar are successfully converted to the local platform's encoding.
- Parameters:
is - the source of input
os - destination for output
- Throws:
IOException - for any problem involving I/O
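One way to get the "no silent corruption" guarantee in Java is to configure the charset decoder and encoder to report errors instead of substituting replacement characters. The sketch below shows that technique on a whole byte array; it is an assumption about how such a check can be written, not a copy of this method, and the class name StrictTranscode is invented.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictTranscode {
    /** Transcode bytes between charsets, failing on any input that cannot be converted. */
    static byte[] transcode(byte[] in, Charset from, Charset to)
            throws CharacterCodingException {
        // REPORT makes bad input raise an exception instead of yielding '?' or U+FFFD.
        CharBuffer chars = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(in));
        ByteBuffer out = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }

    public static void main(String[] args) {
        byte[] euro = "\u20ac".getBytes(StandardCharsets.UTF_8);
        try {
            transcode(euro, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
            System.out.println("converted");
        } catch (CharacterCodingException e) {
            // The euro sign has no ISO-8859-1 representation, so this branch runs.
            System.out.println("refused: " + e);
        }
    }
}
```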
-
shovelChars
Copy characters of text from an input to an output stream, applying the specified character encoding but not translating newlines. This method handles the extracting case, where the named encoding is associated with the input stream (jar) and the platform default encoding with the output (local file). Overridden in Build to do the reverse when building a jar. To avoid silent corruption of data, this method verifies that all characters from the jar are successfully converted to the local platform's encoding.
- Parameters:
is - the source of input
os - destination for output
- Throws:
IOException - for any problem involving I/O
-
describeTranscoding
-
structuredFieldBody
Lexical analyzer for structured field bodies as described in RFC822 and modified in RFC2045. Comments are processed and stored in tokens that are, at the last minute, excluded from the returned token list; only two lines would need to be changed to use this lexer in an application that wanted comments returned.
- Parameters:
field - a header field
off - offset to the start of the structured field body (skip the field name and colon)
- Returns:
- An array of tokens with any COMMENT tokens (for JarX purposes) excluded
-