Saxonica.com

Implementing a collating sequence

Collations used for comparing strings can be specified by means of a URI. A collation URI may be used as an argument to many of the standard functions, and also as an attribute of xsl:sort in XSLT, and in the order by clause of a FLWOR expression in XQuery.

Saxon provides a range of mechanisms for binding collation URIs. The language specifications simply say that collations used in sorting and in string-comparison functions are identified by a URI, and leave it up to the implementation how these URIs are defined.

There is one predefined collation that cannot be changed: the Unicode Codepoint Collation defined in the W3C specifications, currently http://www.w3.org/2005/04/xpath-functions/collation/codepoint. It collates strings based on the integer values assigned by Unicode to each character. For example, "ah!" sorts before "ah?" because the Unicode codepoints for "ah!" are (97, 104, 33) while the codepoints for "ah?" are (97, 104, 63).
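The codepoint ordering can be checked with a few lines of plain Java. This is a minimal sketch, not Saxon's implementation; the class name CodepointOrder and its compare method are made up for illustration. It walks both strings one code point at a time rather than calling String.compareTo, which compares UTF-16 code units and can therefore order characters outside the Basic Multilingual Plane differently.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Saxon.
final class CodepointOrder {

    // Compares two strings by Unicode code point, one code point at a time.
    static int compare(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("ah!".codePoints().toArray())); // [97, 104, 33]
        System.out.println(Arrays.toString("ah?".codePoints().toArray())); // [97, 104, 63]
        System.out.println(compare("ah!", "ah?") < 0);                     // true
    }
}
```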

In addition, by default, Saxon allows a collation URI to take the form http://saxon.sf.net/collation?keyword=value;keyword=value;.... The query parameters in the URI can be separated either by ampersands or semicolons, but semicolons are usually more convenient. The keywords available are as follows:
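To make the URI format concrete, the following sketch splits such a URI into its keyword/value pairs, accepting either separator. It is an illustration only and says nothing about how Saxon parses the URI internally; the class CollationUriParams and its parse method are hypothetical names. One practical reason semicolons tend to be more convenient is that an ampersand written inside an XML document such as a stylesheet has to be escaped as &amp;.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Saxon.
final class CollationUriParams {

    static final String PREFIX = "http://saxon.sf.net/collation?";

    // Splits the query part on ';' or '&' and returns the keyword/value
    // pairs in the order in which they appear.
    static Map<String, String> parse(String uri) {
        if (!uri.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a Saxon collation URI: " + uri);
        }
        Map<String, String> params = new LinkedHashMap<>();
        String query = uri.substring(PREFIX.length());
        for (String pair : query.split("[;&]")) {
            if (pair.isEmpty()) {
                continue; // tolerate a trailing or doubled separator
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected keyword=value but found: " + pair);
            }
            params.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        // prints {lang=en-US, strength=primary}
        System.out.println(parse("http://saxon.sf.net/collation?lang=en-US;strength=primary"));
    }
}
```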

Keyword: class
Values: fully-qualified Java class name of a class that implements java.util.Comparator.
Effect: This parameter should not be combined with any other parameter. An instance of the requested class is created, and is used to perform the comparisons. Note that if the collation is to be used in functions such as contains() and starts-with(), this class must also be a java.text.RuleBasedCollator. This approach allows a user-defined collation to be implemented in Java.

Keyword: lang
Values: any value allowed for xml:lang, for example en-US for US English.
Effect: This is used to find the collation appropriate to a Java locale. The collation may be further tailored using the parameters strength and decomposition.

Keyword: strength
Values: primary, secondary, tertiary, or identical.
Effect: Indicates the differences that are considered significant when comparing two strings. With the JDK collators, A/B is a primary difference; a/ä is a secondary difference; A/a is a tertiary difference (though this varies by language). So with strength=primary, both A=a and a=ä are true; with strength=secondary, a=ä is false but A=a is true; with strength=tertiary, A=a is false.

Keyword: decomposition
Values: none, standard, or full.
Effect: Indicates how the collator handles Unicode composed characters. See the JDK documentation for java.text.Collator for details.
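The effect of the lang, strength and decomposition keywords can be explored directly against the JDK, since these values correspond to settings on java.text.Collator. The sketch below is a hedged illustration: the helper collatorFor is hypothetical, and the mapping of none, standard and full onto the JDK's NO_DECOMPOSITION, CANONICAL_DECOMPOSITION and FULL_DECOMPOSITION constants is an assumption rather than something stated here. Running it shows which differences each strength level treats as significant on the JDK in use.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration class; not part of Saxon.
final class StrengthDemo {

    // Builds a JDK Collator from the values of the lang, strength and
    // decomposition keywords. The decomposition mapping is an assumption.
    static Collator collatorFor(String lang, String strength, String decomposition) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(lang));
        switch (strength) {
            case "primary":   collator.setStrength(Collator.PRIMARY); break;
            case "secondary": collator.setStrength(Collator.SECONDARY); break;
            case "tertiary":  collator.setStrength(Collator.TERTIARY); break;
            case "identical": collator.setStrength(Collator.IDENTICAL); break;
            default: throw new IllegalArgumentException("Unknown strength: " + strength);
        }
        switch (decomposition) {
            case "none":     collator.setDecomposition(Collator.NO_DECOMPOSITION); break;
            case "standard": collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION); break;
            case "full":     collator.setDecomposition(Collator.FULL_DECOMPOSITION); break;
            default: throw new IllegalArgumentException("Unknown decomposition: " + decomposition);
        }
        return collator;
    }

    public static void main(String[] args) {
        String[] strengths = {"primary", "secondary", "tertiary"};
        for (String s : strengths) {
            Collator c = collatorFor("en-US", s, "none");
            // \u00E4 is a-with-diaeresis, written as an escape to avoid encoding problems.
            System.out.println(s
                    + ": A=a " + (c.compare("A", "a") == 0)
                    + ", a=\u00E4 " + (c.compare("a", "\u00E4") == 0)
                    + ", A=B " + (c.compare("A", "B") == 0));
        }
    }
}
```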

This format of URI, http://saxon.sf.net/collation?keyword=value;keyword=value;..., is handled by Saxon's default CollationURIResolver. It is possible to replace or supplement this mechanism by registering a user-written CollationURIResolver. This must be an implementation of the interface net.sf.saxon.sort.CollationURIResolver, which requires only a single method, resolve(), to be implemented. The result of the method is in general a Comparator, though if the collation is to be used in functions such as contains(), which match parts of a string rather than the whole string, then the result must also be an instance of java.text.Collator.

A user-written CollationURIResolver is registered with the Configuration object, either directly or, in the case of XSLT, by using the JAXP setAttribute() method on the TransformerFactory (the relevant property name is FeatureKeys.COLLATION_URI_RESOLVER). The resolver then applies to all stylesheets and queries compiled and executed under that configuration.

In addition, the APIs provided for executing XPath and XQuery expressions allow named collations to be registered by the calling application, as part of the static context.

For XQuery, the class StaticQueryContext also allows collations to be registered directly by individual names. The system attempts to resolve a URI using these directly-registered names before it invokes the CollationURIResolver.
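The lookup order described here, directly registered names first and the resolver second, can be pictured with a small stand-alone model. This sketch deliberately does not use the Saxon classes: CollationLookup, its register and resolve methods, and the example.com URIs are all hypothetical, and a java.util.function.Function stands in for the CollationURIResolver.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of the lookup order; not the Saxon API.
final class CollationLookup {

    private final Map<String, Comparator<String>> named = new HashMap<>();
    private final Function<String, Comparator<String>> fallback;

    // The fallback function plays the role of the collation URI resolver.
    CollationLookup(Function<String, Comparator<String>> fallback) {
        this.fallback = fallback;
    }

    // Registers a collation directly under an individual name.
    void register(String uri, Comparator<String> collation) {
        named.put(uri, collation);
    }

    // Directly registered names win; the resolver is consulted only
    // when no registered name matches.
    Comparator<String> resolve(String uri) {
        Comparator<String> direct = named.get(uri);
        return direct != null ? direct : fallback.apply(uri);
    }

    public static void main(String[] args) {
        CollationLookup lookup = new CollationLookup(uri -> Comparator.naturalOrder());
        lookup.register("http://example.com/collation/ignore-case", String.CASE_INSENSITIVE_ORDER);

        // Found among the registered names: case is ignored, so the result is 0.
        System.out.println(lookup.resolve("http://example.com/collation/ignore-case").compare("ABC", "abc"));
        // Not registered: falls through to the resolver stand-in, which is case-sensitive.
        System.out.println(lookup.resolve("http://example.com/collation/other").compare("ABC", "abc") < 0);
    }
}
```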

In Saxon XSLT stylesheets, collations may also be described using a saxon:collation element as a top-level declaration in the stylesheet. In this case the value of the name attribute of the saxon:collation element may be used as a collation URI. There is no constraint on the form this URI takes; indeed, there is no requirement that it be a legal URI. See saxon:collation for more details.
