Zebra is designed to support a wide range of data management applications. The system can be configured to handle virtually any kind of structured data. Each record in the system is associated with a record schema, which lends context to the data elements of the record. Any number of record schemas can coexist in the system. Although it may be wise to use only a single schema within one database, the system imposes no such restriction.
The record model described in this chapter applies to the fundamental, structured record type grs, as introduced in section Record Types.
Records pass through three different states during processing in the system.
As mentioned earlier, Zebra places few restrictions on the type of data that you can index and manage. Generally, whatever the form of the data, it is parsed by an input filter specific to that format, and turned into an internal structure that Zebra knows how to handle. This process takes place whenever the record is accessed - for indexing and retrieval.
The RecordType parameter in the zebra.cfg file, or the -t option to the indexer, tells Zebra how to process input records. Two basic types of processing are available: raw text and structured data. Raw text is just that, and it is selected by providing the argument text to Zebra. Structured records are all handled internally using the basic mechanisms described in the subsequent sections. Zebra can read structured records in many different formats. How this is done is governed by additional parameters after the "grs" keyword, separated by "." characters.
The following basic subtypes of the grs type are currently available:
grs.sgml: This is the canonical input format, described below. It is a simple SGML-like syntax.
grs.regx.filter: This enables a user-supplied input filter. The mechanisms of these filters are described below.
grs.tcl.filter: This enables a user-supplied input filter with Tcl rules (only available if Zebra is compiled with Tcl support).
grs.marc.abstract syntax: This allows Zebra to read records in the ISO2709 (MARC) encoding standard. In this case, the last parameter, abstract syntax, names the .abs file (see below) which describes the specific MARC structure of the input record as well as the indexing rules.
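The dotted record-type convention can be illustrated with a short sketch. The Python function below is hypothetical and not part of Zebra; it assumes that only the first two "." characters act as separators, so that everything after them is passed on as a single argument.

```python
def split_record_type(record_type):
    """Split e.g. 'grs.regx.usenet' into (type, subtype, argument)."""
    # Assumption: at most two dots are significant; the rest is the argument.
    parts = record_type.split(".", 2)
    # Pad with None so that 'text' and 'grs.sgml' also yield three fields.
    parts += [None] * (3 - len(parts))
    return tuple(parts)


print(split_record_type("text"))
print(split_record_type("grs.sgml"))
print(split_record_type("grs.regx.usenet"))  # 'usenet' is a made-up filter name
```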
Although input data can take any form, it is sometimes useful to describe the record processing capabilities of the system in terms of a single, canonical input format that gives access to the full spectrum of structure and flexibility in the system. In Zebra, this canonical format is an "SGML-like" syntax.
To use the canonical format, specify grs.sgml as the record type.
Consider a record describing an information resource (such a record is sometimes known as a locator record). It might contain a field describing the distributor of the information resource, which might in turn be partitioned into various fields providing details about the distributor, like this:
<Distributor>
    <Name> USGS/WRD </Name>
    <Organization> USGS/WRD </Organization>
    <Street-Address>
        U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
    </Street-Address>
    <City> ALBUQUERQUE </City>
    <State> NM </State>
    <Zip-Code> 87102 </Zip-Code>
    <Country> USA </Country>
    <Telephone> (505) 766-5560 </Telephone>
</Distributor>
NOTE: The indentation used above is used to illustrate how Zebra interprets the markup. The indentation, in itself, has no significance to the parser for the canonical input format, which discards superfluous whitespace.
The keywords surrounded by <...> are tags, while the sections of text in between are the data elements. A data element is characterized by its location in the tree that is made up by the nested elements. Each element is terminated by a closing tag, beginning with </ and containing the same symbolic tag-name as the corresponding opening tag. The general closing tag, </>, terminates the element started by the last opening tag. The structuring of elements is significant. The element Telephone, for instance, may be indexed and presented to the client differently, depending on whether it appears inside the Distributor element or inside some other structured data element, such as a Supplier element.
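The tag rules above can be sketched as a small parser. The Python code below is an illustration only, not Zebra's actual parser: the function name and the (tag, children) tuple layout are assumptions, and variant tags receive no special treatment. It shows how the general closing tag </> closes the most recent element and how superfluous whitespace is discarded.

```python
import re

# One token is an opening tag, a closing tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([^<>]*)>|([^<>]+)")


def parse_canonical(text):
    """Parse SGML-like markup into a list of nested (tag, children) tuples."""
    root = ("", [])  # artificial container for the top-level elements
    stack = [root]
    for closing, name, data in TOKEN.findall(text):
        name = name.strip()
        if data:
            value = " ".join(data.split())  # discard superfluous whitespace
            if value:
                stack[-1][1].append(value)
        elif closing:
            # '</>' closes the most recent element; '</name>' must match it.
            if len(stack) == 1 or (name and name != stack[-1][0]):
                raise ValueError("unexpected closing tag: </%s>" % name)
            stack.pop()
        else:
            node = (name, [])
            stack[-1][1].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed element: <%s>" % stack[-1][0])
    return root[1]


print(parse_canonical("<Distributor><Name> USGS/WRD </Name>"
                      "<City> ALBUQUERQUE </></Distributor>"))
```

Keeping the nesting in the result matters because, as described above, the same tag may be treated differently depending on its parent element.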
The first tag in a record describes the root node of the tree that makes up the total record. In the canonical input format, the root tag should contain the name of the schema that lends context to the elements of the record (see section Internal Representation). Zebra does not validate the contents of a record against the Z39.50 profile; it merely attempts to match up elements of a local representation with the given schema. The following GILS record contains only a single element, which strictly speaking makes it an illegal GILS record, since the GILS profile includes several mandatory elements:
<gils>
<title>Zen and the Art of Motorcycle Maintenance</title>
</gils>
Zebra allows you to provide individual data elements in a number of variant forms. Examples of variant forms are textual data elements which might appear in different languages, and images which may appear in different formats or layouts. The variant system in Zebra is essentially a representation of the variant mechanism of Z39.50-1995.
The following is an example of a title element which occurs in two different languages.
<title>
<var lang lang "eng">
Zen and the Art of Motorcycle Maintenance</>
<var lang lang "dan">
Zen og Kunsten at Vedligeholde en Motorcykel</>
</title>
The syntax of the variant element is <var class type value>. The available values for the class and type fields are given by the variant set that is associated with the current schema (see section Variant Set File).
Variant elements are terminated by the general end-tag </>, by the variant end-tag </var>, by the appearance of another variant tag with the same class and value settings, or by the appearance of another, normal tag. In other words, the end-tags for the variants used in the example above could have been omitted.
Variant elements can be nested. The element
<title>
<var lang lang "eng"><var body iana "text/plain">
Zen and the Art of Motorcycle Maintenance
</title>
associates two variant components with the variant list for the title element.
Given the nesting rules described above, we could write
<title>
<var body iana "text/plain">
<var lang lang "eng">
Zen and the Art of Motorcycle Maintenance
<var lang lang "dan">
Zen og Kunsten at Vedligeholde en Motorcykel
</title>
The title element above comes in two variants. Both have the IANA body type "text/plain", but one is in English, and the other in Danish. The client, using the element selection mechanism of Z39.50, can retrieve information about the available variant forms of data elements, or it can select specific variants based on the requirements of the end-user.
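The selection of a specific variant can be pictured with a minimal sketch. The data layout below (a list of component dictionaries keyed by class and type) is an assumption made for illustration; it is neither Zebra's internal representation nor the Z39.50 wire format.

```python
# Each variant is (components, data); components maps (class, type) -> value.
TITLE_VARIANTS = [
    ({("body", "iana"): "text/plain", ("lang", "lang"): "eng"},
     "Zen and the Art of Motorcycle Maintenance"),
    ({("body", "iana"): "text/plain", ("lang", "lang"): "dan"},
     "Zen og Kunsten at Vedligeholde en Motorcykel"),
]


def select_variants(variants, request):
    """Return the data of every variant that matches all requested components."""
    return [data for components, data in variants
            if all(components.get(key) == value
                   for key, value in request.items())]


# A client asking for the Danish form of the title:
print(select_variants(TITLE_VARIANTS, {("lang", "lang"): "dan"}))
```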
In order to handle general input formats, Zebra allows the operator to define filters which read individual records in their native format and produce an internal representation that the system can work with.
Input filters are ASCII files, generally with the suffix .flt. The system looks for the files in the directories given in the profilePath setting in the zebra.cfg file. The record type for the filter is grs.regx.filter-filename (fundamental type grs, file read type regx, argument filter-filename).
Generally, an input filter consists of a sequence of rules, where each rule consists of a sequence of expressions, followed by an action. The expressions are evaluated against the contents of the input record, and the actions normally contribute to the generation of an internal representation of the record.
An expression can be any of the following:
INIT: The action associated with this expression is evaluated exactly once in the lifetime of the application, before any records are read. It can be used in conjunction with an action that initializes tables or other resources that are used in the processing of input records.
BEGIN: Matches the beginning of the record. It can be used to initialize variables, etc. Typically, the BEGIN rule is also used to establish the root node of the record.
END: Matches the end of the record, when all of the contents of the record have been processed.
/pattern/: Matches a string of characters from the input record.
BODY: This keyword may only be used between two patterns. It matches everything between (not including) those patterns.
FINISH: The action associated with this expression is evaluated once, before the application terminates. It can be used to release system resources, typically ones allocated in the INIT step.
An action is surrounded by curly braces ({...}), and consists of a sequence of statements. Statements may be separated by newlines or semicolons (;). Within actions, the strings that matched the expressions immediately preceding the action can be referred to as $0, $1, $2, etc.
The available statements are:
begin type [parameter, ...]: Begin a new data element. The type is one of the following:
  record: Begin a new record. The following parameter should be the name of the schema that describes the structure of the record, e.g. gils or wais (see below). The begin record call should precede any other use of the begin statement.
  element: Begin a new tagged element. The parameter is the name of the tag. If the tag is not matched anywhere in the tagsets referenced by the current schema, it is treated as a local string tag.
  variant: Begin a new node in a variant tree. The parameters are class type value.
data: Create a data element. The concatenated arguments make up the value of the data element. The option -text signals that the layout (whitespace) of the data should be retained for transmission. The option -element tag wraps the data up in the tag. The use of the -element option is equivalent to preceding the command with a begin element command, and following it with the end command.
end [type]: Close a tagged element. If no parameter is given, the last element on the stack is terminated. The first parameter, if any, is a type name, similar to the begin statement. For the element type, a tag name can be provided to terminate a specific tag.
The following input filter reads a Usenet news file, producing a record in the WAIS schema. Note that the body of a news posting is separated from the list of headers by a blank line (or rather a sequence of two newline characters).
BEGIN { begin record wais }
/^From:/ BODY /$/ { data -element name $1 }
/^Subject:/ BODY /$/ { data -element title $1 }
/^Date:/ BODY /$/ { data -element lastModified $1 }
/\n\n/ BODY END {
    begin element bodyOfDisplay
    begin variant body iana "text/plain"
    data -text $1
    end record
}
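For readers more familiar with general-purpose scripting, the effect of the filter above can be approximated in Python. This is a rough sketch under stated assumptions, not a description of how Zebra evaluates filter rules: the function name and the returned structure are invented for illustration, header values are stripped of surrounding whitespace, and the variant node is left out.

```python
import re

# (pattern, element name) pairs mirroring the three header rules of the filter.
HEADER_RULES = [
    (re.compile(r"^From:(.*)$", re.M), "name"),
    (re.compile(r"^Subject:(.*)$", re.M), "title"),
    (re.compile(r"^Date:(.*)$", re.M), "lastModified"),
]


def filter_news(article):
    """Return (schema, [(element, value), ...]) for one news article."""
    # Headers and body are separated by the first pair of newlines.
    headers, _, body = article.partition("\n\n")
    record = []
    for pattern, element in HEADER_RULES:
        match = pattern.search(headers)
        if match:
            record.append((element, match.group(1).strip()))
    # The body keeps its layout, as 'data -text' would request.
    record.append(("bodyOfDisplay", body))
    return ("wais", record)


sample = "From: a@example.org\nSubject: Hello\nDate: today\n\nLine one\n  Line two\n"
print(filter_news(sample))
```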
If Zebra is compiled with support for Tcl (Tool Command Language) enabled, the statements described above are supplemented with a complete scripting environment, including control structures (conditional expressions and loop constructs), and powerful string manipulation mechanisms for modifying the elements of a record. Tcl is a popular scripting environment, with several tutorials available both online and in hardcopy.
NOTE: Variant support is not currently available in the input filter, but will be included with one of the next releases.
When records are manipulated by the system, they're represented in a tree-structure, with data elements at the leaf nodes, and tags or variant components at the non-leaf nodes. The root-node identifies the schema that lends context to the tagging and structuring of the record. Imagine a simple record, consisting of a 'title' element and an 'author' element:
            TITLE   "Zen and the Art of Motorcycle Maintenance"
ROOT
            AUTHOR  "Robert Pirsig"
A slightly more complex record would have the author element consist of two elements, a surname and a first name:
            TITLE   "Zen and the Art of Motorcycle Maintenance"
ROOT
                        FIRST-NAME  "Robert"
            AUTHOR
                        SURNAME     "Pirsig"
The root of the record will refer to the record schema that describes the structuring of this particular record. The schema defines the element tags (TITLE, FIRST-NAME, etc.) that may occur in the record, as well as the structuring (SURNAME should appear below AUTHOR, etc.). In addition, the schema establishes element set names that are used by the client to request a subset of the elements of a given record. The schema may also establish rules for converting the record to a different schema, by stating, for each element, a mapping to a different tag path.
A data element is characterized by its tag, and its position in the structure of the record. For instance, while the tag "telephone number" may be used in different places in a record, we may need to distinguish between these occurrences, both for searching and presentation purposes. While the phone numbers for the "customer" and the "service provider" both represent the same type of resource (a telephone number), it is essential that they be kept separate. The record schema provides the structure of the record, and names each data element (defined by the sequence of tags, the tag path, by which the element can be reached from the root of the record).
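The notion of a tag path can be made concrete with a short sketch. The tree layout (nested (tag, children) tuples) and the tag names are hypothetical; the point is only that two elements sharing a tag remain distinguishable by their paths.

```python
def tag_paths(node, prefix=()):
    """Yield (tag path, data) for every data element below the given node."""
    tag, children = node
    path = prefix + (tag,)
    for child in children:
        if isinstance(child, str):  # a data (leaf) node
            yield "/".join(path), child
        else:                       # a nested tag node
            yield from tag_paths(child, path)


# Hypothetical record with the same tag in two different positions.
record = ("order", [
    ("customer", [("telephone", ["555-0100"])]),
    ("serviceProvider", [("telephone", ["555-0199"])]),
])

for path, data in tag_paths(record):
    print(path, "=", data)
```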
The children of a tag node may be either more tag nodes, a data node (possibly accompanied by tag nodes), or a tree of variant nodes. The children of variant nodes are either more variant nodes or a data node (possibly accompanied by more variant nodes). Each leaf node, which is normally a data node, corresponds to a variant form of the tagged element identified by the tag which parents the variant tree. The following title element occurs in two different languages:
            VARIANT LANG=ENG  "War and Peace"
TITLE
            VARIANT LANG=DAN  "Krig og Fred"
Which of the two elements is transmitted to the client by the server depends on the specifications provided by the client, if any.
In practice, each variant node is associated with a triple of class, type, value, corresponding to the variant mechanism of Z39.50.
Data nodes have no children (they are always leaf nodes in the record tree).
NOTE: Documentation needs extension here about types of nodes - numerical, textual, etc., plus the various types of inclusion notes.
The following sections describe the configuration files that govern the internal management of data records. The system searches for the files in the directories specified by the profilePath setting in the zebra.cfg file.
When Object Identifiers (OIDs) need to be specified in the files described below, either a named OID reference or a raw OID reference may be used. For the named OIDs, refer to the source file util/oid.c from YAZ. Raw OIDs are specified in dot-notation (for example 1.2.840.10003.3.1000.81.1).
The abstract syntax definition (also known as an Abstract Record Structure, or ARS) is the focal point of the record schema description. For a given schema, the ABS file may state any or all of the following:
Several of the entries above simply refer to other files, which describe the given objects.
This section describes the syntax and use of the various tables which are used by the retrieval module.
The number of different file types may appear daunting at first, but each type corresponds fairly clearly to a single aspect of the Z39.50 retrieval facilities. Further, the average database administrator, who is simply reusing an existing profile for which tables already exist, shouldn't have to worry too much about the contents of these tables.
Generally, the files are simple ASCII files, which can be maintained using any text editor. Blank lines and lines beginning with a hash sign (#) are ignored. Any characters on a line following a (#) are also ignored. All other lines contain directives, which provide some setting or value to the system. Generally, settings are characterized by a single keyword, identifying the setting, followed by a number of parameters. Some settings are repeatable (r), while others may occur only once in a file. Some settings are optional (o), while others are mandatory (m).
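The general layout of these files is simple enough to read with a few lines of code. The following Python sketch is not Zebra's own reader; it assumes that keywords and parameters are separated by whitespace and that a (#) is never part of a parameter value.

```python
def read_directives(text):
    """Return a list of (keyword, [parameters]) from the text of a table file."""
    directives = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue                           # blank or comment-only line
        keyword, *params = line.split()
        directives.append((keyword, params))
    return directives


sample = """\
name gils
# Element set names
esetname VARIANT gils-variant.est  # for WAIS-compliance

esetname F @
"""
for keyword, params in read_directives(sample):
    print(keyword, params)
```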
The name of this file type is slightly misleading in Z39.50 terms, since, apart from the actual abstract syntax of the profile, it also includes most of the other definitions that go into a database profile.
When a record in the canonical, SGML-like format is read from a file or from the database, the first tag of the file should reference the profile that governs the layout of the record. If the first tag of the record is, say, <gils>, the system will look for the profile definition in the file gils.abs. Profile definitions are cached, so they only have to be read once during the lifespan of the current process.
When writing your own input filters, the record-begin command introduces the profile, and should always be called first thing when introducing a new record.
The file may contain the following directives:
name symbolic-name: (m) This provides a shorthand name or description for the profile. Mostly useful for diagnostic purposes.
reference OID-name: (m) The OID for the profile (name or dotted-numerical list).
attset filename: (m) The attribute set that is used for indexing and searching records belonging to this profile.
tagset filename [type]: (o) The tag set (if any) that describes the fields of the records. The type, which is optional, specifies the tag type. If not given, the type-specifier in the Tag Set files is used.
varset filename: (o) The variant set used in the profile.
maptab filename: (o,r) This points to a conversion table that might be used if the client asks for the record in a different schema from the native one.
marc filename: (o) Points to a file containing parameters for representing the record contents in the ISO2709 syntax. Read the description of the MARC representation facility below.
esetname name filename: (o,r) Associates the given element set name with an element selection file. If an (@) is given in place of the filename, this corresponds to a null mapping for the given element set name.
all attributes: (o) This directive specifies a list of attributes which should be appended to the attribute list given for each element. The effect is to make every single element in the abstract syntax searchable by way of the given attributes. This directive provides an efficient way of supporting free-text searching across all elements. However, it does increase the size of the index significantly. The attributes can be qualified with a structure, as in the elm directive below.
elm path name attributes: (o,r) Adds an element to the abstract record syntax of the schema. The path follows the syntax which is suggested by the Z39.50 document, that is, a sequence of tags separated by slashes (/). Each tag is given as a comma-separated pair of tag type and tag value, surrounded by parentheses. The name is the name of the element, and the attributes, given as a comma-separated list, specify which attributes to use when indexing the element. A ! in place of the attribute name is equivalent to specifying an attribute name identical to the element name. A - in place of the attribute name specifies that no indexing is to take place for the given element. The attributes can be qualified with field types to specify which character set should govern the indexing procedure for that field. The same data element may be indexed into several different fields, using different character set definitions. See the section Field Structure and Character Sets. The default field type is "w" for word.
The following is an excerpt from the abstract syntax file for the GILS profile.
name gils
reference GILS-schema
attset gils.att
tagset gils.tag
varset var1.var
maptab gils-usmarc.map
# Element set names
esetname VARIANT gils-variant.est # for WAIS-compliance
esetname B gils-b.est
esetname G gils-g.est
esetname F @
elm (1,10) rank -
elm (1,12) url -
elm (1,14) localControlNumber Local-number
elm (1,16) dateOfLastModification Date/time-last-modified
elm (2,1) Title w:!,p:!
elm (4,1) controlIdentifier Identifier-standard
elm (2,6) abstract Abstract
elm (4,51) purpose !
elm (4,52) originator -
elm (4,53) accessConstraints !
elm (4,54) useConstraints !
elm (4,70) availability -
elm (4,70)/(4,90) distributor -
elm (4,70)/(4,90)/(2,7) distributorName !
elm (4,70)/(4,90)/(2,10) distributorOrganization !
elm (4,70)/(4,90)/(4,2) distributorStreetAddress !
elm (4,70)/(4,90)/(4,3) distributorCity !
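The path syntax used by the elm lines can be decomposed mechanically. The sketch below is a hypothetical helper, not code from Zebra; it assumes a numeric tag type and keeps the tag value as a string, since the value is not necessarily numeric.

```python
import re

# One path step: "(type,value)", e.g. "(4,70)".
TAG = re.compile(r"\((\d+),([^()]+)\)")


def parse_elm_path(path):
    """Split '(4,70)/(4,90)/(2,7)' into [(4, '70'), (4, '90'), (2, '7')]."""
    steps = []
    for step in path.split("/"):
        match = TAG.fullmatch(step)
        if match is None:
            raise ValueError("malformed tag in path: %r" % step)
        steps.append((int(match.group(1)), match.group(2)))
    return steps


print(parse_elm_path("(4,70)/(4,90)/(2,7)"))
```

Rejecting a step with unbalanced parentheses early makes typing mistakes in a profile easier to locate.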
This file type describes the Use elements of an attribute set. It contains the following directives.
name symbolic-name: (m) This provides a shorthand name or description for the attribute set. Mostly useful for diagnostic purposes.
reference OID-name: (m) The reference name of the OID for the attribute set.
include filename: (o,r) This directive is used to include another attribute set as a part of the current one. This is used when a new attribute set is defined as an extension to another set. For instance, many new attribute sets are defined as extensions to the bib-1 set. This is an important feature of the retrieval system of Z39.50: it ensures the highest possible level of interoperability, since those access points of your database which are derived from the external set (say, bib-1) can be used even by clients who are unaware of the new set.
att att-value att-name [local-value]: (o,r) This repeatable directive introduces a new attribute to the set. The attribute value is stored in the index (unless a local-value is given, in which case this is stored). The name is used to refer to the attribute from the abstract syntax.
This is an excerpt from the GILS attribute set definition. Notice how the file describing the bib-1 attribute set is referenced.
name gils
reference GILS-attset
include bib1.att
att 2001 distributorName
att 2002 indexTermsControlled
att 2003 purpose
att 2004 accessConstraints
att 2005 useConstraints
This file type defines the tagset of the profile, possibly by referencing other tag sets (most tag sets, for instance, will include tagsetG and tagsetM from the Z39.50 specification). The file may contain the following directives.
name symbolic-name: (m) This provides a shorthand name or description for the tag set. Mostly useful for diagnostic purposes.
reference OID-name: (o) The reference name of the OID for the tag set. The directive is optional, since not all tag sets are registered outside of their schema.
type integer: (m) The type number of the tagset within the schema profile (note: this specification really should belong to the .abs file. This will be fixed in a future release).
include filename: (o,r) This directive is used to include the definitions of other tag sets into the current one.
tag number names type: (o,r) Introduces a new tag to the set. The number is the tag number as used in the protocol (there is currently no mechanism for specifying string tags at this point, but this would be quick work to add). The names parameter is a list of names by which the tag should be recognized in the input file format. The names should be separated by slashes (/). The type is the recommended datatype of the tag. It should be one of the following:
The following is an excerpt from the TagsetG definition file.
name tagsetg
reference TagsetG
type 2
tag 1 title string
tag 2 author string
tag 3 publicationPlace string
tag 4 publicationDate string
tag 5 documentId string
tag 6 abstract string
tag 7 name string
tag 8 date generalizedtime
tag 9 bodyOfDisplay string
tag 10 organization string
The variant set file is a straightforward representation of the variant set definitions associated with the protocol. At present, only the Variant-1 set is known.
These are the directives allowed in the file.
name symbolic-name: (m) This provides a shorthand name or description for the variant set. Mostly useful for diagnostic purposes.
reference OID-name: (o) The reference name of the OID for the variant set, if one is required.
class integer class-name: (m,r) Introduces a new class to the variant set.
type integer type-name datatype: (m,r) Adds a new type to the current class (the one introduced by the most recent class directive). The type names belong to the same name space as the one used in the tag set definition file.
The following is an excerpt from the file describing the variant set Variant-1.
name variant-1
reference Variant-1
class 1 variantId
type 1 variantId octetstring
class 2 body
type 1 iana string
type 2 z39.50 string
type 3 other string
The element set specification files describe a selection of a subset of the elements of a database record. The element selection mechanism is equivalent to the one supplied by the Espec-1 syntax of the Z39.50 specification. In fact, the internal representation of an element set specification is identical to the Espec-1 structure, and we'll refer you to the description of that structure for most of the detailed semantics of the directives below.
NOTE: Not all of the Espec-1 functionality has been implemented yet. The fields that are mentioned below all work as expected, unless otherwise noted.
The directives available in the element set file are as follows:
defaultVariantSetId OID-name: (o) If variants are used in the following, this should provide the name of the variant set used (it's not currently possible to specify a different set in the individual variant request). In almost all cases (certainly all profiles known to us), the name Variant-1 should be given here.
defaultVariantRequest variant-request: (o) This directive provides a default variant request for use when the individual element requests (see below) do not contain a variant request. Variant requests consist of a blank-separated list of variant components. A variant component is a comma-separated, parenthesized triple of variant class, type, and value (the two former values being represented as integers). The value can currently only be entered as a string (this will change to depend on the definition of the variant in question). The special value (@) is interpreted as a null value, however.
simpleElement path ['variant' variant-request]: (o,r) This corresponds to a simple element request in Espec-1. The path consists of a sequence of tag-selectors, where each of these can consist of either:
The occurrences-specification can be either the string all, the string last, or an explicit value-range. The value-range is represented as an integer (the starting point), possibly followed by a plus (+) and a second integer (the number of elements, default being one).
The variant-request has the same syntax as the defaultVariantRequest above. Note that it may sometimes be useful to give an empty variant request, simply to disable the default for a specific set of fields (we aren't certain if this is proper Espec-1, but it works in this implementation).
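The occurrences-specification described above is small enough to parse in a few lines. The helper below is a hypothetical sketch, not Zebra code; it returns the strings all and last unchanged and turns a value-range into a (start, count) pair.

```python
def parse_occurrences(spec):
    """Return 'all', 'last', or a (start, count) tuple for a value-range."""
    if spec in ("all", "last"):
        return spec
    start, plus, count = spec.partition("+")
    # The number of elements defaults to one when no '+' part is given.
    return (int(start), int(count) if plus else 1)


for spec in ("all", "last", "3", "2+4"):
    print(spec, "->", parse_occurrences(spec))
```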
The following is an example of an element specification belonging to the GILS profile.
simpleelement (1,10)
simpleelement (1,12)
simpleelement (2,1)
simpleelement (1,14)
simpleelement (4,1)
simpleelement (4,52)
Sometimes, the client might want to receive a database record in a schema that differs from the native schema of the record. For instance, a client might only know how to process WAIS records, while the database record is represented in a more specific schema, such as GILS. In this module, a mapping of data to one of the MARC formats is also thought of as a schema mapping (mapping the elements of the record into fields consistent with the given MARC specification, prior to actually converting the data to ISO2709). This use of the object identifier for USMARC as a schema identifier represents an overloading of the OID which might not be entirely proper. However, it represents the dual role of schema and record syntax which is assumed by the MARC family in Z39.50.
NOTE: The schema-mapping functions are so far limited to a straightforward mapping of elements. This should be extended with mechanisms for conversions of the element contents, and conditional mappings of elements based on the record contents.
These are the directives of the schema mapping file format:
targetName name: (m) A symbolic name for the target schema of the table. Useful mostly for diagnostic purposes.
targetRef OID-name: (m) An OID name for the target schema. This is used, for instance, by a server receiving a request to present a record in a different schema from the native one.
map element-name target-path: (o,r) Adds an element mapping rule to the table.
This file provides rules for representing a record in the ISO2709 format. The rules pertain mostly to the values of the constant-length header of the record.
NOTE: This will be described better. We're in the process of re-evaluating and most likely changing the way that MARC records are handled by the system.
In order to provide a flexible approach to national character set handling, Zebra allows the administrator to set up the system to handle any 8-bit character set, including sets that require multi-octet diacritics or other multi-octet characters. The definition of a character set includes a specification of the permissible values, their sort order (this affects the display in the SCAN function), and relationships between upper- and lowercase characters. Finally, the definition includes the specification of space characters for the set.
The operator can define different character sets for different fields, typical examples being standard text fields, numerical fields, and special-purpose fields such as WWW-style linkages (URx).
The field types, and hence character sets, are associated with data elements by the .abs files (see above). The file default.idx provides the association between field type codes (as used in the .abs files) and the character map files (with the .chr suffix). The format of the .idx file is as follows:
index code: This directive introduces a new search index code. The argument is a one-character code to be used in the .abs files to select this particular index type. An index, roughly, corresponds to a particular structure attribute during search. Refer to section Search.
sort code: This directive introduces a sort index. The argument is a one-character code to be used in the .abs file to select this particular index type. The corresponding use attribute must be used in the sort request to refer to this particular sort index. The corresponding character map (see below) is used in the sort process.
completeness boolean: This directive enables or disables complete field indexing. The value of the boolean should be 0 (disable) or 1 (enable). If completeness is enabled, the index entry will contain the complete contents of the field (up to a limit), with words (non-space characters) separated by single space characters (normalized to " " on display). When completeness is disabled, each word is indexed as a separate entry. Complete subfield indexing is most useful for fields which are typically browsed (e.g. titles, authors, or subjects), or instances where a match on a complete subfield is essential (e.g. exact title searching). For fields where completeness is disabled, the search engine will interpret a search containing space characters as a word proximity search.
charmap filename: This is the filename of the character map to be used for this index or field type.
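The difference between the two completeness modes can be illustrated with a small sketch. The function below is hypothetical and deliberately simplified: it treats any whitespace as a word separator, ignores the length limit on complete entries, and performs no character mapping.

```python
def index_entries(field, complete):
    """Return the index entries produced for one field value."""
    words = field.split()  # split on runs of whitespace
    if complete:
        # One entry holding the whole field, words joined by single spaces.
        return [" ".join(words)]
    # Otherwise every word becomes a separate entry.
    return words


title = "Zen  and the\tArt of Motorcycle Maintenance"
print(index_entries(title, complete=True))
print(index_entries(title, complete=False))
```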
The contents of the character map files are structured as follows:
lowercase value-set: This directive introduces the basic value set of the field type. The format is an ordered list (without spaces) of the characters which may occur in "words" of the given type. The order of the entries in the list determines the sort order of the index. In addition to single characters, the following combinations are legal:
x
).
In addition, the combinations \\, \\r, \\n, \\t, \\s (space; remember that real space-characters may not occur in the value definition), and \\ are recognised, with their usual interpretation.
uppercase value-set: This directive introduces the upper-case equivalents to the value set (if any). The number and order of the entries in the list should be the same as in the lowercase directive.
space value-set: This directive introduces the characters which separate words in the input stream. Depending on the completeness mode of the field in question, these characters either terminate an index entry, or delimit individual "words" in the input stream. The order of the elements is not significant; otherwise the representation is the same as for the uppercase and lowercase directives.
map value-set target: This directive introduces a mapping from each of the members of the value-set on the left to the character on the right. The character on the right must occur in the value set (the lowercase directive) of the character set, but it may be a parenthesis-enclosed multi-octet character. This directive may be used to map diacritics to their base characters, or to map HTML-style character-representations to their natural form, etc.
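How a value set, its upper-case equivalents, and a character mapping combine into a sort order can be sketched as follows. This is an illustration under simplifying assumptions, not Zebra's implementation: only single characters are handled (no multi-octet groups or ranges), characters outside the value set are skipped, and the Danish-flavoured alphabet is just sample data.

```python
def make_sort_key(lowercase, uppercase="", mapping=None):
    """Build a sort key function from a value set and its companions."""
    # Upper-case characters fold onto the value-set entry at the same position.
    fold = dict(zip(uppercase, lowercase))
    fold.update(mapping or {})
    # The position in the value set determines the sort order.
    rank = {ch: i for i, ch in enumerate(lowercase)}

    def key(term):
        folded = (fold.get(ch, ch) for ch in term)
        return [rank[ch] for ch in folded if ch in rank]

    return key


key = make_sort_key("abcdefghijklmnopqrstuvwxyzæøå",
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ",
                    {"é": "e"})
print(sorted(["Ål", "zebra", "Æble"], key=key))
```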
Converting records from the internal structure to an exchange format is largely an automatic process. Currently, the following exchange formats are supported: