Please feel free to copy and/or edit this info into your documentation (as long as you acknowledge that your program uses the SEQIO package, any and all documentation that accompanies the package can be included in your documentation).
Jim
The single entry access specification is placed at the end of a filename, using an at sign `@', and is a comma separated list of elements. Any at sign appearing in a filename given to the program will be treated as marking the beginning of a single entry access specification (so, don't create any files using at signs).
There are three types of elements. The first consists only of a number, and it specifies that number's entry in the file, i.e., the specification "@3" specifies the third entry. The second type specifies the entry's byte offset in the file and is a number preceded by a number sign `#', such as "@#37842". Typically, this form is only used by the program itself in order to translate a database identifier into its entry's location in the database files. The third type consists of an identifier that may or may not be preceded by an identifier prefix, such as "@al3csa" or "@gb:humhba1". If the identifier prefix is not present, then any identifier for the entry (or one of the sequences in a multiple sequence alignment entry) that matches the element specifies the entry to access.
In all three cases, only a single entry is specified by each element in the list. If more than one entry matches the identifier, then only the first matching entry is accessed. Also, these access specifications specify entries, and not sequences. Thus, with a multiple sequence alignment file format and an identifier element like "phylip_align@al3csa", the access will retrieve the complete entry which contains a sequence whose identifier is "al3csa". It will NOT extract just that particular sequence from the multiple alignment entry.
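The three element types can be told apart mechanically. The following is a minimal Python sketch, not the package's own code: the names `AccessElement` and `parse_access_spec` are hypothetical, and it assumes the specification starts at the first `@' in the name.

```python
from typing import List, NamedTuple, Optional, Tuple


class AccessElement(NamedTuple):
    kind: str                     # "entry_number", "byte_offset" or "identifier"
    value: object                 # int for the first two kinds, str for identifiers
    prefix: Optional[str] = None  # identifier prefix such as "gb", when present


def parse_access_spec(name: str) -> Tuple[str, List[AccessElement]]:
    """Split 'file@spec' into the filename and its list of access elements."""
    # Assumption: the first '@' starts the specification.
    filename, at_sign, spec = name.partition("@")
    if not at_sign:
        return filename, []
    elements = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.isdecimal():
            # "@3": the third entry in the file.
            elements.append(AccessElement("entry_number", int(item)))
        elif item.startswith("#") and item[1:].isdecimal():
            # "@#37842": the entry at that byte offset.
            elements.append(AccessElement("byte_offset", int(item[1:])))
        else:
            # "@al3csa" or "@gb:humhba1": identifier with optional prefix.
            prefix, colon, ident = item.partition(":")
            if colon:
                elements.append(AccessElement("identifier", ident, prefix))
            else:
                elements.append(AccessElement("identifier", item))
    return filename, elements
```

For example, `parse_access_spec("seqs@3,#37842,gb:humhba1")` yields the filename "seqs" and one element of each type. Each element still selects only a single entry, as described above.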
The complete details for how a database search specifier is parsed and translated into entries of the database are better left for the section on the BIOSEQ standard for describing databases given at the end of this file. We'll only cover the general idea here.
The simpler form of a database specifier is one that does not contain a colon `:'. In this simpler form, the specifier can only specify either a complete database or a suffix alias (which is a database name immediately followed by a suffix describing a part of the database, like "pir1" for the first section of the PIR database or "gbest" for the EST section of GenBank). The suffix aliases that can be used with a database name depend on the contents of the BIOSEQ entry for that database. See the BIOSEQ standard for the description of what suffix aliases look like in a BIOSEQ entry.
The other form of a database specifier is where the specifier begins with either a database name or an identifier prefix, contains a `:', and then contains a comma separated list of files, aliases and database identifiers. With this type of database specifier, the database names and identifier prefixes are treated as essentially synonymous, and you can use either one interchangeably, governed by the following search rules:
Once the database name is determined, the BIOSEQ entries are searched for two pieces of information, a BIOSEQ entry that describes the files and aliases for the database and a BIOSEQ entry that gives an index file for the database. An index file is a file created by the idxseq program and specifies the location of every entry in the database. The "Index" information field of a BIOSEQ entry names the index file for the entry's database.
After this search, each element of the comma separated list of files, aliases and identifiers (remember them?) is searched first against the list of files and aliases (if such a BIOSEQ entry was found) and then against the index file (if there is an index file). Whatever matches each element is treated as the result of the database search specifier. For more on how this matching occurs (since the filenames and identifiers may contain wildcard characters), see the section at the end of this file on Database Search Specifiers.
The most unusual file "format" listed above is the `GCG-*' format. This actually refers to a set of formats specifying the GCG forms of the GenBank, EMBL, Swiss-Prot, PIR, NBRF, FASTA and IG/Stanford formats. In implementing the basic GCG format, the program includes special consideration for these formats, because the GCG format can include the complete header for entries in these formats. Thus, entries in the GenBank and GCG-GenBank formats (say) contain the same header lines, and differ only in how the sequence lines are formatted.
For that reason, the program closely relates the non-GCG and GCG forms of those seven file formats, and, more importantly, distinguishes those seven GCG formats from the generic GCG format (where the header lines of an entry are just treated as unstructured comments). The list above uses `GCG-*' because all of the alternative names can replace the `*' in a valid format name (like "GCG-GB", "GCG-pearson" or "GCG-igold", the last of which is described below).
In addition to these formats, there are also some other file "formats" that are actually variations of these formats but were created either to improve the program's running time or to support different versions of the file formats. These variations are described just after the short descriptions of the file formats above.
A file's or database's format will always be specified using these strings (or the strings below). There are no "format id numbers" for the various formats. Also, when naming a format, it can be given using any combination of uppercase and lowercase characters. The matching of format names is case-insensitive.
When the program executes and attempts to read a file of sequences or a database, the file is either assumed to be in the specified format or, if no format is specified, its format is determined automatically. This automatic determination of a file's format should work with any properly formatted file in the formats above, with the exception of the Raw format. If the automatic determination cannot figure out the format of a file, the `Plain' format is used and a warning message may be output.
(NOTE: The ASN.1 file format does not support all of the various ASN.1 text files, just the "Bioseq-set" files that only contain "Bioseq-set.seq-set.seq" records. Also, the program does not support either the ASN.1 object file format or the compressed object file format found on the Entrez CD-ROM.)
(NOTE: The program automatically distinguishes between the interleaved and sequential formats. Also, the program can handle the extra information included by an entry from the PHYLIP 'A', 'C', 'F', 'M', 'U' and 'W' options.)
My suggestion is that these formats only be used when searching the actual databases, and the basic file formats be used the rest of the time. The difference in time only becomes significant when reading files in the multi-megabyte range.
There are also format variants which have been added to account for FASTA, NBRF and IG/Stanford format limitations commonly in use. For FASTA and IG/Stanford, the limitation is that only one header line (any line beginning with a '>' or ';') may appear in the entry. For NBRF, the limitation is that no lines like "C;Accession:" or "C;Comment:" may appear after the sequence. The basic implementations do not follow these limitations, although entries which do follow the limitations can be correctly read. The formats below use a different output function which does follow the limitations. This makes the outputted entries readable by other programs that cannot handle anything other than the limited format.
Unlike the "fast" format variants above, these format variants are included in the GCG-* set of formats. So entries can be output in GCG-fastaold or GCG-NBRF-old format and the program will combine the restrictions on the header output with the GCG format for outputting the sequence lines.
The program tries to use a common set of identifier prefixes, and when an entry contains an identifier without a prefix, the program tries to attach a prefix as best it can based on the entry's format and any database information about that entry. In addition, the program uses these common identifier prefixes when trying to determine what entries an input filename or database specification refer to.
The set of identifiers that the program expects is the following (where the database name corresponding to the identifier prefix is given in parentheses):
My hope is that these identifier prefixes can become a common standard that everyone uses to specify the origin of an identifier (thus reducing the problem of creating a single common identifier valid across all databases). And, if you do create a new database, please create or define a unique identifier prefix to use with your database entries.
When the program finds an identifier in an entry that does not have an identifier prefix, it tries to attach a prefix to the entry. If the program is performing a database search, and the "IdPrefix" information field for that database is set, then that identifier prefix will be used. This provides a simple way for you to attach your unique identifier prefix to the entries of your database. It also provides a way to retain information about entries that you've extracted from a database and have put into your personal collection of sequences. Just copy the "IdPrefix" information field from the database's BIOSEQ entry into the BIOSEQ entry for your collection (assuming all of the sequences come from the same database).
If no "IdPrefix" field exists (either because no database search is being done or because no such information field has been given) and an identifier is seen without a prefix, then the program assumes that the identifier comes from the database most associated with that file format. So, identifiers in GenBank formatted entries are given "gb", in EMBL formatted entries are given "embl", and so on. The two exceptions to this rule are the NBRF format, whose entries get an "oth" prefix (for an unknown database), and the EMBL/Swiss-Prot format, where the program looks at the structure of the entry and determines as best it can whether the entry is an EMBL entry, a Swiss-Prot entry, an EPD entry or some other entry. The attached prefix is given accordingly.
The goals for this standard one-line format are the following:
    gb:A02201|acc:A02201 DNA for immF plypeptide - Phage phi-105, 664 bp.
    embl:CLEGCGA chloroplast, complete genome - green algae (E.gracilis), 143172 bp (circular DNA).
    African green monkey alpha-DNA - Cercopithecus aethiops, 208 bp (DNA).
    pir:CCCZ|acc:A00002 cytochrome c (tentative sequence) - chimpanzee
    ~V01289 Yeast gene for actin
    sp:10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10), 75 aa.
    gi|77963 nifS protein - Bradyrhizobium japonicum, 11 bp (fragment, 582230BE checksum)

The format of each of the sections, along with how the boundaries between sections are determined, are the following:
This section is considered to appear in the line if 1) the third, fourth or fifth character is a ':', 2) the first character is a '~', or 3) the second or third character is a '|'. This covers the three variations. The section ends at the first whitespace character.
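The three tests above translate directly into character-position checks. This is a small Python sketch of that rule; the function name `split_identifier_section` is hypothetical and the code is only an illustration of the stated conditions.

```python
def split_identifier_section(line: str):
    """Return (identifier_section, rest); the section is '' when absent."""
    has_colon = ":" in line[2:5]      # third, fourth or fifth character is ':'
    has_tilde = line.startswith("~")  # first character is '~'
    has_bar = "|" in line[1:3]        # second or third character is '|'
    if not (has_colon or has_tilde or has_bar):
        return "", line
    # The section ends at the first whitespace character.
    parts = line.split(None, 1)
    section = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    return section, rest
```

With the example lines given earlier, "gb:A02201|acc:A02201 DNA for ..." splits off "gb:A02201|acc:A02201", while the "African green monkey ..." line has no identifier section at all.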
If such a section appears at the end of the line, then the beginning of the section is marked by the comma. If this section is found, then the string of digits gives the length of the sequence, the "bp", "aa" and "ch" strings give some information about the alphabet, and the string in the parentheses is checked for words defining the alphabet or telling whether the sequence is a fragment or a circular string. The optional string in parentheses may contain any text, except any additional parentheses.
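The trailing length/alphabet section can be recognized with a single regular expression. The sketch below is an assumption-laden illustration: the name `parse_length_section` and the returned dictionary keys are hypothetical, the unit strings are assumed to be lower case, and only the "fragment" and "circular" words are checked inside the parentheses.

```python
import re

# ", <digits> <bp|aa|ch>", an optional "(...)" without nested parentheses,
# an optional final period, then the end of the line.
_TAIL = re.compile(r",\s*(\d+)\s*(bp|aa|ch)(?:\s*\(([^()]*)\))?\.?\s*$")


def parse_length_section(line: str):
    """Return a dict describing the trailing section, or None if absent."""
    m = _TAIL.search(line)
    if m is None:
        return None
    note = m.group(3) or ""
    words = note.lower().replace(",", " ").split()
    return {
        "length": int(m.group(1)),
        "unit": m.group(2),
        "fragment": "fragment" in words,
        "circular": "circular" in words,
        "before": line[:m.start()],   # everything ahead of the comma
    }
```

Applied to the example lines, the "embl:CLEGCGA" line gives a length of 143172 with the circular flag set, and the "pir:CCCZ" line, which has no such section, gives `None`.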
The advantage of this format is that it packs a lot of information into a single line, it is structured so that any piece of information can be unambiguously extracted from the line, and the extra syntax needed for the format (the " - " for the description/organism boundary and the comma and "aa", "bp" or "ch" for the sequence/alphabet section) is quite minimal. There is the slight disadvantage that the line sometimes is longer than 80 characters when all of the information appears on the line. But, then, there are always tradeoffs.
The BIOSEQ "standard" is mostly just a file format for describing one or more databases, along with a standard form for specifying a database search and a couple of functions used by the program that read and understand the file format. You create one (or maybe a couple) of these BIOSEQ files describing the databases you have, tell the program where to locate those files, and then you can refer to and search the databases using the information from the files.

    >mydatabase
    /databases/genbank/genpept.fasta
    /usr/home/knight/mydb/protein.dat  ~pearson/sequences
    >PIR
    /databases/pir/pir1.dat, /databases/pir/pir2.dat, /databases/pir/pir3.dat

This example describes the files for two databases, "mydatabase" and "PIR". The files in a BIOSEQ entry are separated by spaces and commas (in the standard, a comma is considered a space character). The examples given here and below all use the Unix format for specifying filenames (i.e. with '/' as the directory separator), but the files actually specified in the BIOSEQ entries should be formatted according to the operating system being used. So, for Windows NT/95, the directory separator used should be a backslash '\', and the pathnames can begin with a disk drive letter, as in "C:\databases\genbank".
Once the BIOSEQ file is given to the program, either through the use of the `BIOSEQ' environment variable or through a program option, these databases can be searched using the strings "mydatabase" or "PIR". The strings "MYDATABASE", "pir", "mYDaTaBAsE" and "Pir" can also be used as a valid database search specification (i.e., a string that specifies what database or part of a database to search), because the matching of the database search specifier to the BIOSEQ entry's database name is case-insensitive.
Also, note that the files can use the Unix shell '~' characters for referring to home directories. This is true even on Windows machines (when the "HOME" environment variable is set). The tilde is used either as "~/mydb/file" to refer to files in your home directory (i.e., the actual file is "*HOME*/mydb/file") or as "~pearson/sequences" to refer to files in another person's home directory (i.e., the actual file is "*HOMEParent*/pearson/sequences" where HOMEParent is the parent of the home directory). The files can also be relative paths instead of absolute paths, however this is not recommended because those paths will always be treated as relative to where the program executes (which will change if you move into different directories).
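The tilde rule described here differs from a shell's: another user's directory is taken as a sibling of your own home directory rather than looked up in the password database. A minimal Python sketch of that rule follows; the function name `expand_tilde` is hypothetical, the home directory is passed in explicitly (for instance from the "HOME" environment variable), and '/' separators are assumed.

```python
import posixpath


def expand_tilde(path: str, home: str) -> str:
    """Expand a leading '~' or '~user' using only the HOME value."""
    if not path.startswith("~"):
        return path
    user, _slash, rest = path[1:].partition("/")
    if user == "":
        base = home                                   # "~/..." -> HOME/...
    else:
        parent = posixpath.dirname(home.rstrip("/"))  # parent of HOME
        base = posixpath.join(parent, user)           # "~user" -> parent/user
    return posixpath.join(base, rest) if rest else base
```

With a home directory of "/usr/home/knight", the path "~/mydb/file" becomes "/usr/home/knight/mydb/file" and "~pearson/sequences" becomes "/usr/home/pearson/sequences".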
The only restriction on database names is that no colons (:) can appear in the name. The only restrictions on the filenames in a BIOSEQ entry are that no whitespace characters (space, tab, newline), commas (,), asterisks (*), question marks (?), parentheses (`(', `)') or number signs (#) can appear in the filenames (these characters have special meanings), and that the filenames should refer to files that exist and can be read.
    ~/.bioseq,/databases/bioseq.txt,/usr/local/lib/BIOSEQ

Whenever the program first tries to access a database, it looks at the value of the BIOSEQ environment variable and reads each of the files in the list. Like the Unix PATH and MANPATH variables, the order of the files in the list determines the order that the program will search through the BIOSEQ entries.
Note that this means that no BIOSEQ file can have a name containing a comma. (And for Unix users: I used commas to separate the files instead of colons, because Windows, VMS and the Mac all use colons in their pathnames. So, using a colon separator would not have been portable.)

    >mydatabase, mydb, proteins
    /databases/genbank/genpept.fasta
    /usr/home/knight/mydb/protein.dat  ~pearson/sequences

can be referred to using "mydatabase", "mydb" or "proteins" (or any variations of upper and lower case).

    >PIR: /databases/pir
    pir1.dat, pir2.dat, pir3.dat

is equivalent to the PIR entry in the first example above. Or the entry could be specified as

    >PIR: /databases
    pir/pir1.dat, pir/pir2.dat, pir/pir3.dat

which is a useful form when the files are separated into several sub-directories under a common directory. If a root directory is specified, all of the files in the entry are assumed to be inside that directory (i.e., the path to a file is considered as "*root*/*file*"). Also, note that the root directory does not end with a '/'.
On the lines beginning with a '>' (the first line of every BIOSEQ entry and the information field lines, which are described next), number signs are treated as any other character and do not begin a comment. The reason for that is so that number signs can be included as part of the information field text.

    >PIR: /databases/pir
    >Name: PIR
    >Title: Protein Information Resources Databank -
    >       Version 43.00 (December, 1994)
    >Alphabet: Protein
    >Format: pirfast
    >IdPrefix: pir
    >Index: pirindex
    pir1.dat, pir2.dat, pir3.dat

The information fieldname can contain any character except whitespace or a ':', and the text of the information field can be any string. If the string is too long for a single line, the information field can be extended to multiple lines by beginning the second and later lines with a `>' followed by one or more spaces (as with the "Title" information field above).
When information fields are specified in a BIOSEQ entry, the program can then look for those fields by name and get the fields' text as the result of the lookup. Like the matching of database names, the matching of information field names is case-insensitive, so "Name", "NAME", "name" and "nAmE" all will match the "Name" information field in the entry above.
(Note: When a multiple line information field is accessed by a program, the newline, `>' and initial spaces are stripped from the string returned by the program. So, a program accessing the "Title" field from above gets the single line:

    Protein Information Resources Databank - Version 43.00 (December, 1994)

There is no way to explicitly specify a multiple line information field to a program. The program will always see a single line.)
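The field rules above (a `>' line with "Name: text", continuation lines starting with `>' and spaces, case-insensitive names) can be sketched in a few lines of Python. This is an illustration only: `parse_info_fields` is a hypothetical name, and joining continuation lines with a single space is an assumption chosen to reproduce the "Title" result shown above.

```python
def parse_info_fields(entry_lines):
    """Collect '>Name: text' fields from one BIOSEQ entry; keys are lower-cased."""
    fields = {}
    current = None
    for line in entry_lines[1:]:      # line 0 holds the entry names and root
        if not line.startswith(">"):
            current = None            # file list or comment line
            continue
        body = line[1:]
        if body[:1] in (" ", "\t") and current is not None:
            # Continuation: '>' followed by one or more spaces.
            fields[current] += " " + body.strip()
            continue
        name, colon, text = body.partition(":")
        if not colon:
            current = None
            continue
        current = name.strip().lower()
        fields[current] = text.strip()
    return fields


example = [
    ">PIR: /databases/pir",
    ">Name: PIR",
    ">Title: Protein Information Resources Databank -",
    ">       Version 43.00 (December, 1994)",
    ">Alphabet: Protein",
    ">Format: pirfast",
    ">IdPrefix: pir",
    ">Index: pirindex",
    "pir1.dat, pir2.dat, pir3.dat",
]
fields = parse_info_fields(example)
```

A lookup such as `fields["nAmE".lower()]` then finds the "Name" field regardless of how the caller capitalizes it.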
The program has five basic information fields that it looks for when performing a database search (plus possibly some other information fields described elsewhere in the documentation). They are
Note that the example above specifies a "pirfast" for the file format. Recall that "pirfast" is one of the variations of a file format (as discussed above in the File Formats section) which uses a fast file reading implementation. Typically, the "gbfast", "pirfast", "emblfast" and "spfast" file formats should only be used in the BIOSEQ entries for the actual GenBank, PIR, EMBL and Swiss-Prot databases.
A virtual BIOSEQ entry is an entry which only contains one or more entry names and one or more information fields. It does not contain any non-comment text in the section that normally specifies the BIOSEQ entry's files. Here is a possible virtual entry:

    >PIR
    >Myprog-Opts: -gap 5 -indel 2 -w 20
    >Matrix: PAM120
    # This is a virtual entry.

With this entry and the previous entry both specified for the PIR database (documentation elsewhere should describe how to specify multiple BIOSEQ files to the program and in what order they will be read), field lookups for "Myprog-Opts" and "Matrix" will use the information from this virtual entry, and database search specifications will use the other entry to find the database files to read and the other information fields.
Note that every BIOSEQ entry must have at least one line which does not begin with a '>', so a virtual entry must have one or more of either blank lines or comment filled lines.

    >NFRES: /databases/NFRES
    all_v05/(bcta, inva, mama, orga, phga, plna, pria, roda, vrla, vrta, yeaa)
    cds_v05/(bctc, invc, mamc, orgc, phgc, plnc, pric, rodc, vrlc, vrtc, yeac)
    exo_v05/(inve, mame, orge, plne, prie, rode, vrle, vrte, yeae)
    ivs_v05/(invi, mami, orgi, plni, prii, rodi, vrli, vrti, yeai)

In this database, the files are separated into four sub-directories, "all_v05", "cds_v05", "exo_v05" and "ivs_v05". The shorthand is the use of parentheses just after the '/' to specify that the list of files within the parentheses are files in that sub-directory.
The list of files can stretch over multiple lines and can be interspersed with comments. In other words, the text inside the parentheses has the same formatting rules as the text outside the parentheses, with the exception that aliases cannot be defined inside the parentheses (aliases are described below). In addition, this shorthand can be nested to multiple levels, such as:

    >mydatabase: ~/mydbs
    nucleic/( human/(hum1 hum2 hum3)
              rodent/(rod1 rod2 rod3 rod4)
              ecoli/(eco1 eco2) )
    protein/( human/(hum1.p hum2.p)
              rodent/(rod1.p rod2.p rod3.p rod4.p)
              ecoli/eco1.p )

With this entry, an example complete pathname would be "~/mydbs/nucleic/rodent/rod2".
One restriction on the use of this shorthand is that it only can be used at the sub-directory boundary. So, the string "human/hum(1 2).p" cannot be used to specify the files "human/hum1.p" and "human/hum2.p".
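The nested shorthand amounts to keeping a stack of directory prefixes. Here is a minimal Python sketch under stated assumptions: `expand_shorthand` is a hypothetical name, comments are assumed to have been removed already, alias definitions are not handled, and the root directory is not prepended.

```python
def expand_shorthand(text: str):
    """Expand 'dir/(a b)' groups into ['dir/a', 'dir/b']; commas act as spaces."""
    spaced = text.replace(",", " ").replace("(", "( ").replace(")", " ) ")
    paths = []
    prefixes = [""]                   # stack of directory prefixes
    for tok in spaced.split():
        if tok.endswith("/("):
            # "human/(" opens a group; keep the trailing '/' in the prefix.
            prefixes.append(prefixes[-1] + tok[:-1])
        elif tok == ")":
            if len(prefixes) == 1:
                raise ValueError("unbalanced ')'")
            prefixes.pop()
        elif "(" in tok:
            # The shorthand is only valid at a sub-directory boundary.
            raise ValueError("'(' must directly follow a '/': " + tok)
        else:
            paths.append(prefixes[-1] + tok)
    if len(prefixes) != 1:
        raise ValueError("unbalanced '('")
    return paths
```

Run on the file section of the nested "mydatabase" entry, this produces sixteen relative paths, among them "nucleic/rodent/rod2", and it rejects "human/hum(1 2).p" as the restriction above requires.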
So, as an example, assuming that the files specified in the "mydatabase" entry above are the only files in the listed sub-directories, then the following entry is equivalent to the previous example:

    >mydatabase: ~/mydbs
    nucleic/(human/* rodent/* ecoli/*)
    protein/(human/hum?.p rodent/rod?.p ecoli/eco?.p)

The wildcards can appear anywhere in the filename's path, and so for a database like PDB, whose files are structured like "02/pdb102l.ent", where "102l" is the sequence entry identifier and the sub-directory "02" is the middle two characters of that id, the following BIOSEQ entry captures PDB's structure

    >PDB: /databases/pdb
    ??/pdb????.ent

despite the fact that the PDB database contains hundreds of files. And this BIOSEQ entry would permit other files, like documentation files or index files, to be kept in /databases/pdb. In this example, please note that there is no explicit relation between the sub-directory name and the middle two characters of the four character id. That relationship must be maintained separately. This entry will match any file of the form "pdb????.ent" that is in a two character sub-directory of "/databases/pdb".
There are two types of aliases, normal aliases and suffix aliases. Normal aliases consist of the alias name, the string ":(", a space/comma separated list of files, and a ')'. An example is the following:

    >mydatabase: ~/mydbs
    nucleic/( human/(hum1 hum2 hum3), rodent/(rod1 rod2 rod3 rod4),
              ecoli/(eco1 eco2) )
    protein/( human/(hum1.p hum2.p), rodent/(rod1.p rod2.p rod3.p rod4.p),
              ecoli/eco1.p )
    human:(hum1 hum2 hum3), rodent:(rod1 rod2 rod3 rod4),
    ecoli:(eco1 eco2)

The last two lines define the aliases "human", "rodent" and "ecoli". If, with this entry, a database search specification of "mydatabase:rodent" were given to the program, the program would find this BIOSEQ entry, find the alias definition for "rodent", look for the four files "rod1", "rod2", "rod3" and "rod4", and then read the files "~/mydbs/nucleic/rodent/rod1", "~/mydbs/nucleic/rodent/rod2", "~/mydbs/nucleic/rodent/rod3" and "~/mydbs/nucleic/rodent/rod4". (For all of the details on how this is done, see the database search specification description below.)
Alias names may contain no whitespace characters (space, tab, newline), directory characters (/), number signs (#), question marks (?), asterisks (*) or tildes (~), with one exception for suffix aliases.
Suffix aliases are aliases whose names begin with a '~' character and which can be used to shorten even further the database search specification used to specify a part of a database. For example, to search the PIR database with this entry

    >PIR: /databases/pir
    pir1.dat pir2.dat pir3.dat

the search specification "pir" will search the whole database, but it would be nice to be able to specify just one of the files using "pir1" or "pir3", instead of "pir:pir1.dat" or "pir:pir3.dat". This can be done by adding the following suffix alias definitions:

    ~1:(pir1.dat) ~2:(pir2.dat) ~3:(pir3.dat)

With these definitions, the search specification "pir1" will match to the "PIR" entry (since the entry name matches a prefix of the database search specification), look for a suffix alias definition whose string after the '~' is "1" (the rest of the database search specification), find it and then read the file "/databases/pir/pir1.dat".
In addition, suffix aliases without a suffix name (i.e., just the '~' character as the name) can be used to specify that only part of the database should be searched when given just the BIOSEQ entry's name as the search specifier. For example, in the NFRES database, all of the sequence entries are stored in the files in the "all_v05" sub-directory. Those sequence entries are duplicated and separated into the other sub-directories depending on whether the sequence is a cds, exon or intron.

    >NFRES: /databases/NFRES
    all_v05/(bcta, inva, mama, orga, phga, plna, pria, roda, vrla, vrta, yeaa)
    cds_v05/(bctc, invc, mamc, orgc, phgc, plnc, pric, rodc, vrlc, vrtc, yeac)
    exo_v05/(inve, mame, orge, plne, prie, rode, vrle, vrte, yeae)
    ivs_v05/(invi, mami, orgi, plni, prii, rodi, vrli, vrti, yeai)

So, in order to search the whole database, only the files in the "all_v05" directory should be read, not all of the files mentioned by the entry. This can be specified by adding the following line to the entry:

    ~:(bcta, inva, mama, orga, phga, plna, pria, roda, vrla, vrta, yeaa)

With this suffix alias definition, when the database search specification "NFRES" is given, this will match the suffix alias instead of specifying that the whole database should be searched.
When the search specification is just a database name, the first BIOSEQ entry that has a matching name and that is not a virtual entry (meaning that the database files are specified) is the entry where the database files are found. First, the entry is checked to see if it contains a suffix alias definition whose name is just "~". If so, then the text of the alias is expanded and searched for. If no such suffix alias is found, then the set of files to be read consists of all of the files listed in the entry. If any filenames contain wildcards, then those filenames are matched against the existing files and directories.
The alias expansion process (for both normal and suffix aliases) is performed by considering the text inside the alias definition as a type 3 search specifier, and recursively searching for each element of the list inside the alias definition. The two restrictions on this are that, first, the search specifiers in the alias definition can only refer to the current entry's files (and not the files/aliases of other BIOSEQ entries), and second, only 10 levels of recursion are allowed in the alias definitions. (So, yes, you can have aliases which refer to other aliases.)
When the search specification is a database name followed by a suffix alias, the BIOSEQ entry to match is the first entry with an entry name that matches a prefix of the search specifier and with a suffix alias definition whose name matches the rest of the search specifier. When a BIOSEQ entry matches, the text of the suffix alias is expanded and searched for to get the set of files to be read. If an entry only contains a match of an entry name with a specifier prefix (and does not have a matching suffix alias), then this entry does not match and other BIOSEQ entries are checked. The program does not stop at the first entry to match a prefix of the search specifier.
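The two colon-free cases (a plain database name, and a name followed by a suffix) can be sketched as below. This is a hedged illustration, not the program's algorithm: the entry dictionaries, their keys and the name `resolve_simple_specifier` are hypothetical, suffix keys are assumed to be stored in lower case, trying exact names before suffix aliases is an assumption, and the returned alias text is not expanded any further.

```python
def resolve_simple_specifier(spec, entries):
    """Return the file/alias list selected by a colon-free specifier, or None."""
    wanted = spec.lower()
    real = [e for e in entries if e["files"]]     # skip virtual entries
    # Case 1: the specifier is exactly a database name.
    for entry in real:
        if wanted in [n.lower() for n in entry["names"]]:
            # A bare '~' alias (key "") overrides "all files in the entry".
            return entry["suffix_aliases"].get("", entry["files"])
    # Case 2: database name followed by a suffix alias. An entry whose
    # name is a prefix but which lacks the suffix does not stop the search.
    for entry in real:
        for name in entry["names"]:
            if wanted.startswith(name.lower()):
                suffix = wanted[len(name):]
                if suffix and suffix in entry["suffix_aliases"]:
                    return entry["suffix_aliases"][suffix]
    return None


entries = [
    {"names": ["PI"], "files": ["pi.dat"], "suffix_aliases": {}},
    {"names": ["PIR"], "files": ["pir1.dat", "pir2.dat", "pir3.dat"],
     "suffix_aliases": {"1": ["pir1.dat"], "2": ["pir2.dat"], "3": ["pir3.dat"]}},
    {"names": ["NFRES"], "files": ["all_v05/bcta", "cds_v05/bctc"],
     "suffix_aliases": {"": ["bcta"]}},
]
```

Here "pir1" skips the hypothetical "PI" entry, whose name is a prefix of the specifier but which has no matching suffix alias, and resolves through the "PIR" entry instead.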
The third type of database search specifier is the most complex. When the specifier is a database name followed by a ':' and a list of files, aliases and entry identifiers, the search first scans the BIOSEQ entries for any entries with an entry name exactly matching the database name. If no such entries are found, the search then tries to treat the database name as an identifier prefix. It first looks to see if the database name matches an identifier prefix given in the list above (see the section on identifier prefixes). If a match is found, the search scans the BIOSEQ entries with the corresponding database name. Otherwise, the search looks for the first BIOSEQ entry with an "IdPrefix" information field whose value matches the database name. If it finds such an entry, it uses the entry name for that BIOSEQ entry as the database name. Otherwise, an error message is triggered, saying that the program could not find a database for the search specifier.
Once at least one of a non-virtual BIOSEQ entry and an "Index" file for the database have been found, the search then goes through each of the file/alias/identifier elements of the database search specifier. It first tries to match the element against all of the files and aliases listed in the non-virtual BIOSEQ entry (if such an entry was found). The process for performing this matching is described in the next section. If no match was found by treating the element as a file or alias, the element is then treated as a database identifier and the index file is used to look up the identifier (assuming an index file was found in the initial search). The entries of any matching identifiers in the database are considered to form the match to that element of the database search specification. If the lookup fails, then an error message is triggered, saying that the program could not find the element in the database.
Pathnames and files/aliases are distinguished by the presence or lack of a directory character in the string ('/' for Unix and '\' for Windows). If the string contains a '/', then it is matched against the complete path of each file specified in the entry. So, in this "mydatabase" entry,

    >mydatabase,mydb: ~/mydbs
    nucleic/( human/(hum1 hum2 hum3), rodent/(rod1 rod2 rod3 rod4),
              ecoli/(eco1 eco2) )
    protein/( human/(hum1.p hum2.p), rodent/(rod1.p rod2.p rod3.p rod4.p),
              ecoli/eco1.p )
    human:(hum1 hum2 hum3), rodent:(rod1 rod2 rod3 rod4),
    ecoli:(eco1 eco2)

valid complete paths are "nucleic/human/hum2" and "protein/ecoli/*.p". The path "human/hum1*" will not match anything as it does not match a complete pathname, unlike "*/human/hum1*" which matches "nucleic/human/hum1" and "protein/human/hum1.p".
If the string does not contain a '/', then it is considered either a filename or an alias and is matched against the filename of every file and the alias name of every alias definition. Thus, the database search specification "mydb:hum1" matches "nucleic/human/hum1", specifier "hum2*" matches "nucleic/human/hum2" and "protein/human/hum2.p" and specifier "human" matches the alias "human". Note that the specification "hum*" does NOT match the alias "human". Wildcards are only matched against files.
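The pathname/filename distinction can be illustrated with Python's standard `fnmatch` module. This is a sketch under assumptions: `match_element` is a hypothetical name, names are compared case-sensitively, and `fnmatch`'s `*' may cross a '/' boundary, which may or may not agree with the program's own wildcard matcher.

```python
import posixpath
from fnmatch import fnmatchcase


def match_element(element, files, aliases):
    """Return (matching entry files, matching alias name or None)."""
    if "/" in element:
        # Pathname: must match the complete path of a listed file.
        return [f for f in files if fnmatchcase(f, element)], None
    # Filename: compared with the last path component of every file.
    matched = [f for f in files if fnmatchcase(posixpath.basename(f), element)]
    # Alias names are compared literally; wildcards never match an alias.
    alias = element if element in aliases else None
    return matched, alias


files = [
    "nucleic/human/hum1", "nucleic/human/hum2", "nucleic/human/hum3",
    "nucleic/rodent/rod1", "nucleic/rodent/rod2", "nucleic/rodent/rod3",
    "nucleic/rodent/rod4", "nucleic/ecoli/eco1", "nucleic/ecoli/eco2",
    "protein/human/hum1.p", "protein/human/hum2.p",
    "protein/rodent/rod1.p", "protein/rodent/rod2.p",
    "protein/rodent/rod3.p", "protein/rodent/rod4.p",
    "protein/ecoli/eco1.p",
]
aliases = {"human", "rodent", "ecoli"}
```

With this data, "hum2*" selects both "nucleic/human/hum2" and "protein/human/hum2.p", "human" selects only the alias, and "hum*" selects five files but no alias.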
Both the filenames and pathnames can contain wildcard characters. So, what happens when both the filename/pathname search specifier and the pathname in the BIOSEQ entry contain wildcards? First, the search specifier filename/pathname is matched against the entry pathname, to see if a match is possible. Then, the entry pathname is expanded to all of the existing files which match that pathname, and each of those files is matched against the search specifier filename/pathname. Only the existing files which match both the entry pathname and the search specifier filename/pathname are included in the set of database files to be read. So, with the following BIOSEQ entry for GenBank:

    >GenBank: /databases/genbank
    gb*.seq

the database search specifier "genbank:*s*s*" will match the files "/databases/genbank/gbest.seq", "/databases/genbank/gbsts.seq" and "/databases/genbank/gbsyn.seq", because those are the only files in the GenBank release which have the form "gb*.seq" and whose filenames contain two "s"'s.
This condition that a file included in the set of matched files must match both the entry pathname and the filename/pathname specifier also holds if only one of them contains wildcards. Thus, when the filename/pathname specifiers contain wildcards, only the files named in the BIOSEQ entry will ever be included. For example, the database search specification "mydb:nucleic/*/*" will only match the nine files in the "human", "rodent" and "ecoli" sub-directories listed in the BIOSEQ entry, even if other files occur in those sub-directories. As a corollary, the specification "database:*" will always match all of the files listed in the database's entry.
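The "must match both" condition is an intersection of two wildcard tests. A minimal Python sketch follows, with several assumptions: `select_files` is a hypothetical name, the list of existing files is passed in rather than read from disk, paths are relative to the entry's root directory, and the preliminary pattern-against-pattern feasibility check is omitted.

```python
import posixpath
from fnmatch import fnmatchcase


def select_files(entry_pattern, spec_pattern, existing_files):
    """Keep existing files matching both the entry pattern and the specifier."""
    selected = []
    for path in existing_files:
        if not fnmatchcase(path, entry_pattern):
            continue                  # not named by the BIOSEQ entry
        # A specifier without '/' is compared with the filename only.
        target = path if "/" in spec_pattern else posixpath.basename(path)
        if fnmatchcase(target, spec_pattern):
            selected.append(path)
    return selected


# Hypothetical directory listing, for illustration only.
existing = ["gbbct.seq", "gbest.seq", "gbpri.seq", "gbsts.seq", "gbsyn.seq", "README"]
```

With the entry pattern "gb*.seq", the specifier "*s*s*" keeps only the listed files whose names contain two "s" characters, and the specifier "*" keeps every file the entry names while still excluding "README".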