example1 keyword file example2 files... example3 files... example4 [options] files... -a string - Match an author's last name -d string - Match a substring of definition -e string - Match a substring in the entry -g string - Match an element of geneaology -j string - Match an journal name -k string - Match a keyword -o string - Match the formal organism name -r string - Match a substring of reference title typeseq files... wcseq files... grepseq [-# | -a | -d | -l | -m | -o file | -p] pattern files...
The `example2' program is a sequence information display program. It displays all of the information the SEQIO package can automatically extract from each input entry.
The `example3' program shows how to extract information from the EMBL feature tables using the SEQIO package. Specifically, the program outputs the "note" subfield of every "CDS_pept" feature of the EMBL files given to it as input. Since the example was not meant as a general feature extraction program (although it should show you how to create one if you want), the feature name, subfield, and file format were not generalized but hard-coded into the program.
The `example4' program is an entry filtering program for GenBank formatted entries. Essentially, a primitive form of the type of filtering done by SRS. In addition to the files/databases (which must be in GenBank format), options can be given to tell the program only to output entries containing the appropriate information (see the option list above and the program comments for the specifics about that information). The program will output all input entries whose information matches all of the specified options (i.e., AND'ing all of the option filters). The matching for the individual options is case-insensitive.
The program `typeseq' is essentially a combination of a `cat' program for sequences and a simple `fetch' program (like GCG's FETCH). All the program actually does is output its input entries. But, because the SEQIO package has the capability to access single entries of files and databases, the "fetch" capability is automatically included in this simple program.
The program `wcseq' counts the number of sequences, number of entries, and total size of the sequences in the input (like `wc' counts lines, words and characters). The code to this program shows how a programmer can more closely control how files and databases are read, instead of leaving all of the database translation and reading to the SEQIO package.
As I said, these programs were written mostly to show programmers how to use the SEQIO package, but the typeseq, wcseq and the grepseq program described below may also be useful utility programs.
The program options: -# - The number of errors to allow. -a - Ambiguous characters of the text should used in the matching. -d - The text will be DNA or RNA. -l - Only list the sequences that match. Don't output their entries. -m - Only allow mismatch errors. No indels. -o filename - The output file. -p - The text will be Protein. Examples: grepseq ACGYNNCGT mydna1 mydna2 mydna3 grepseq -2 -m -l TDSSYDKER swissprot grepseq -a -1 [LIVY]G[LIVMFYAG][DK]S[LIVMT]...[DET]....[LIVMF] pirThe keyword format uses the same syntax as the grep family of programs, except that it only allows character classes of the form "[...]" and "[^...]" (where the '^' at the beginning of the class specifies that the character class is the complement of the given characters). Like the grep family, a period '.' always matches anything, and the backslash character is used to specify periods, brackets and backslashes in the keyword.
(NOTE: Only keywords whose width is 31 or smaller can be handled by the program. The string itself can be longer since it can contain character classes specifying the characters for one position, but the number of positions specified by the keyword cannot be longer than 31.)
For the input files/databases, if a command line argument specifies an existing file, then that file is taken as input. Otherwise, the argument is treated as a database search specification. In order for databases to be searched, a BIOSEQ file must be created to describe all of your databases, and the environment variable "BIOSEQ" must be set to that file. Then, a database search is specified by giving (1) the name of the database to search the whole database, as in "genbank" or "swiss-prot", (2) the database name, a colon and then a list of files and aliases to search a part of the database, as in "pir:pir1.dat,pir2.dat" or "genbank:phg,inv", or (3) a database name followed by a `suffix alias', as in "pir1" or "gbphg" which searches the files defined by the suffix alias. See file "user.doc" for complete information on creating BIOSEQ files and specifying database searches.
The program then searches each input sequence to find substrings that match the keyword either exactly, by default, or within the specified number of mismatches, insertions and deletions (or just mismatches if `-m' is specified). By default, the program will ignore any ambiguous characters in the input sequences. This avoids the problem of any keyword always matching a string of N's in DNA/RNA or X's in Protein. If the `-a' option is specified, then those ambiguous characters will be included in the matching, and ambiguous characters are considered a match if the intersection of the character classes they represent is non-empty.
If the `-l' option is specified, the output will consists of a list of one-line descriptions of each matching sequence, along with the places in the sequence where the matches occur. The first 6 non-overlapping matches will be reported for each such sequence (the best scoring match of any overlapping matches is reported). If `-l' is not specified, then the entries of matching sequences are output, where the comment section of each entry is annotated with a description of where the matches occurred in that sequence.
When complete entries are being output, the format for all output is the format of the first input file. So, if the input contains files/databases with different formats, some of the entries may be converted to the format of the first input file/database in order for the output to be a single file format.
(NOTE: When insertions and deletions are allowed, the reported left endpoint of a match will not necessarily be the actual left endpoint of the match. The reported left endpoint is always computed by subtracting the keyword size from the right endpoint and adding 1. The algorithm which performs the matching cannot keep track of the left endpoints without a significant decrease in running time. The reported right endpoint, however, will always be the actual right endpoint of the match.)