I'm going to concentrate on the interface itself, so in all of the examples below, you will see constants for things like filenames, formats, database names, and so on. In a normal program those things would be specified as part of the user interface, but here I'm going to make them as simple as possible in order to illustrate the interface functions more clearly.
Jim
A program that reads a sequence file or database looks a lot like using the stdio package to do normal file I/O: it opens the file or database, repeatedly calls a function to read the next sequence, and closes the file or database when it hits EOF.
int len; char *seq; SEQFILE *sfp; if ((sfp = seqfopen("my_sequences", "r", "FASTA")) == NULL) exit(1); while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) { if (len > 0 && isa_match(seq, len)) { /* Found a match */ } } seqfclose(sfp);This code snippet is an example of searching the sequences of a FASTA-formatted file for the sequences that matched (whatever it is you might want to match). To read a database instead of a file, just replace the "
sfp = seqfopen(...)
" call with
"sfp = seqfopendb("genbank")
", to
read the GenBank database for example. Another simple change you can
make to this example is to read all of the file/database entries,
instead of the sequences those entries contain. To do that, simply
replace the call to "seq = seqfgetseq(sfp, &len,
0)
" with "entry = seqfgetentry(sfp, &len,
0)
" and the entry text for each entry is returned.
With either or both of these alterations, the rest of the program will work in exactly the same way, with two minor exceptions. First, when the `seqfgetentry' for `seqfgetseq' substitution is made and the entries in the file or database contain more than one sequence, `seqfgetseq' will read each sequence in the entry, whereas `seqfgetentry' will only read the entry once regardless of how many sequences occur in the entry.
Second, when searching databases using `seqfopendb', a BIOSEQ file must have been created and the "BIOSEQ" environment variable must include that file. See the file "user.doc" for infomation on how to create BIOSEQ files. That file also describes the strings `seqfopendb' can take to specifying a database search.
Second, the arguments to `seqfgetseq' are different from any of the fget* functions in the stdio package. The reason is that one of the deficiencies of the stdio package (in my opinion) is that the programmer has to worry about where and how to store the characters read in. I wanted programs using this package to worry as little as possible about how to store the read-in sequences and entries. Thus, the SEQIO package always remembers a "current" sequence and entry, and the sequence, entry or information about the sequence can be retrieved as needed.
In addition, the package can return the sequence/entry/information character strings in one of two ways, either using an internal buffer or by malloc'ing a new buffer to store the string. The third argument to `seqfgetseq' is a flag telling how the sequence text should be returned (zero specifies an internal buffer and non-zero specifies a malloc'ed buffer). So, the `seqfgetseq' call above tells the SEQIO package to read the next sequence in the file, make that the "current" sequence, and return that sequence's text using its internal buffers. As another example, the following snippet shows how to accumulate all of the sequences of a file into an array, using malloc'ed buffers so that each sequence remains available until the malloc'ed buffer is freed:
int i, len; char *seq, *seqs[400]; SEQFILE *sfp; if ((sfp = seqfopendb("swiss-prot")) == NULL) exit(1); for (i=0; i < 400 && (seq = seqfgetseq(sfp, NULL, 1)) != NULL; ) { if (*seq != '\0') seqs[i++] = seq; } seqfclose(sfp); /* Do the analysis of the sequences. */ while (i > 0) free(seqs[--i]);Giving a non-zero third argument to `seqfgetseq' tells the SEQIO package to malloc a new buffer for each sequence, so they can be kept around after the next call to the package (the internal buffers are reused, so their contents may be changed on the next call to a SEQIO function).
Also, note in this example that the second argument to the `seqfgetseq' function is NULL. One of the guarantees the SEQIO package makes is that the character strings of sequences and entries will be NULL-terminated strings, so you don't necessarily need the string's length to know where the sequence/entry ends. This also makes it easy to output the sequence or entry text, as in this version of the first example above which outputs the text of each entry whose sequence matches:
while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) { if (len > 0 && isa_match(seq, len)) { /* Found a match */ entry = seqfentry(sfp, NULL, 0); fputs(entry, stdout); } }Note the use of `seqfentry' instead of `seqfgetentry'. The function `seqfentry' just returns the text of the "current" entry, and does NOT read the next entry in the file. With this use of a "current" sequence and entry, a program can get multiple pieces of information about a sequence/entry one piece at a time, without having to worry about getting everything it needs at once.
The third and fourth differences between the stdio package and the
SEQIO package in these examples are slightly harder to see. They
involve the handling of errors. The third difference is that the
program simply exits when `seqfopen' returns NULL, seemingly without
printing an error message, and the fourth difference is the use of
"len > 0
" and "*seq != '\0'
" as additional
tests to see if a sequence was returned by `seqfgetseq'.
The long answers for these differences are given in the Error Handling section and file "seqio.doc", where I talk about the error handling. The short answers are that the SEQIO package by default outputs error messages when an error occurs (but this can be disabled), and that the `seqfgetseq' and `seqfgetentry' functions are unique in that they return one of three values: 1) a string of characters on a successful read, 2) an empty string with length 0 if there is a problem reading the next sequence/entry (such as when the next entry contains no sequence), but that problem is not a fatal error, and 3) NULL if end-of-file is reached or a fatal error occurs (an error for which no more reading can be done).
The functions `seqfopen' and `seqfopendb' are the common ways to open a file/database, and `seqfgetseq', `seqfgetentry' and `seqfgetinfo' (described in the next section) are the common ways to read in the sequences/entries in the file/database. There are a couple of other ways, using the functions `seqfopen2', `seqfread' and `seqfgetrawseq. Those functions are described in file "seqio.doc".
The function `seqfrawseq' can be used to retrieve both the sequence and the alignment/structure information specified with the sequence. This function works exactly the same as `seqfentry' and `seqfsequence', except that the string returned by the functions is different. Function `seqfentry' returns the complete entry text, `seqfsequence' returns only the characters of the sequence (typically all of the alphabetic characters), and `seqfrawseq' returns the sequence and the alignment/structure characters (typically all characters except whitespace and digits).
typedef struct { char *dbname, *filename, *format; int entryno, seqno, numseqs; char *date, *idlist, *description; char *comment, *organism, *history; int isfragment, iscircular, alphabet; int fragstart, truelen, rawlen; } SEQINFO;The structure contains six fields which the SEQIO package has about the current sequence:
The other twelve fields are information extracted from the current entry (see "format.doc" for the details about which information is retrieved for each file format):
char *entry; SEQINFO *info; SEQFILE *sfp; if ((sfp = seqfopendb("genbank")) == NULL) exit(1); while ((info = seqfgetinfo(sfp, 0)) != NULL) { if (info->iscircular) { entry = seqfentry(sfp, NULL, 0); fputs(entry, stdout); } } seqfclose(sfp);and this code snippet finds and outputs the entry (or entries) with a given accession number:
char *s, *t, *idlist, *entry; SEQINFO *info; SEQFILE *sfp; if ((sfp = seqfopendb("genbank")) == NULL) exit(1); while (seqfread(sfp, 1) == 0) { idlist = seqfidlist(sfp, 0); if (idlist != NULL) { /* * Scan the idlist, looking for an identifier whose prefix is * "acc" and whose number matches the accession. */ s = idlist; while (*s) { for (t=s; *s && *s != '|'; s++) ; if (strncmp(t, "acc:X01828", 10) == 0) { entry = seqfentry(sfp, NULL, 0); fputs(entry, stdout); break; } if (*s) s++; } } } seqfclose(sfp);A couple points to note about these examples and the fields of the SEQINFO structure. First, the string `idlist' is a vertical bar separated list of identifiers, where each identifier consists of a prefix naming the database or type of identifier and a suffix giving the actual id. See file "user.doc" for a complete description of these identifiers and identifier prefixes.
Third, the functions like `seqfidlist' are similar to `seqfsequence', `seqfentry', and `seqfinfo' in that they return some information about the "current" sequence/entry. The package has one of these access functions for every field in the SEQINFO structure (i.e., `seqfdate', `seqfiscircular', ...). For the SEQINFO fields that are character strings, these functions take two arguments, where the second argument is just like the third argument of `seqfsequence' or `seqfentry'. It tells whether the package should return the character string using an internal buffer or in a malloc'ed buffer. (Again, be aware that the internal buffer strings are guaranteed to remain unchanged only upto the next call to the SEQIO package.)
Fourth, the previous point raises the question of what happens when `seqfinfo' or `seqfgetinfo' is called with a second argument of 1, and the SEQINFO structure is returned in a malloc'ed buffer. Where do the character string fields of the structure point to? And will it be hard to free up the SEQINFO structure and its character strings? When `seqinfo' or `seqfgetinfo' is called with a second argument of 1, they actually malloc one large buffer, and store both the SEQINFO structure and the character string fields in that one buffer. And since the SEQINFO structure is placed at the beginning of the malloc'ed buffer, simply free'ing the SEQINFO structure will automatically free up all of its character strings.
And fifth, note the use of `seqfread' in the second example. It was used because there is no `seqfgetidlist' function in the package. The only functions which both read the next entry/sequence and return something about that entry/sequence are `seqfgetseq', `seqfgetrawseq', `seqfgetentry' and `seqfgetinfo'. To perform searches using the other information functions, you must use one of the four entry/sequence reading functions listed in this paragraph. Also, in case the arguments to `seqfread' are confusing, the second argument to `seqfread' is NOT the same as the second argument to `seqfidlist'. The second argument to `seqfread' specifies whether to read the next sequence (if zero) or to read the next entry (if non-zero).
`Seqfmainid' and `seqfmainacc' are variations of the `seqfidlist' which only return a "main" identifier, instead of returning the whole identifier list. This is useful in cases where you don't necessarily want to search the complete list of identifiers, but just want a single identifier to associate with a sequence. `Seqfmainid' returns the "main" identifier for a sequence, which specifically is the first non-accession identifier, if one exists, or the first accession number in the entry otherwise. The `seqfmainacc' function returns the first accession number in the entry, if one exists. Both have the same arguments as `seqfidlist', and both return a NULL-terminated string containing the single identifier, with an identifier prefix. So, the example above which searches for an accession number could be rewritten as the following, if we were just looking for the entry whose main accession number is "X01828":
char *mainid, *entry; SEQFILE *sfp; if ((sfp = seqfopendb("genbank")) == NULL) exit(1); while (seqfread(sfp, 1) == 0) { if ((mainid = seqfmainid(sfp, 0)) != NULL && strncmp(mainid, "acc:X01828", 10) == 0) { entry = seqfentry(sfp, NULL, 0); fputs(entry, stdout); } } seqfclose(sfp);The function `seqfoneline' can be used to create a "oneline" description of the information for an entry. A number of programs (and a number of file formats) have situations where they would like to present the user with a relatively compact, one line description of a particular sequence. The SEQIO package defines a standard format for this type of description for biological sequence, and `seqfoneline' is the function the package provides to construct these descriptions. The argument list for `seqfoneline' is the following:
int seqfoneline(SEQINFO *info, char *buffer, int buflen, int idonly);where `info' is a SEQINFO structure, `buffer' is a character buffer where the oneline description will be stored, `buflen' is the length of the buffer, and `idonly' will be discussed momentarily.
This function operates in a similar manner as `fgets', in that the string it constructs is stored in the buffer passed to it. It differs from fgets in two major respects (apart from the fact that it does no file reading). The first is that the oneline description is guaranteed to both fit in the buffer and to be NULL-terminated (i.e., no oneline description will ever be longer than "buflen-1" characters). The second is that the function returns the length of the oneline description stored in `buffer', instead of a pointer to buffer itself. Hopefully, both of these differences will be more useful in practice than the way fgets works.
The final argument to `seqfoneline' is an `idonly' flag specifying whether the "oneline description" should in fact just contain a single identifier for the sequence. This flag is useful in cases where you just want a single identifier string that is guaranteed to be no longer than a certain length (most notably in the output of the PHYLIP, Clustalw and MSF formats). When the flag is non-zero, the string stored in `buffer' is guaranteed to contain a single word identifier or description, and is guaranteed not to contain any whitespace.
The final variation on accessing information from an entry is `seqfallinfo'. This function works exactly like `seqfinfo', except that the comment field of the SEQINFO structure returned contains a different string. Using `seqfinfo', the comment string returned consists of whatever comment appears in the entry. With `seqfallinfo', the comment string contains the complete header of the entry. The specifics of what string this is depends on the particular file format, but generally it consists of all of the lines of the entry except the sequence lines.
If neither of those ways can get the information you're looking for, the third way of getting information from a sequence is to get the entry's text and scan that text for the information, as in this example which outputs all entries in the file "alu.human" of the "REPBASE" database which are classified in the "Alu-J" region:
char *entry, *s; SEQFILE *sfp; if ((sfp = seqfopendb("repbase:alu.human")) == NULL) exit(1); while ((entry = seqfgetentry(sfp, NULL, 0)) != NULL) { if (strstr(entry, "\nFT \\rpt_family=\"Alu-J\"")) fputs(entry, stdout); } seqfclose(sfp);This works, because when looking at the "alu.human" file, the sequences are classified by the line
FT \rpt_family="Alu-J"Thus, by reading each entry and doing a simple scan for that particular line, I can extract the appropriate entries. And of course, more complicated (or robust) searches of the entries could be written, but the point here is that the SEQIO package takes care of all of the file I/O and simplifies the programmer's task to just implementing the scanning.
int len; char *seq; SEQINFO *info; SEQFILE *insfp, *outsfp; if ((insfp = seqfopen("my_sequences", "r", "embl")) == NULL) exit(1); if ((outsfp = seqfopen("my_seqs.2", "w", "genbank")) == NULL) exit(1); while ((seq = seqfgetseq(insfp, &len, 0)) != NULL) { if (len > 0 && (info = seqfino(insfp, 0)) != NULL) seqfwrite(outsfp, seq, len, info); } seqfclose(insfp); seqfclose(outsfp);The SEQIO package also contains a `seqfconvert' function, which can simplify this code just a little bit (although there's not much farther that you can go):
int len; char *seq; SEQINFO *info; SEQFILE *insfp, *outsfp; if ((insfp = seqfopen("my_sequences", "r", "embl")) == NULL) exit(1); if ((outsfp = seqfopen("my_seqs.2", "w", "genbank")) == NULL) exit(1); while (seqfread(insfp, 0) != NULL) seqfconvert(insfp, outsfp); seqfclose(insfp); seqfclose(outsfp);For the function `seqfopen', its second argument is the same as the second argument to `fopen', except that `seqfopen' only supports reading ("r"), writing ("w") and appending ("a") modes. Also, when writing a file, the third `seqfopen' argument specifying the format must be given. It cannot be NULL.
So, if you want to create new entries containing information that you compute using some other method, simply declare a SEQINFO structure, fill in its fields with the strings and values you've computed, and pass it and the sequence to `seqfwrite'.
int len; char *seq; SEQINFO info; SEQFILE *insfp, *outsfp; if ((outsfp = seqfopen("new_seqs", "w", "sprot")) == NULL) exit(1); while (/* more entries to create */) { memset(&info, 0, sizeof(SEQINFO)); /* Perform some computation to get a sequence and to fill in the fields of the SEQINFO structure. */ seqfwrite(outsfp, seq, len, &info); } seqfclose(outsfp);The SEQINFO structure has been defined so that all of the default values for the fields are 0 (or NULL for character strings). Thus, setting all of the bytes of the structure to 0 sets all of the default values.
The function takes a SEQFILE pointer (open for writing), an entry and a string, and it inserts the string into the comment section of the entry as it is outputting the entry. The arguments for `seqfannotate' are the following:
int seqfannotate(SEQFILE *sfp, char *entry, int entrylen, char *newcomment, int flag)where `sfp' is the SEQFILE structure, `entry' and `entrylen' give the necessary information about the entry, `newcomment' is the string to be inserted, and `flag' tells whether or not to retain any existing comments in the entry (zero says to remove all other comments and non-zero says to retain the comments). As an example, here is the example program given at the beginnning of this file, extended so that it adds the matching positions to the entry text.
int len, entrylen; char *seq, *entry, *str; SEQFILE *sfp, *sfpout; if ((sfp = seqfopen("my_sequences.3", "r", "pir")) == NULL) exit(1); if ((sfpout = seqfopen("-", "w", seqfformat(sfp))) == NULL) exit(1); while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) { if (len > 0 && (str = isa_match(seq, len)) != NULL) { /* Found a match */ entry = seqfentry(sfp, &entrylen, 0); seqfannotate(sfpout, entry, entrylen, str, 1); } } seqfclose(sfp);where the function `isa_match' now returns a character string such as
"Prosite Pattern: GLYCOSAMINOGLYCAN (S-G-x-G)\nMatches: 10-16, 503-508.\n"instead of just a boolean flag. (Note: the string here is a literal version of the string that isa_match might return.)
One thing that might appear to be missing from the `seqfannotate' call is the format of the entry being passed to it. The format for the passed in entry is assumed to be the same as the format that was specified when the SEQFILE structure was opened for writing. (Note the use above of `seqfformat' when opening the output, and recall that giving "-" to `seqfopen' tells it to open standard input or standard output.) If the entry is not in the correct form, a parse error will occur and nothing will be output.
With the example program above, if the entry text given to `seqfannotate' were the following (to use an actual PIR entry):
ENTRY CCMQR #type complete TITLE cytochrome c - rhesus macaque (tentative sequence) ORGANISM #formal_name Macaca mulatta #common_name rhesus macaque DATE 17-Mar-1987 #sequence_revision 17-Mar-1987 #text_change 05-Aug-1994 ACCESSIONS A00003 REFERENCE A00003 #authors Rothfus, J.A.; Smith, E.L. #journal J. Biol. Chem. (1965) 240:4277-4283 #title Amino acid sequence of rhesus monkey heart cytochrome c. #cross-references MUID:66045191 #contents Compositions of chymotryptic peptides and sequences of residues 55-61 and 68-70 #accession A00003 ##molecule_type protein ##residues 1-104 ##label ROT CLASSIFICATION #superfamily cytochrome c; cytochrome c homology KEYWORDS acetylated amino end; electron transfer; heme; mitochondrion; oxidative phosphorylation; respiratory chain FEATURE 1 #modified_site acetylated amino end (Gly) #status experimental\ 14,17 #binding_site heme (Cys) (covalent) #status predicted\ 18,80 #binding_site heme iron (His, Met) (axial ligands) #status predicted SUMMARY #length 104 #molecular-weight 11605 #checksum 9512 SEQUENCE 5 10 15 20 25 30 1 G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G P 31 N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I T W G 61 E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E E 91 R A D L I A Y L K K A T N E ///the output from `seqfannotate' would be
ENTRY CCMQR #type complete TITLE cytochrome c - rhesus macaque (tentative sequence) ORGANISM #formal_name Macaca mulatta #common_name rhesus macaque DATE 17-Mar-1987 #sequence_revision 17-Mar-1987 #text_change 05-Aug-1994 ACCESSIONS A00003 REFERENCE A00003 #authors Rothfus, J.A.; Smith, E.L. #journal J. Biol. Chem. (1965) 240:4277-4283 #title Amino acid sequence of rhesus monkey heart cytochrome c. #cross-references MUID:66045191 #contents Compositions of chymotryptic peptides and sequences of residues 55-61 and 68-70 #accession A00003 ##molecule_type protein ##residues 1-104 ##label ROT COMMENT Prosite Pattern: GLYCOSAMINOGLYCAN (S-G-x-G) Matches: 10-16, 503-508. SEQIO annotation, lines 1-2. 02-Feb-1996 CLASSIFICATION #superfamily cytochrome c; cytochrome c homology KEYWORDS acetylated amino end; electron transfer; heme; mitochondrion; oxidative phosphorylation; respiratory chain FEATURE 1 #modified_site acetylated amino end (Gly) #status experimental\ 14,17 #binding_site heme (Cys) (covalent) #status predicted\ 18,80 #binding_site heme iron (His, Met) (axial ligands) #status predicted SUMMARY #length 104 #molecular-weight 11605 #checksum 9512 SEQUENCE 5 10 15 20 25 30 1 G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G P 31 N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I T W G 61 E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E E 91 R A D L I A Y L K K A T N E ///Note the new COMMENT section between the REFERENCE and CLASSIFICATION sections. And when read back in again, the string returned by `seqfcomment' would be the string
"Prosite Pattern: GLYCOSAMINOGLYCAN (S-G-x-G)\nMatches: 10-16, 503-508.\n"Exactly what was inserted (because the original entry had no other comments).
This section discusses the four of the five functions related to the BIOSEQ standard for specifying and searching databases. I assume in this section that you have read the parts of "user.doc" that relate to the BIOSEQ standard and have some idea about what a BIOSEQ file looks like. Please go read that text first.
The five BIOSEQ functions that are included in the SEQIO package (and in fact make up all of its functionality except for the standard itself) are `bioseq_read' which reads the BIOSEQ files, `bioseq_check' which can check to see if a database search specifier is valid, `bioseq_info' which is used to get an information field from a BIOSEQ entry, `bioseq_parse' which is used to get the list of files specified by a database search. and `bioseq_matchinfo' which is used to determine which BIOSEQ entry for a database has an information field with a particular value. This section talks about all of these functions except `bioseq_matchinfo'.
The function `bioseq_read' takes in the name of a file, reads the BIOSEQ entries in the file, checks the syntax of those entries, and stores all of the entry information in internal data structures. Those data structures are then used by the `bioseq_info', `bioseq_matchinfo' and `bioseq_parse' functions.
By default, the first files read are always the files specified by the "BIOSEQ" environment variable, if it is defined. This is done before any of the bioseq_* functions perform their operation. Then, each call to `bioseq_read' reads subsequent files.
The internal data structure used by the package is a list of the read-in entries, and the determination of which entry a database search specification refers to is performed by searching through the list. The entries in the list are stored in reverse order of the calls to bioseq_read, but in the given order within a specific call to bioseq_read. So, the first entry checked is always the first entry of first file from the last call to bioseq_read. From there, the rest of the entries in that last call are checked, and after the last entry of that last call, the first entry of the next to last call to bioseq_read is checked. This way, the later calls to `bioseq_read' will have priority over the previous calls to `bioseq_read' (or the "BIOSEQ" env. variable files), in case of duplicates.
Therefore, if you're writing a program and you want to allow the user to have multiple ways to specify BIOSEQ files (such as the BIOSEQ environment variable, plus other user-specified or program-specific files), use `bioseq_read' to read in the files in increasing priority, and the SEQIO package will always pick the highest priority BIOSEQ entry for each database. And, if you want the files specified by the "BIOSEQ" env. variable to have a higher priority than other files, simply call `bioseq_read' to reread the environment variable value. A BIOSEQ file can always be read in more than once, and the latest read will always override the entries from the previous read (unless the names of the BIOSEQ entries have changed between reads).
The function `bioseq_check' takes a database search specifier and checks whether it refers to a known database (i.e., whether a BIOSEQ entry exists for that database). It returns non-zero if the BIOSEQ entry exists, and zero otherwise. This can be used for a quick error check testing whether the specifier given by the user is valid or not.
The function `bioseq_info' is used to get the text from an information field in the BIOSEQ entry for a database. These information fields provide an easy way for the user to pass database-specific information to your program. One example of this is to allow the user to specify some command line options using an information field specific to the database. This way, the user can "tune" the program for each database, without having to always keep track of what option values must be specified for each database.
The SEQIO package also "defines" several information fields that it uses when performing database searches. These fields are `Name', `Format', `Alphabet', `IdPrefix' and `Index'. The `Name' field gives the name of the database, and its presence distinguishes BIOSEQ entries for databases from entries for personal collections of files. The `Format' and `Alphabet' fields specify the format for the database files and the alphabet for the database sequences, respectively. The `IdPrefix' field specifies the identifier prefix that should be given to the main identifier in each entry. The `Index' field specifies the name of the file which indexes all of the database's entries (see "idxseq.doc" for more information about the index files).
(NOTE: Information fields can only be "defined" in the sense that the user can be asked to place the requested text in information fields for the specified keywords. There is nothing requiring those fields to be there or restricting what text the user puts there, except maybe that improper text will trigger an error in the package or your program.)
The `bioseq_parse' function is the function used to parse database search specifications and determine the list of files that should be read in that search. This function (along with the `bioseq_info' function for the four information fields above) is used by `seqfopendb' to open a database search. In fact, that initial example opening a database search could be replaced with the following code snippet, and it would perform the same operations (with one exception noted below):
int len; char *s, *t, *files, *seq; SEQFILE *sfp; /* * The next 9 lines replace the lines: * if ((sfp = seqfopendb("genbank")) == NULL) * exit(1); */ if ((files = bioseq_parse("genbank")) == NULL) exit(1); for (s=files; *s; s++) { for (t=s; *s != '\n'; s++) ; *s = '\0'; if ((sfp = seqfopen(t, "r", NULL)) == NULL) exit(1); while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) { if (len > 0 && isa_match(seq, len)) { /* Found a match */ } } seqfclose(sfp); } free(files);The string returned by `bioseq_parse' is a list of the database's files to be read, where each filename is terminated by a newline character (including the last filename), and the whole string is terminated by a NULL character. This string is stored in a malloc'ed buffer, and so must be freed when no longer useful. (Why newline? Hey, it probably won't appear in a filename, it's different from '\0' and it makes printing the list of files look nice. Got better reasons for some other character?)
The example above opens the same set of files and reads the same sequences. The only potential difference between the execution of that example and the example using `seqfopendb' is that the SEQIO package will not know about the four information fields associated with the database, and so minor differences may appear in the results (very minor differences in the fields of any SEQINFO structure and any output generated by SEQIO). This information could be included in the example using `bioseq_info', `seqfsetdbname', `seqfsetidpref' and `seqfsetalpha', as follows:
char *format, *dbname, *alpha, *idprefix; if ((files = bioseq_parse("genbank")) == NULL) exit(1); format = bioseq_info("genbank", "Format"); dbname = bioseq_info("genbank", "Name"); alpha = bioseq_info("genbank", "Alphabet"); idprefix = bioseq_info("genbank, "IdPrefix"); for (s=files; *s; s++) { for (t=s; *s != '\n'; s++) ; *s = '\0'; if ((sfp = seqfopen(t, "r", format)) == NULL) exit(1); if (dbname != NULL) seqfsetdbname(sfp, dbname); if (alpha != UNKNOWN) seqfsetalpha(sfp, alpha); if (idprefix != NULL) setfsetidpref(sfp, idprefix); while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) { if (len > 0 && isa_match(seq, len)) { /* Found a match */ } } seqfclose(sfp); } free(files); if (format != NULL) free(format); if (dbname != NULL) free(dbname); if (alpha != NULL) free(alpha); if (idprefix != NULL) free(idprefix);Note that the format string returned by `bioseq_info' is also returned in a malloc'ed buffer, and so must be freed after its use.
The image I had when designing the error handling of the package was that when the package is being used to create a "quick and dirty" program that is written just to quickly get information or entries from a database or file, the SEQIO package should do as much as possible to descriptively report and properly handle errors. However, when the package is used to create robust application software with either a command line or a windowing user interface, the programmer should have the ability to disable some or all of that reporting/handling mechanism and replace it with their own error handling routines.
By default, when the SEQIO package detects an error, it first sets the values of variables `seqferrno' and `seqferrstr' to an integer error value and the text of an error message, respectively. These variables are defined in "seqio.h" as extern variables, so you have access to their values at all times. (See file "seqio.doc" for a more complete description of the values `seqferrno' can take.) The next thing that the package does is output an error message on standard error. And finally, depending on the seriousness of the error, the package may either return an error value as the result of the SEQIO function call or it may exit the program.
Obviously, the outputting of an error message or the program exiting can affect your user interface, so I've tried to design the package so you can either work with these actions more easily or disable them easily. The first thing I've done is try to write all of the error messages so that they would be comprehensible to the user of your program, who may not know about a SEQIO package. I could not handle all of the cases (in particular, the error message from calls to `seqfparseent' and `seqfannotate' are not as informative, because those functions are not given any information that originally came from the user, such as a filename). But, for the most part, the error messages should not be incomprehensible. If you do find an error message that you think could be improved, please send an message to knight@cs.ucdavis.edu.
The second thing I've done is to limit the times when the package exits the program only to when (1) the package detects that no more memory is available or when (2) it detects an bug in the package code. Thus, (hopefully) there will be few occasions when the package will actually exit the program. And, typically the "quick and dirty" programs don't have any better handling of these errors.
The third thing I've done is to include a function `seqfsetperror' to allow you to redirect all of the error printing the package does. This function takes another function as its argument, and, when given that argument function, the SEQIO package will call that function for any error printing, instead of calling its default print error function. Thus, you can redirect all of the error output to an empty function, to a function that changes the text of the error messages, or to a function which pops up an error window with the text of the message.
The fourth thing I've done is to add a function `seqferrpolicy' which allows
you to disable some or all of the error output and whether the program
calls exit on memory errors and program bugs. See the file "seqio.doc" for the details on
`seqferrpolicy'. Thus, when you want to handle the error reporting
and handling yourself, the package can be told to just set
`seqferrno', set `seqferrstr' and return error values from the package
functions. And, even in that case, you still have access to the
messages that the package would have output, since that message is
stored in `seqferrstr'. So, for example, if you are writing a
windowing program and you want some but not all error messages to
appear in a popup window, you can make the call
"seqferrpolicy(PE_NONE)
", and then after the SEQIO
package calls which may trigger an error worth reporting, check the
value of seqferrno. The package is guaranteed never to output any
messages or exit the program (except if it core dumps, of course).
Ultrix, SunOS, Solaris, IRIX, Windows NT/95If your machine is not one of these, there is a chance the program may not compile on it. Based on my experience with other software I've written, my guess is that the code should compile on most of the Unix variants, with the exception that the proper include files needed to read directory files may differ from those in the code. On non-Unix variants, the code probably will not compile, as the code dealing with directory files is specifically geared for the Unix and Windows operating systems.
If you do have a machine not on the list, are not able to compile it and want to port it, first send me mail (at knight@cs.ucdavis.edu). I am very interesting in getting the program to work on as many systems as possible, and will try to help as much as possible (including implementing any changes on my latest version of the code and immediately sending you a personal release, so that you would not have to wait until the next version of the code came out). Then, check these list of things below, which may narrow down where the problem lies.
First, the current version of the code uses the following include files:
#include <stdio.h> #include <stdlib.h> #include <ctype.h> #include <fcntl.h> #include <stdarg.h> #include <string.h> #include <time.h> #include <errno.h> #include <sys/types.h> #include <sys/stat.h> #ifdef __unix #include <dirent.h> #ifdef SYSV #include <sys/dirent.h> #endif #endif #ifdef WIN32 #include <windows.h> #endif #include "seqio.h"plus, the following include file
#include <sys/mman.h>is ifdef'ed inside the preprocessor define value ISMAPABLE (see below for the discussion of the `mmap' system call and ISMAPABLE).
If your machine does not have some of these includes, take them out, figure out which variable/functions needed those includes, and then figure out which include files your system needs to declare those variables/functions.
Second, here is a complete list of the external variables and function calls used by the bulk of my program.
* Current set of external calls in main section of code: * exit, fclose, fopen, fputc, fputc, fprintf, free, fwrite, * getenv, getpagesize, isalpha, isalnum, isdigit, isspace, * malloc, memcpy, memset, realloc, sizeof, sprintf, * strcpy, strcmp, strlen, strncmp, tolower, va_arg, va_end, * va_start, vsprintf * mmap, munmap (these are ifdef'd inside `ISMAPABLE') * * Current set of (unusual?) data-structures/variables in main section: * errno, va_list, __LINE__, * caddr_t (this is ifdef'd inside `ISMAPABLE')In addition, I've encapsulated a lot of the system operations into functions at the end of the file "seqio.c". My assumption was that the functions and variables above are common to most or all machines, whereas the functions and variables below are more machine specific. So, I put all of the machine specific code at the end of the file, where it is much easier to find. Here is a list of all of the functions/variables/structures made in these encapsulated functions:
* Current set of external calls in end section of code: * close, ctime, open, read, stat, time * * closedir, opendir, readdir (these are ifdef'd inside `__unix') * * GetCurrentDirectory, SetCurrentDirectory, * FindFirstFile, FindNextFile, CloseHandle * (these are ifdef'd inside `WIN32') * * Current set of (unusual?) data-structures/variables in end section: * stat structure, time_t, stdin, stdout, stderr * DIR, dirent structure (these are ifdef'd inside `__unix') * WIN32_FIND_DATA, HANDLE (these are ifdef'd inside `WIN32')If any of these functions or variables are not supported on your machine, please let me know and we can figure out how to work around them.
Here are some additional tips and requirements for the package:
#include <sys/types.h> #include <dirent.h> #ifdef SYSV #include <sys/dirent.h> #endif
These include files are compatible with SunOS, SOLARIS, Ultrix, OSF, DYNIX (or whatever the Sequent's Unix variant is called), IRIX and HPUX. I have tested the directory include files on all these.
This variable is used in all of the BIOSEQ processing, which must know the format of directory pathnames. If directory paths use some format other than a string of names separated by the directory character (as the VMS systems do), we'll have to work together to reimplement the BIOSEQ processing.
I have encapsulated all of the code dealing with the `mmap' call inside a preprocessor define value ISMAPABLE, and at the beginning of "seqio.c", I include an ifdef expression which, for the systems that support the `mmap' call, defines ISMAPABLE. So, another way you can check to see if the `mmap' call on your system exists is to compile the program with the -DISMAPABLE option and see if it compiles. If so, please send me mail so I can add that system to the ifdef expression that turns on the mmap'ing.