Almost anyone who performs sequence analysis or works with computerized databases eventually runs into the problem of wanting to do "something else." Whether that means extracting entries from a database based on a relationship no one has provided software for, or writing new software that can handle different sequence file formats, the question then becomes, "Are the results gained from doing this `something else' really worth the effort?" The SEQIO package is a C/C++ library designed to reduce that effort, both for people with little or no programming experience and for more experienced software developers.
The package handles the following sequence file formats:

    Raw/Plain, GenBank, PIR (CODATA), EMBL, Swiss-Prot, FASTA, NBRF,
    IG/Stanford, ASN.1 text, GCG, MSF, PHYLIP, Clustalw, FASTA-output,
    BLAST-output

where FASTA-output and BLAST-output are the formats of the output produced by the FASTA and BLAST suites of programs.
The package also encapsulates random access to the various databases and access to single entries of a file. For example, specifying "gb:humhb*" will retrieve all of the human beta globin genes from GenBank (or, more precisely, all of the GenBank entries whose locus matches "humhb*") by randomly accessing the locations of those entries in the database. Likewise, a specification like "myseqs@3,1,5" extracts the third, first and fifth entries from file "myseqs" (in that order), and "myseqs@al3csa" extracts the first entry containing the identifier "al3csa". Any program that uses the SEQIO package automatically has this feature (and, as discussed below, without any added complexity on the programming side).
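To make the "file@selector" notation concrete, here is a minimal sketch of how such a specification might be split into its parts. It is an illustration only, not SEQIO's own parser: the EntrySpec type, the parse_entry_spec function and the buffer limits are hypothetical, the "gb:" database prefix and wildcards are not handled, and any selector that begins with a digit is assumed to be a list of entry numbers.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for this sketch; not part of SEQIO. */
typedef struct {
    char file[128];   /* text before the '@' */
    int  numbers[16]; /* entry numbers, in the order requested */
    int  count;       /* number of entry numbers; 0 if none were given */
    char ident[128];  /* identifier selector; empty if numbers were given */
} EntrySpec;

/* Split "file", "file@3,1,5" or "file@ident" into its parts.
 * Returns 0 on success, -1 if the specification is malformed or too long. */
int parse_entry_spec(const char *spec, EntrySpec *out)
{
    const char *at, *p;
    size_t flen;

    assert(spec != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    at = strchr(spec, '@');
    flen = at ? (size_t)(at - spec) : strlen(spec);
    if (flen == 0 || flen >= sizeof out->file)
        return -1;
    memcpy(out->file, spec, flen);   /* memset above supplies the '\0' */
    if (at == NULL)
        return 0;                    /* no selector: the whole file */

    p = at + 1;
    if (*p == '\0')
        return -1;                   /* '@' with nothing after it */

    if (!isdigit((unsigned char)*p)) {
        /* Assumption: a selector not starting with a digit is an identifier. */
        if (strlen(p) >= sizeof out->ident)
            return -1;
        strcpy(out->ident, p);
        return 0;
    }

    /* Otherwise expect a comma-separated list of positive entry numbers. */
    while (*p != '\0') {
        char *end;
        long n = strtol(p, &end, 10);

        if (end == p || n <= 0 || n > INT_MAX || out->count == 16)
            return -1;
        out->numbers[out->count++] = (int)n;
        if (*end == ',') {
            p = end + 1;
            if (*p == '\0')
                return -1;           /* trailing comma */
        } else if (*end == '\0') {
            p = end;
        } else {
            return -1;               /* unexpected character */
        }
    }
    return 0;
}
```

Under these assumptions, parsing "myseqs@3,1,5" yields the file name "myseqs" and the numbers 3, 1 and 5 in that order, while "myseqs@al3csa" yields the same file name with the identifier "al3csa".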
The SEQIO package's release also comes with several complete programs. One of them is a file conversion program called "fmtseq" based on Don Gilbert's "readseq" program. One of the differences in fmtseq is its more robust interactive mode, with the following command interface display:
    Input:   fasta_out.30  (format: *auto*)
    Output:  *stdout*  (format: pretty)
    Deflts:  -verbose -gapin=- -gapout=- -bigalign
    Options: -ask
    Pretty:  -interleave -width=50 -colspace=10 -gapcount -nameleft=8 -nametop -interline=1

    Commands (-option - set option, -no... - unset option, ? -help - list options,
              -r -run - execute, -q -quit - exit program, other - set input file)
    Enter:  -nameleft=11 -width=60 -numtop -numbottom -run

The first two lines give the input and output (here the input file is "fasta_out.30" with an automatically determined format, and the output is sent to standard output using the pretty-print format). The next three lines give the program options currently set: for example, the `-gapin=-' and `-gapout=-' options specify the gap symbol in the input and the gap symbol to use in the output, the `-ask' option tells the program to ask the user which sequences should be converted, and the `-bigalign' option is discussed below. The last lines list the possible commands, along with an example command that sets four of the pretty-print options and then runs the conversion.
The fmtseq program also has several abilities not found in other file
conversion programs. One is the ability to convert between the GCG
and non-GCG forms of GenBank, PIR, EMBL, Swiss-Prot, FASTA, NBRF and
IG/Stanford entries without losing any of the header information.
Another is the ability to convert complete databases while retaining
the file structure of the database. For example, the command
"fmtseq genbank -split=fsa -format=fasta" will convert
the GenBank database into FASTA format while maintaining the divisions
between the files (i.e., so "gbbct.fsa" contains the conversions of
"gbbct.seq", "gbest1.fsa" contains all of the conversions of
"gbest1.seq", and so on).
Perhaps the most unusual feature, however, is the ability to take output from the FASTA and BLAST suites of programs and construct a big alignment from all of the pairwise alignments given in the file. (That is what the `-bigalign' option specifies.) The result is not a true multiple sequence alignment, in that no MSA algorithm is executed to produce it. Instead, the big alignment is formed by combining all of the pairwise alignments, using the query sequence as the reference point and adding gaps as needed. When reading the output of a BLASTN search, the big alignment also automatically separates the plus strand matches from the minus strand matches.
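The idea of combining pairwise alignments around the query can be sketched in a few lines of C. This is not fmtseq's implementation: the big_align function, the MAX_QUERY limit and the input layout are assumptions made for illustration, and the sketch assumes that every pairwise alignment spans the whole query (only the ungapped query lengths are checked for agreement) and ignores strands. For each query position it finds the widest insertion any pair places there, then pads every row to that width.

```c
#include <assert.h>
#include <string.h>

#define MAX_QUERY 256   /* illustrative limit on the query length */
#define GAP '-'

/* Merge npairs pairwise alignments into rows of equal width.
 * q[k] and s[k] are the gapped query and subject strings of pair k.
 * out[0] receives the query row and out[k + 1] the row for subject k;
 * each out[] buffer must hold at least outsize bytes.
 * Returns the alignment width, or -1 on bad input. */
int big_align(int npairs, const char *q[], const char *s[],
              char *out[], size_t outsize)
{
    int maxins[MAX_QUERY + 1] = {0};
    int n = -1, k, i, width = 0;

    assert(npairs > 0 && q != NULL && s != NULL && out != NULL);

    /* Pass 1: for each query position, record the widest insertion
     * (run of gaps in the query) that any pair places before it. */
    for (k = 0; k < npairs; k++) {
        int pos = 0, run = 0;
        const char *c;

        if (strlen(q[k]) != strlen(s[k]))
            return -1;
        for (c = q[k]; *c != '\0'; c++) {
            if (*c == GAP) {
                run++;
                continue;
            }
            if (pos >= MAX_QUERY)
                return -1;
            if (run > maxins[pos])
                maxins[pos] = run;
            run = 0;
            pos++;
        }
        if (run > maxins[pos])
            maxins[pos] = run;        /* insertion after the last residue */
        if (n < 0)
            n = pos;
        else if (n != pos)
            return -1;                /* pairs disagree on the query length */
    }

    for (i = 0; i <= n; i++)
        width += maxins[i];
    width += n;
    if ((size_t)width + 1 > outsize)
        return -1;

    /* Pass 2: write each subject row, padding its insertions with gaps. */
    for (k = 0; k < npairs; k++) {
        char *row = out[k + 1], *qrow = out[0];
        const char *qc = q[k], *sc = s[k];
        int col = 0;

        for (i = 0; i <= n; i++) {
            int run = 0, j;

            while (*qc == GAP) {      /* this pair's own insertion */
                row[col + run++] = *sc++;
                qc++;
            }
            for (j = run; j < maxins[i]; j++)
                row[col + j] = GAP;   /* pad up to the widest insertion */
            for (j = 0; j < maxins[i]; j++)
                qrow[col + j] = GAP;  /* the query has gaps in these columns */
            col += maxins[i];
            if (i < n) {              /* the column of query residue i */
                qrow[col] = *qc++;
                row[col++] = *sc++;
            }
        }
        row[col] = '\0';
        qrow[col] = '\0';
    }
    return width;
}
```

For example, merging the pair ("AC-GT", "ACTGT") with the pair ("ACGT", "A-GT") gives the rows "AC-GT" for the query, "ACTGT" for the first subject and "A--GT" for the second: the second subject gains one extra gap so that it lines up with the insertion found in the first.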
The release also contains a number of example programs that show how to use the SEQIO package. These include typeseq, which simply types (or "cat"s, for Unix folks) its input entries (although, with the database entry access described above, this program also works like GCG's fetch to fetch entries from a database); wcseq, which counts the number of sequences, entries and nucleotides/amino acids in the input; and grepseq, which can search for fixed-width motifs and output entries whose sequences match or approximately match the motif.
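As an illustration of what matching a fixed-width motif "approximately" can mean, the following sketch slides the motif along a sequence and accepts any window that differs in at most a given number of positions. How grepseq itself defines an approximate match is not stated here, so the mismatch-count (Hamming distance) criterion and the find_motif function are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the offset of the first window of seq that differs from motif
 * in at most max_mismatch positions, or -1 if there is no such window.
 * With max_mismatch == 0 this is an exact, fixed-width search. */
long find_motif(const char *seq, const char *motif, int max_mismatch)
{
    size_t slen, mlen, i, j;

    assert(seq != NULL && motif != NULL && max_mismatch >= 0);

    slen = strlen(seq);
    mlen = strlen(motif);
    if (mlen == 0 || mlen > slen)
        return -1;

    for (i = 0; i + mlen <= slen; i++) {
        int miss = 0;

        /* Stop comparing this window as soon as it has too many mismatches. */
        for (j = 0; j < mlen && miss <= max_mismatch; j++)
            if (seq[i + j] != motif[j])
                miss++;
        if (miss <= max_mismatch)
            return (long)i;
    }
    return -1;
}
```

With the sequence "ACGTTGCA", the motif "GTTG" is found exactly at offset 2, while the motif "GTAG" is not found exactly but is found at offset 2 when one mismatch is allowed.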
For example, the following program prints every entry of a database whose sequence contains a given keyword:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>   /* for strstr */
    #include "seqio.h"

    int main(int argc, char *argv[])
    {
      int len;
      char *seq, *entry;
      SEQFILE *sfp;

      if (argc != 3) {
        fprintf(stderr, "Usage: prog keyword dbase\n");
        exit(1);
      }

      /* Open the file or database named on the command line. */
      if ((sfp = seqfopen2(argv[2])) == NULL)
        exit(1);

      /* Read each sequence and print the entries that contain the keyword. */
      while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
        if (len > 0 && strstr(seq, argv[1]) != NULL) {
          entry = seqfentry(sfp, NULL, 0);
          fputs(entry, stdout);
        }
      }

      seqfclose(sfp);
      return 0;
    }

(The functions seqfopen2, seqfgetseq and seqfentry are SEQIO functions that open a file or database, read the next sequence and return the current sequence's entry. strstr is a C library function that finds the first occurrence of its second argument in its first argument.)
For more experienced programmers, the SEQIO package is a general-purpose, efficient and cross-platform module for reading and writing a number of different sequence file formats. It can read and return sequences and entries, as well as extract other information from an entry, such as identifiers, descriptions, organism names and comments. It can handle large sequences and large databases efficiently (the above program took less than 8 minutes on a DEC 5000 to search all of GenBank Release 87.0, about 800MB of text and 250MB of sequence, for a random twenty-character sequence).
The package is compatible with C and C++ programs, with most of the Unix variants and with Windows NT/95. The addition of new formats or the support of other operating systems only requires willing volunteers to provide examples of the format or to help me test and port the package onto a new machine (support for VMS will be coming by the end of July).
The SEQIO package is freely available to anyone by anonymous ftp from ftp.cs.ucdavis.edu. The release is a gzip'ed tar file containing the package code, documentation files and example programs.