SEQIO -- A Package for Sequence File I/O
QUICKREF.DOC - A Quick Reference Guide to the SEQIO Package
- seqfopen, seqfopendb, seqfopen2, seqfclose,
- seqfread, seqfgetseq, seqfgetrawseq, seqfgetentry, seqfgetinfo,
- seqfsequence, seqfrawseq, seqfentry, seqfinfo, seqfallinfo,
- seqfdbname, seqffilename, seqfformat, seqfdate,
- seqfmainid, seqfmainacc, seqfidlist,
- seqfdescription, seqfcomment, seqforganism,
- seqfiscircular, seqfisfragment, seqfalphabet,
- seqffragstart, seqftruelen, seqfrawlen,
- seqfentryno, seqfseqno, seqfnumseqs,
- seqfoneline,
- seqfsetdbname, seqfsetalpha, seqfsetidpref, seqfsetpretty,
- seqfwrite, seqfconvert, seqfputs, seqfannotate,
- seqfgcgify, seqfungcgify,
- bioseq_read, bioseq_check, bioseq_info, bioseq_matchinfo, bioseq_parse,
- seqfisafile, seqfisaformat, seqffmttype,
- seqfcanwrite, seqfcanannotate, seqfcangcgify,
- seqfbytepos, seqfparseent, asn_parse,
- seqfperror, seqfsetperror, seqferrpolicy
Defined Structures, Variables and Constants
- SEQFILE, SEQINFO (Structures)
- seqferrno, seqferrstr (Variables)
- DNA, RNA, PROTEIN, AMINO, UNKNOWN
- E_EOF, E_NOERROR, E_OPENFAILED, E_READFAILED, E_NOMEMORY,
- E_PROGRAMERROR, E_PREVERROR, E_PARAMERROR, E_INVFORMAT,
- E_DETFAILED, E_PARSEERROR, E_DBPARSEERROR, E_DBFILEERROR,
- E_NOSEQ, E_DIFFLENGTH, E_INVINFO, E_FILEERROR
- PE_NONE, PE_WARNONLY, PE_ERRONLY, PE_NOWARN,
- PE_NOEXIT, PE_ALL
SEQFILE *seqfopen(char *filename, char *mode, char *format)
- filename - the file to be opened
- mode - "r", "w" or "a"
- format - the file format name (optional for reading)
- returns an open SEQIO file structure (or NULL on error)
Open a file for reading or writing.
SEQFILE *seqfopendb(char *dbspec)
- dbspec - a BIOSEQ database search specification
- returns an open SEQIO file structure (or NULL on error)
Open a database (or part of a database) to be read.
SEQFILE *seqfopen2(char *string)
- string - the filename (if it specifies an existing file) or
database search specifier (otherwise)
- returns an open SEQIO file structure (or NULL on error)
Open a file for reading or start a database search.
void seqfclose(SEQFILE *sfp)
- sfp - the SEQFILE structure to be closed
- returns nothing
Close a file or database search.
int seqfread(SEQFILE *sfp, int flag)
- sfp - an open SEQFILE structure
- flag - read the next sequence (if zero) or entry (non-zero)
- returns 0 on success and -1 on EOF or error
Read the next sequence or sequence entry.
char *seqfgetseq(SEQFILE *sfp, int *length_out, int newbuffer)
char *seqfgetrawseq(SEQFILE *sfp, int *length_out, int newbuffer)
char *seqfgetentry(SEQFILE *sfp, int *length_out, int newbuffer)
SEQINFO *seqfgetinfo(SEQFILE *sfp, int newbuffer)
- sfp - an open SEQFILE structure
- length_out - address where the returned string's length is
stored (if not NULL)
- newbuffer - malloc a new buffer for the object (if non-zero) or
return an internal buffer (if zero)
- returns the sequence/entry text or the SEQINFO structure (or NULL
on error)
Read the next sequence or entry and return the sequence, entry or
sequence information.
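A minimal sketch of a read loop built on seqfgetseq, assuming that a NULL
return ends the loop (the list of error codes includes E_EOF, so a real
program would presumably check seqferrno afterwards to tell end of input
from a failure). The SEQFILE layout and the seqfgetseq body below are
hypothetical stand-ins, included only so the example compiles without the
package; total_length is an illustrative name, not part of SEQIO.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the library, ONLY so this sketch compiles on its own.
 * In real use, delete this part and include the package header instead. */
typedef struct { const char **seqs; int next; } SEQFILE;  /* hypothetical layout */

static char *seqfgetseq(SEQFILE *sfp, int *length_out, int newbuffer)
{
    const char *s = sfp->seqs[sfp->next];
    (void) newbuffer;             /* the stand-in never mallocs */
    if (s == NULL)
        return NULL;              /* no more sequences */
    sfp->next++;
    if (length_out != NULL)
        *length_out = (int) strlen(s);
    return (char *) s;
}

/* Sum the lengths of all remaining sequences.  Passing newbuffer = 0
 * asks for the internal buffer, so there is nothing to free here. */
long total_length(SEQFILE *sfp)
{
    long total = 0;
    int len;

    assert(sfp != NULL);
    while (seqfgetseq(sfp, &len, 0) != NULL)
        total += len;
    return total;
}
```

With the real package, sfp would come from seqfopen, seqfopendb or
seqfopen2 and be released with seqfclose once the loop finishes.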
Access Functions for the Current Sequence, Entry and Information
char *seqfsequence(SEQFILE *sfp, int *length_out, int newbuffer)
char *seqfrawseq(SEQFILE *sfp, int *length_out, int newbuffer)
char *seqfentry(SEQFILE *sfp, int *length_out, int newbuffer)
SEQINFO *seqfinfo(SEQFILE *sfp, int newbuffer)
- sfp - an open SEQFILE structure
- length_out - address where the returned string's length is
stored (if not NULL)
- newbuffer - malloc a new buffer for the object (if non-zero) or
return an internal buffer (if zero)
- returns the sequence or entry text or the SEQINFO structure (or NULL
on error)
Return the sequence, raw sequence, entry or sequence information for
the current sequence.
typedef struct {
char *dbname, *filename, *format;
int entryno, seqno, numseqs;
char *date, *idlist, *description;
char *comment, *organism, *history;
int isfragment, iscircular, alphabet;
int fragstart, truelen, rawlen;
} SEQINFO;
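As an illustration of working with the fields of this structure, the
sketch below formats a short summary line from a SEQINFO. The typedef is
repeated from above only so the example compiles on its own; summarize
and the output layout are hypothetical, not part of SEQIO.

```c
#include <assert.h>
#include <stdio.h>

/* Repeated from the reference so the sketch is self-contained;
 * real code would get this from the package header. */
typedef struct {
    char *dbname, *filename, *format;
    int entryno, seqno, numseqs;
    char *date, *idlist, *description;
    char *comment, *organism, *history;
    int isfragment, iscircular, alphabet;
    int fragstart, truelen, rawlen;
} SEQINFO;

/* Write a one-line summary of info into buf.  String fields may be
 * NULL, so the description is checked before it is printed.
 * Returns what snprintf returns (the untruncated length). */
int summarize(const SEQINFO *info, char *buf, size_t buflen)
{
    assert(info != NULL && buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%s: truelen=%d rawlen=%d%s%s",
                    info->description != NULL ? info->description
                                              : "(no description)",
                    info->truelen, info->rawlen,
                    info->iscircular ? ", circular" : "",
                    info->isfragment ? ", fragment" : "");
}
```

The package's own seqfoneline serves a similar purpose; this sketch only
shows direct field access.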
char *seqfdbname(SEQFILE *sfp, int newbuffer)
char *seqffilename(SEQFILE *sfp, int newbuffer)
char *seqfformat(SEQFILE *sfp, int newbuffer)
int seqfentryno(SEQFILE *sfp)
int seqfseqno(SEQFILE *sfp)
int seqfnumseqs(SEQFILE *sfp)
char *seqfdate(SEQFILE *sfp, int newbuffer)
char *seqfidlist(SEQFILE *sfp, int newbuffer)
char *seqfdescription(SEQFILE *sfp, int newbuffer)
char *seqfcomment(SEQFILE *sfp, int newbuffer)
char *seqforganism(SEQFILE *sfp, int newbuffer)
int seqfiscircular(SEQFILE *sfp)
int seqfisfragment(SEQFILE *sfp)
int seqffragstart(SEQFILE *sfp)
int seqfalphabet(SEQFILE *sfp)
int seqftruelen(SEQFILE *sfp)
int seqfrawlen(SEQFILE *sfp)
- sfp - an open SEQFILE structure
- newbuffer - malloc a new buffer for the object (if non-zero) or
return an internal buffer (if zero)
- returns the information string or integer
Access functions for information about the current sequence.
char *seqfmainid(SEQFILE *sfp, int newbuffer)
char *seqfmainacc(SEQFILE *sfp, int newbuffer)
- sfp - an open SEQFILE structure
- newbuffer - malloc a new buffer for the object (if non-zero) or
return an internal buffer (if zero)
- returns the identifier string or NULL
Access functions for the main identifiers of the current sequence.
int seqfoneline(SEQINFO *info, char *buffer, int buflen, int idonly)
- info - a SEQINFO structure
- buffer - the buffer to store the oneline description
- buflen - the buffer length
- idonly - only store an identifier for the sequence
- returns the length of the string stored in buffer
Constructs a "oneline" description of a sequence and stores it in the
buffer.
void seqfsetidpref(SEQFILE *sfp, char *idprefix)
void seqfsetdbname(SEQFILE *sfp, char *dbname)
void seqfsetalpha(SEQFILE *sfp, char *alphabet)
- sfp - a SEQFILE structure open for reading
- idprefix - the identifier prefix (if not NULL and not empty)
- dbname - the current database name (if not NULL and not empty)
- alphabet - the string used to determine the alphabet when the
entry does not specify an alphabet (if not NULL and not empty)
- returns nothing
Set or unset the value for the identifier prefix,
database name or alphabet.
int seqfwrite(SEQFILE *sfp, char *seq, int seqlen, SEQINFO *info)
- sfp - a SEQFILE structure open for writing
- seq - the sequence
- seqlen - the sequence length
- info - information about the sequence
- returns 0 on success and -1 on error
Output a sequence and its information.
int seqfconvert(SEQFILE *input_sfp, SEQFILE *output_sfp)
- input_sfp - a SEQFILE structure open for reading
- output_sfp - a SEQFILE structure open for writing
- returns 0 on success and -1 on error
Convert and output the current sequence of input_sfp.
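Because seqfconvert works on the current sequence of input_sfp, a whole
file is presumably converted by pairing it with seqfread in a loop. The
sketch below shows that pattern. The SEQFILE layout and both function
bodies are hypothetical stand-ins so the example compiles without the
package; convert_all is an illustrative name, not part of SEQIO.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the library, ONLY so this sketch compiles on its own. */
typedef struct { int remaining; int written; } SEQFILE;  /* hypothetical layout */

static int seqfread(SEQFILE *sfp, int flag)
{
    (void) flag;
    if (sfp->remaining <= 0)
        return -1;                /* EOF or error */
    sfp->remaining--;
    return 0;
}

static int seqfconvert(SEQFILE *input_sfp, SEQFILE *output_sfp)
{
    (void) input_sfp;
    output_sfp->written++;
    return 0;
}

/* Convert every remaining sequence of in to out.
 * Returns the number of sequences converted, or -1 if a conversion fails. */
int convert_all(SEQFILE *in, SEQFILE *out)
{
    int count = 0;

    assert(in != NULL && out != NULL);
    while (seqfread(in, 0) == 0) {        /* flag 0: advance by sequence */
        if (seqfconvert(in, out) != 0)
            return -1;
        count++;
    }
    return count;
}
```

Since seqfread returns -1 for both EOF and errors, a real program would
likely inspect seqferrno after the loop.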
int seqfputs(SEQFILE *sfp, char *s, int len)
- sfp - a SEQFILE structure open for writing
- s - the string to output
- len - the number of chars to output (or 0, specifying to output
to the end of s)
- returns 0 on success and -1 on error
Output a string on the output stream (without any transformation or
checking).
int seqfannotate(SEQFILE *sfp, char *entry, int entrylen, char *newcomment,
int flag)
- sfp - a SEQFILE structure open for writing
- entry - the entry text to output
- entrylen - the length of the entry text
- newcomment - the comment to add to the entry
- flag - remove existing comments (if zero) or append the new
comment (if non-zero)
- returns 0 on success and -1 on error
Output the passed in entry, adding the new comment. (The entry must
be in the format specified when opening the output stream.)
int seqfgcgify(SEQFILE *sfp, char *entry, int entrylen)
- sfp - a SEQFILE structure open for writing
- entry - the entry text to convert to the GCG format
- entrylen - the length of the entry text
- returns 0 on success and -1 on error
Output the passed in entry, converting the sequence lines into the GCG
format. (The SEQFILE structure must be opened to one of the GCG-*
formats, and the format of the entry must match the `*' of the GCG-*.)
int seqfungcgify(SEQFILE *sfp, char *entry, int entrylen)
- sfp - a SEQFILE structure open for writing
- entry - the entry text to convert from the GCG format
- entrylen - the length of the entry text
- returns 0 on success and -1 on error
Output the passed in entry, converting the sequence lines back to the
original format (from the GCG format). (The format of the entry must
be one of the GCG-* formats, and the SEQFILE structure must be opened
to the `*' format matching the GCG-*.)
int bioseq_read(char *filelist)
- filelist - a comma separated list of files (must be BIOSEQ files)
- returns 0 on success and -1 on error
Read one or more BIOSEQ files and store the BIOSEQ entries found in
those files.
int bioseq_check(char *dbspec)
- dbspec - a database search specifier
- returns non-zero if the string refers to a known database, or
returns zero otherwise
Test if the dbspec refers to a known BIOSEQ entry.
char *bioseq_info(char *dbspec, char *fieldname)
- dbspec - a database search specifier
- fieldname - the name of the information field to be returned
- returns the text for that field of the BIOSEQ entry for that
database.
(NOTE: the returned string buffer is a malloc'ed buffer, and it must
be freed by you.)
Retrieve an information field for a BIOSEQ entry.
char *bioseq_matchinfo(char *fieldname, char *fieldvalue)
- fieldname - the name of an information field
- fieldvalue - the value that the information field should have
- returns the name of a database.
(NOTE: the returned string buffer is a malloc'ed buffer, and it must
be freed by you.)
Find the first database (in the list of BIOSEQ entries) which has an
information field matching `fieldname' and whose value matches
`fieldvalue'.
char *bioseq_parse(char *dbspec)
- dbspec - a database search specifier
- returns the list of files in a string where each file is
terminated by a newline character and the whole string is terminated
by a NUL character ('\0').
(NOTE: the returned string buffer is a malloc'ed buffer, and it must
be freed by you.)
Parse a BIOSEQ database specification and get the list
of files that should be opened and read in that search.
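The returned list format (each filename followed by a newline) can be
walked with ordinary string functions. The sketch below is one way to do
it; count_files and nth_file are hypothetical helper names, not part of
SEQIO, and the filenames used to exercise them would be made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the filenames in a list where each name ends with '\n'. */
int count_files(const char *list)
{
    int n = 0;

    assert(list != NULL);
    for (; *list != '\0'; list++)
        if (*list == '\n')
            n++;
    return n;
}

/* Copy the index-th filename (0-based) into out, without the newline.
 * Returns its length, or -1 if there is no such entry or it does not fit. */
int nth_file(const char *list, int index, char *out, size_t outsize)
{
    const char *end;

    assert(list != NULL && out != NULL && outsize > 0);
    while (index > 0 && (end = strchr(list, '\n')) != NULL) {
        list = end + 1;           /* skip past one complete entry */
        index--;
    }
    end = strchr(list, '\n');
    if (index != 0 || end == NULL)
        return -1;
    if ((size_t) (end - list) >= outsize)
        return -1;
    memcpy(out, list, (size_t) (end - list));
    out[end - list] = '\0';
    return (int) (end - list);
}
```

The string returned by bioseq_parse itself still has to be freed by the
caller once the names have been copied out.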
int seqfisafile(char *filename)
- filename - a filename (with a possible "@..." single entry access specification)
- returns non-zero (if the filename refers to an existing file) or zero
(if not)
Test whether the filename refers to an existing file (even when the
string contains a single entry access specification).
int seqfisaformat(char *format)
- format - a file format string
- returns non-zero (if the string is a valid file format) or zero
(if not)
Test whether the string is a valid file format.
int seqffmttype(char *format)
- format - a file format string
- returns the format type or T_INVFORMAT (for an invalid
format)
Return a type information value about the format.
int seqfcanwrite(char *format)
- format - a file format string
- returns non-zero (if that format is writeable) or zero (if not)
Test whether the format is writeable.
int seqfcanannotate(char *format)
- format - a file format string
- returns non-zero (if the format's entries can be annotated) or
zero (if not)
Test whether entries in the format can be annotated.
int seqfcangcgify(char *format)
- format - a file format string
- returns non-zero (if the format's entries can be gcgified/ungcgified) or
zero (if not)
Test whether entries in the format can be gcgified or ungcgified.
int seqfbytepos(SEQFILE *sfp)
- sfp - a SEQFILE structure open for reading
- returns the current byte position, or -1 on error
Return the current byte position in the file being read.
void seqfsetpretty(SEQFILE *sfp, int value)
- sfp - a SEQFILE structure open for writing
- value - either non-zero or zero
- returns nothing
Set whether whitespace is added to the output sequence
(Plain, FASTA, NBRF and IG/Stanford formats only).
SEQINFO *seqfparseent(char *entry, int entrylen, char *format)
- entry - the text of an entry
- entrylen - the length of the entry
- format - the format of the entry
- returns a malloc'ed SEQINFO structure containing the
information about the entry.
(NOTE: This structure must be freed by you.)
Retrieve the sequence information stored in the passed in entry.
int asn_parse(char *begin, char *end, ...)
- begin - the beginning of the ASN.1 text
- end - the end of the ASN.1 text (i.e., the last character of the
ASN.1 text is at `end-1')
- ... - a NULL terminated list of arguments specifying the
sub-records to be searched and the variables to store the beginning
and end positions of the sub-record text.
(NOTE: These arguments must be given in groups of 3 until the NULL
termination, such as in
"seq.id.genbank", &gbstart, &gbend,
"seq.descr", &destart, &deend,
NULL
The format for each triple is
char *subrecord, char **begin_out, char **end_out
and either begin_out or end_out can be NULL.)
- returns the number of sub-records found, or -1 on error
Search an ASN.1 text format record (the string from
`begin' to `end') for specified sub-records.
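The text between two such positions is not NUL-terminated, so it has to
be copied by length. The sketch below assumes that the positions stored
through begin_out and end_out follow the same convention as `begin' and
`end' (last character at `end-1'). copy_range is a hypothetical helper
name, not part of SEQIO.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the text in [begin, end) into out as a C string.
 * Returns the number of characters copied, or -1 if it does not fit. */
int copy_range(const char *begin, const char *end, char *out, size_t outsize)
{
    size_t len;

    assert(begin != NULL && end != NULL && begin <= end && out != NULL);
    len = (size_t) (end - begin);
    if (len >= outsize)
        return -1;                /* no room for the text plus '\0' */
    memcpy(out, begin, len);
    out[len] = '\0';
    return (int) len;
}
```

In a program using asn_parse, the begin and end arguments would be the
pointers filled in for one triple, for example gbstart and gbend from
the argument list shown above.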
extern int seqferrno;
extern char seqferrstr[];
External variables giving an error value and an error message string.
void seqfperror(char *s)
- s - a string (usually the program name) to be printed before the
error string
- returns nothing
Output an error message, similar to the Unix perror.
void seqfsetperror(void (*perr_fn)(char *))
- perr_fn - a void function that takes a string as its argument
- returns nothing
Sets the function used by the package to output all of its error
messages. If the argument is NULL, the default function (outputting
to stderr) will be used.
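A handler passed to seqfsetperror only needs to match the documented
void (*)(char *) type. The sketch below keeps the most recent message in
memory instead of printing it; collect_error, last_error and error_count
are hypothetical names, not part of SEQIO.

```c
#include <assert.h>
#include <string.h>

static char last_error[256];      /* most recent message, possibly truncated */
static int error_count = 0;       /* number of messages seen so far */

/* Error handler with the void (*)(char *) signature described above. */
void collect_error(char *msg)
{
    assert(msg != NULL);
    strncpy(last_error, msg, sizeof last_error - 1);
    last_error[sizeof last_error - 1] = '\0';   /* strncpy may not terminate */
    error_count++;
}
```

It would be installed with seqfsetperror(collect_error), and the default
behavior restored with seqfsetperror(NULL).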
int seqferrpolicy(int pe)
- pe - the new error policy (one of the PE_* constants)
- returns the old error policy
Sets the way the SEQIO package reports errors.
James R. Knight,
knight@cs.ucdavis.edu
June 26, 1996