SEQIO -- A Package for Sequence File I/O
CHANGES - Changes to the SEQIO Package
This file lists the changes made when going from one version to the
next. It should be detailed enough that you won't need to go through
the rest of the documentation to find out what's new.
Changes from Version 1.2 to Version 1.3
Minor Changes
- Added a new example program, example4. (Version 1.2.2)
- Fixed a bug that kept seqfentry from returning the correct entry
text when mmap'ing was used. (Version 1.2.2)
- Added the definition of FILENAME_MAX to fmtseq and idxseq, to
maintain compatibility with SunOS 4.1.1. (Version 1.2.2)
- Changed genbank_annotate and pir_annotate to be a little bit more
robust. (Version 1.2.2)
- Removed the getpagesize system call from the package, since
Solaris doesn't support it. (Version 1.2.1)
- Fixed an uninitialized variable bug in databank_fast_read (one
that my version of gcc didn't catch). (Version 1.2.1)
Changes from Version 1.1 to Version 1.2
New Formats/Porting and Format Changes
Added the GCG format
Added the GCG-* format specification for the GCG forms of the
GenBank, PIR, EMBL, Swiss-Prot, FASTA, NBRF and IG/Stanford formats.
Added the MSF Multiple Sequence Format
Added the BLASTN/BLASTP/BLASTX program output format
Added handling for NID and PID identifiers in the GenBank and EMBL
formats (although, since neither format's release notes explicitly
define a PID/PI line, no such line is output by the package).
New Programs and Program Changes
Added idxseq, a database indexing program
Added a number of example programs.
Changed the name of keyword to grepseq.
Extended the fmtseq program in the following ways:
- Added support for the GCG, GCG-*, MSF and BLAST output formats.
(This support includes "no loss" conversions between the non-GCG and
GCG forms of the GCG-* formats.)
- Added a run mode capability using the `-mode' option and a
user-created "fmtseq" BIOSEQ entry. This gives the user the ability to
set and unset multiple options at once.
- Added the `-split' option for non-GCG output, so that the user can
produce a set of output files whose contents correspond to the input
files given to it (for example, the contents of the input file
"gbbct.seq" are converted and written to a corresponding file
"gbbct.fasta").
- Extended the `-split' option for GCG output, so that each entry is
output in its own, individual file (whose name is the entry identifier
string followed by the `-split' extension).
- Added a `-long' option, which performs the file conversions so that
each input entry's header text appears as a comment in the converted
entry.
- Added a `-skipempty' option to the Pretty-print format, so that lines
containing only gap characters are not output (making multiple
alignments of things like the BLAST output much easier to read).
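The file naming used by the `-split' option for non-GCG output
("gbbct.seq" becoming "gbbct.fasta") can be pictured with the sketch
below. It is only an illustration: split_output_name is a hypothetical
helper, not a function of fmtseq or the package, and keeping the
directory part of the input name is an assumption.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build an output file name by replacing the extension of `input' with
 * `ext' (given without the leading dot).  If the file name has no
 * extension, ".ext" is appended instead.
 * Returns 0 on success, -1 if `out' is too small.
 * NOTE: hypothetical helper, not part of fmtseq.
 */
int split_output_name(const char *input, const char *ext,
                      char *out, size_t outlen)
{
    const char *slash = strrchr(input, '/');
    const char *base = (slash != NULL) ? slash + 1 : input;
    const char *dot = strrchr(base, '.');
    size_t stem;

    /* A leading dot (e.g. ".profile") is not treated as an extension. */
    if (dot != NULL && dot != base)
        stem = (size_t)(dot - input);
    else
        stem = strlen(input);

    if (stem + 1 + strlen(ext) + 1 > outlen)
        return -1;

    memcpy(out, input, stem);
    out[stem] = '.';
    strcpy(out + stem + 1, ext);
    return 0;
}
```

Calling split_output_name("gbbct.seq", "fasta", buf, sizeof(buf)) fills
buf with "gbbct.fasta".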
New Capabilities of the SEQIO Package
Added the ability for the user to specify single entries of a file,
either by entry position, by byte offset or by entry identifier.
Added the ability for the user to specify single entries of a
database, using the database identifiers and random access of the
database entries.
The BIOSEQ environment variable can now take a full PATH-like
specification, naming more than one BIOSEQ file.
BIOSEQ entries can now have multiline information fields.
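As a rough picture of what a PATH-like BIOSEQ value involves, the
sketch below pulls the individual file names out of such a list. It
assumes ':' as the separator (as in a Unix PATH); the separator the
package actually uses is not stated here, and path_component is a
hypothetical helper, not a package function.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the n-th (0-based) component of a ':'-separated list into `out'.
 * Returns 1 if the component exists and fits in `out', 0 otherwise.
 * ASSUMPTION: ':' is the separator, as in a Unix PATH.
 */
int path_component(const char *list, int n, char *out, size_t outlen)
{
    const char *start = list;
    const char *end;
    size_t len;

    for (;;) {
        end = strchr(start, ':');
        len = (end != NULL) ? (size_t)(end - start) : strlen(start);
        if (n == 0)
            break;              /* found the requested component */
        if (end == NULL)
            return 0;           /* ran out of components */
        start = end + 1;
        n--;
    }

    if (len + 1 > outlen)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

In a real program the list would come from getenv("BIOSEQ"), and the
caller would loop over n = 0, 1, 2, ... until the function returns 0.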
Data Structure Changes
Added the fields `rawlen' and `fragstart' to the SEQINFO structure.
Removed the `mainid' and `mainacc' fields from the SEQINFO structure
and moved all identifiers into `idlist'.
New Functions
char *seqfgetrawseq(SEQFILE *sfp,
int *length_out, int newbuffer)
Added the `get' version of `seqfrawseq', because its lack was annoying.
int seqffragstart(SEQFILE *sfp)
This SEQINFO access function returns the starting position of a
fragment sequence (if the sequence is a fragment and the starting
position is known).
int seqfrawlen(SEQFILE *sfp)
This SEQINFO access function returns the length of the raw sequence.
int seqfoneline(SEQINFO *info, char
*buffer, int buflen, int idonly)
This function constructs a "oneline description" of a sequence, based
on the information in the SEQINFO structure.
int seqfputs(SEQFILE *sfp, char *s,
int len)
This function outputs a string on the output stream opened for the
SEQFILE structure.
int seqfgcgify(SEQFILE *sfp, char
*entry, int entrylen)
This function takes an entry in the non-GCG form of one of the GCG-*
formats and outputs the GCG form of that entry.
int seqfungcgify(SEQFILE *sfp, char
*entry, int entrylen)
This function takes an entry in the GCG form of one of the GCG-*
formats and outputs the non-GCG form of that entry.
char *bioseq_matchinfo(char
*fieldname, char *fieldvalue)
This function finds the database whose BIOSEQ entry contains an
information field with the given field name and field value.
int seqfisafile(char *filename)
This function tests whether the string given to it is an existing
file, even when the string includes a single entry access
specification.
int seqfcangcgify(char *format)
This signals whether the given format is one of the GCG-* formats.
void seqfbytepos(SEQFILE *sfp)
This function returns the byte offset of the current entry in the
current file.
void seqfsetperror(void
(*perr_fn)(char *))
This function sets the "print error" function the package uses to
perform all of its error printing.
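The idea behind a replaceable "print error" function can be sketched
as follows. This is not the package's code: demo_setperror,
demo_report and capture_error are hypothetical stand-ins that only
mirror the parameter shape of seqfsetperror, and falling back to the
default handler on NULL is an assumption.

```c
#include <stdio.h>

/* Default handler: print the message to stderr. */
static void default_perror(char *msg)
{
    fputs(msg, stderr);
}

/* Currently installed handler (stand-in for the package's state). */
static void (*current_perror)(char *) = default_perror;

/* Same parameter shape as seqfsetperror: install a new handler.
 * ASSUMPTION: passing NULL restores the default handler. */
void demo_setperror(void (*perr_fn)(char *))
{
    current_perror = (perr_fn != NULL) ? perr_fn : default_perror;
}

/* What a library routine would call to report an error. */
void demo_report(char *msg)
{
    current_perror(msg);
}

/* Example handler: remember the last message instead of printing it. */
static char last_error[256];

void capture_error(char *msg)
{
    snprintf(last_error, sizeof(last_error), "%s", msg);
}

const char *demo_last_error(void)
{
    return last_error;
}
```

After demo_setperror(capture_error), every message passed to
demo_report ends up in the buffer returned by demo_last_error, which
is the kind of redirection a GUI or logging program might want.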
Function Changes
SEQFILE *seqfopen(char
*filename, char *mode, char *format)
Seqfopen now automatically reads the first entry of the file, so the
format of a file is always determined by the time seqfopen returns.
Also, it now supports the single entry access to a file's entries.
int seqftruelen(SEQFILE *sfp)
This function now always returns the "true" length of the current sequence,
ignoring any alignment or notational characters.
char *seqfmainid(SEQFILE *sfp,
int newbuffer)
char *seqfmainacc(SEQFILE
*sfp, int newbuffer)
These two functions are no longer simple access functions to the
SEQINFO structure (since their corresponding fields were removed from
the structure). Now, they access information from the `idlist' field
to construct the "main" identifier and "main" accession number.
int seqfannotate(SEQFILE *sfp, char
*entry, int entrylen, char *newcomment, int flag)
This function now takes a SEQFILE structure, instead of a stdio FILE
structure, as its first parameter. The format parameter has been
removed, since the SEQFILE structure specifies the format the given
entry must be in.
char *bioseq_info(char *dbspec, char
*fieldname)
A special case has been added to this function: when the fieldname is
"Root", the root directory of the database's BIOSEQ entry is now
returned. Thus, no information field with the name "Root" can appear
in a BIOSEQ entry. (Ok, it can appear there, but there's no way to
access the information from it.)
Changes from Version 1.0 to Version 1.1
New Formats/Porting and Format Changes
Added PHYLIP Interleaved and
Sequential file formats
Added the Clustalw file format
Added the FASTA/TFASTA/SSEARCH/LFASTA/LALIGN/ALIGN program output
format
Reimplemented the NBRF format, now
that I found out where the documentation was.
Ported it to Windows NT/95
Successfully compiled it on Solaris
Successfully compiled it using g++
New Programs
Added fmtseq, the file format conversion
program
Added keyword, a program to search for keyword/motif matches
Data Structure Changes
Added fields `mainid' and
`mainacc' to the SEQINFO structure
So, now the identifiers in an entry are split up into these two fields
plus `idlist'. The `mainid' field gets the main identifier, the
`mainacc' field gets the main accession number, and `idlist' gets all
of the other identifiers.
New Functions
char *seqfrawseq(SEQFILE *sfp,
int *length_out, int newbuffer)
Returns the "raw" sequence given in the entry, which includes any
alignment or structural notation characters in addition to the
sequence itself.
Typically, `seqfsequence' extracts only the alphabetic characters,
whereas `seqfrawseq' extracts all characters except whitespace and
digits. See "format.doc" for the full details.
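The "typical" difference between the two functions can be illustrated
with two small filters. These are hypothetical helpers (keep_alpha and
keep_raw), not the package's implementation, and they ignore the
per-format details described in "format.doc".

```c
#include <ctype.h>
#include <stddef.h>

/* Keep only alphabetic characters (the typical `seqfsequence' view).
 * `out' must have room for strlen(in) + 1 bytes. */
size_t keep_alpha(const char *in, char *out)
{
    size_t n = 0;

    for (; *in != '\0'; in++) {
        if (isalpha((unsigned char)*in))
            out[n++] = *in;
    }
    out[n] = '\0';
    return n;
}

/* Keep everything except whitespace and digits (the typical
 * `seqfrawseq' view).  `out' must have room for strlen(in) + 1 bytes. */
size_t keep_raw(const char *in, char *out)
{
    size_t n = 0;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (!isspace(c) && !isdigit(c))
            out[n++] = *in;
    }
    out[n] = '\0';
    return n;
}
```

For an entry line such as "  1 ACG-T .AC 10", keep_alpha yields
"ACGTAC" while keep_raw yields "ACG-T.AC", so the gap and notation
characters survive only in the raw form.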
char *seqfmainid(SEQFILE *sfp,
int newbuffer)
char *seqfmainacc(SEQFILE
*sfp, int newbuffer)
Access functions for the new information fields `mainid' and
`mainacc'.
void seqfsetidpref(SEQFILE
*sfp, char *idprefix)
void seqfsetdbname(SEQFILE
*sfp, char *dbname)
void seqfsetalpha(SEQFILE
*sfp, char *alphabet)
Sets the identifier prefix, database name and sequence alphabet for
the sequences read in using the given SEQFILE structure.
int seqfisaformat(char *format)
Tests a format string to see if it's a supported file format.
int seqffmttype(char *format)
Returns a type information value about the given format (see
"format.doc" for the details about the format types).
int seqfcanwrite(char *format)
Can the package output entries in that format?
int seqfcanannotate(char *format)
Can the package annotate entries in that format?
int bioseq_check(char *dbspec)
Does the database search specification refer to a known database? Is
there a BIOSEQ entry for it?
int seqfsetpretty(SEQFILE
*sfp, int value)
When outputting entries in the Plain, FASTA, NBRF or IG/Stanford
formats, this specifies whether to add spaces to make the sequence
look prettier or not.
By default, the output operations look at the sequence being output,
and only add spaces when the sequence is DNA, RNA or Protein and when
there are no non-alphabetic characters in the sequence.
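The default rule can be sketched roughly as below. This is an
illustration only: all_alpha and pretty_copy are hypothetical helpers,
the group size is an arbitrary parameter rather than the spacing the
package uses, and the DNA/RNA/Protein alphabet check is left out.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if every character of `seq' is alphabetic, 0 otherwise. */
int all_alpha(const char *seq)
{
    for (; *seq != '\0'; seq++) {
        if (!isalpha((unsigned char)*seq))
            return 0;
    }
    return 1;
}

/*
 * Copy `seq' into `out', inserting a space after every `group'
 * characters, but only when the sequence is purely alphabetic.
 * Otherwise the sequence is copied unchanged.
 * `out' needs room for strlen(seq) + strlen(seq)/group + 1 bytes.
 * Returns the number of characters written (excluding the '\0').
 */
size_t pretty_copy(const char *seq, size_t group, char *out)
{
    int pretty = (group > 0) && all_alpha(seq);
    size_t i, n = 0;

    for (i = 0; seq[i] != '\0'; i++) {
        if (pretty && i > 0 && i % group == 0)
            out[n++] = ' ';
        out[n++] = seq[i];
    }
    out[n] = '\0';
    return n;
}
```

With a group size of 5, "ACGTACGTACGT" becomes "ACGTA CGTAC GT", while
a sequence containing a gap character such as "AC-GT" is left alone.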
Minor Changes
- Removed "#include <unistd.h>" since it was not needed
- Fixed a bug in the bioseq_parse directory reading code
(it now skips entries "." and "..")
- Replaced "strerror(errno)" with "sys_errlist[errno]"
- Replaced S_ISREG and S_ISDIR with their macro expansions
- Changed the error macros so that the return argument is
the complete return command, instead of just the return value
- Made my own versions of toupper, strcasecmp, strncasecmp
- Changed the `fasta_read' and `fasta_getinfo' functions so that
any lines beginning with ';' that occur before any of the sequence are
considered as part of the entry header and are added to the comment
lines when filling in the SEQINFO fields.
- Changed `seqfsetidpref' and `add_id' to convert all identifier
prefixes to lowercase.
- Made some minor changes to the FASTA, NBRF and IG/Stanford
putseq functions, rearranging where the main identifier and main
accession number are placed in an outputted entry.
- Added a stripflag argument to parse_comment and add_comment so
that spaces won't be stripped from comments in some formats.
- Added prototypes to all of the functions declared just before
the file_table.
- Added `extern "C" {' and `}', ifdef'ed on `__cplusplus', at
the beginning and end of "seqio.h".
- Created typedefs for the two enum's in the INTSEQFILE structure,
to be compatible with g++ compilation.
- Added explicit conversions for all assignments involving void *
variables.
- In seqfopendb, the format and idprefix information field values
are now tested to see if they contain valid values.
- Fixed a bug in fasta_read, nbrf_read and stanford_read which
allowed the current file pointer to move past the end of the read file
buffer (which causes a seg fault when mmap buffers are being used).
- The access functions for the SEQINFO fields have been collapsed
into a bunch of stub functions and intseq_field[123].
- Added format specific variables to the INTSEQFILE structure,
which are used by the NBRF, PHYLIP, Clustalw and FASTA-output formats.
- Fixed GenBank, PIR, EMBL, Swiss-Prot and NBRF output functions so
that accession number lists don't overflow past the line length.
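The `extern "C"' change in the list above follows a common pattern for
headers shared between C and C++ compilers. A minimal sketch of that
pattern is shown here; demo_add is a hypothetical function used only
for illustration, and the actual contents of "seqio.h" are not
reproduced.

```c
/* --- would live in a header file --- */
#ifdef __cplusplus
extern "C" {            /* seen only by C++ compilers such as g++ */
#endif

int demo_add(int a, int b);     /* hypothetical declaration */

#ifdef __cplusplus
}
#endif

/* --- would live in the C source file --- */
int demo_add(int a, int b)
{
    return a + b;
}
```

A C compiler never sees the `extern "C"' lines, while a C++ compiler
uses them to give the declarations C linkage, so the same header works
for both.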
James R. Knight,
knight@cs.ucdavis.edu
July 9, 1996