SEQIO -- A Package for Sequence File I/O

CHANGES - Changes to the SEQIO Package

This file lists the changes made when going from one version to the next. It should be detailed enough that you won't need to go through the rest of the documentation to find out what's new.


Changes from Version 1.2 to Version 1.3

Minor Changes

Changes from Version 1.1 to Version 1.2

New Formats/Porting and Format Changes

Added the GCG format

Added the GCG-* format specifications for the GCG forms of the GenBank, PIR, EMBL, Swiss-Prot, FASTA, NBRF and IG/Stanford formats.

Added the MSF Multiple Sequence Format

Added BLASTN/BLASTP/BLASTX program output format

Added handling for NID and PID identifiers in the GenBank and EMBL formats (although, since neither format's release notes explicitly define a PID/PI line, no such line is output by the package).

New Programs and Program Changes

Added idxseq, a database indexing program

Added a number of example programs.

Changed the name of keyword to grepseq.

Extended the fmtseq program in the following ways:

New Capabilities of the SEQIO Package

Added the ability for the user to specify single entries of a file, either by entry position, by byte offset or by entry identifier.

Added the ability for the user to specify single entries of a database, using the database identifiers and random access of the database entries.

The BIOSEQ environment variable can now take a full PATH-like specification, specifying more than one BIOSEQ file.
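As an illustration of what a PATH-like specification implies, here is a minimal sketch of splitting such a value into individual file names. It is not SEQIO code: `path_entry' and `list_bioseq_files' are hypothetical helpers, and the `:' separator is an assumption borrowed from the Unix PATH convention.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of SEQIO: copies the index-th entry of a
 * colon-separated list into out (at most outlen-1 characters).
 * Returns 1 if that entry exists, 0 otherwise.  The ':' separator is an
 * assumption borrowed from the Unix PATH convention. */
int path_entry(const char *list, int index, char *out, size_t outlen)
{
    const char *start = list;

    if (list == NULL || out == NULL || outlen == 0 || index < 0)
        return 0;

    for (;;) {
        const char *end = strchr(start, ':');
        size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);

        if (index == 0) {
            if (len >= outlen)
                len = outlen - 1;      /* truncate rather than overflow */
            memcpy(out, start, len);
            out[len] = '\0';
            return 1;
        }
        if (end == NULL)
            return 0;                  /* ran out of entries */
        start = end + 1;
        index--;
    }
}

/* Prints every file named in a PATH-like list, one per line, and returns
 * how many non-empty entries were found. */
int list_bioseq_files(const char *list)
{
    char name[256];
    int i, count = 0;

    for (i = 0; path_entry(list, i, name, sizeof name); i++) {
        if (name[0] == '\0')
            continue;                  /* skip empty entries such as "a::b" */
        count++;
        printf("BIOSEQ file %d: %s\n", count, name);
    }
    return count;
}
```

In a real program the list would typically come from getenv("BIOSEQ"); the order in which the package itself consults the files is not described here.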

BIOSEQ entries can now have multiline information fields.

Data Structure Changes

Added the fields `rawlen' and `fragstart' to the SEQINFO structure

Removed the `mainid' and `mainacc' fields from the SEQINFO structure and moved all identifiers into `idlist'.

New Functions

char *seqfgetrawseq(SEQFILE *sfp, int *length_out, int newbuffer)
Added the `get' version of `seqfrawseq', because its lack was annoying.
int seqffragstart(SEQFILE *sfp)
This SEQINFO access function returns the starting position of a fragment sequence (if the sequence is a fragment and the starting position is known).
int seqfrawlen(SEQFILE *sfp)
This SEQINFO access function returns the length of the raw sequence.
int seqfoneline(SEQINFO *info, char *buffer, int buflen, int idonly)
This function constructs a "oneline description" of a sequence, based on the information in the SEQINFO structure.
int seqfputs(SEQFILE *sfp, char *s, int len)
This function outputs a string on the output stream opened for the SEQFILE structure.
int seqfgcgify(SEQFILE *sfp, char *entry, int entrylen)
This function takes an entry in the non-GCG form of one of the GCG-* formats and outputs the GCG form of that entry.
int seqfungcgify(SEQFILE *sfp, char *entry, int entrylen)
This function takes an entry in the GCG form of one of the GCG-* formats and outputs the non-GCG form of that entry.
char *bioseq_matchinfo(char *fieldname, char *fieldvalue)
This function finds the database whose BIOSEQ entry contains an information field with the given field name and field value.
int seqfisafile(char *filename)
This function tests whether the string given to it is an existing file, even when the string includes a single entry access specification.
int seqfcangcgify(char *format)
This signals whether the given format is one of the GCG-* formats.
int seqfbytepos(SEQFILE *sfp)
This function returns the byte offset of the current entry in the current file.
void seqfsetperror(void (*perr_fn)(char *))
This function sets the "print error" function the package uses to perform all of its error printing.
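The general pattern behind a settable "print error" function can be sketched as follows. This is a stand-alone illustration, not the package's implementation: `set_perror', `report_error' and `capture_error' are hypothetical names, and restoring the default handler when NULL is passed is an assumption of the sketch.

```c
#include <stdio.h>
#include <string.h>

/* Default handler of this sketch: write the message to stderr. */
static void default_perror(char *msg)
{
    fputs(msg, stderr);
}

/* The handler currently in effect. */
static void (*current_perror)(char *) = default_perror;

/* Hypothetical analogue of seqfsetperror.  Passing NULL restores the
 * default handler (an assumption of this sketch, not documented
 * SEQIO behavior). */
void set_perror(void (*perr_fn)(char *))
{
    current_perror = (perr_fn != NULL) ? perr_fn : default_perror;
}

/* Every error message in the library would be routed through here. */
void report_error(char *msg)
{
    current_perror(msg);
}

/* An application-side handler that remembers the most recent message
 * instead of printing it, e.g. for display in a GUI dialog. */
char last_error[256];

void capture_error(char *msg)
{
    strncpy(last_error, msg, sizeof last_error - 1);
    last_error[sizeof last_error - 1] = '\0';
}
```

With the real package, an application would pass a handler of the same shape, taking a single `char *' argument, to `seqfsetperror'.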

Function Changes

SEQFILE *seqfopen(char *filename, char *mode, char *format)
Seqfopen now automatically reads the first entry of the file, so the format of a file is always determined when seqfopen returns. Also, it now supports single entry access to a file's entries.
int seqftruelen(SEQFILE *sfp)
This function now always returns the "true" length of the current sequence, ignoring any alignment or notational characters.
char *seqfmainid(SEQFILE *sfp, int newbuffer)
char *seqfmainacc(SEQFILE *sfp, int newbuffer)
These two functions are no longer simple access functions to the SEQINFO structure (since their corresponding fields were removed from the structure). Now, they access information from the `idlist' field to construct the "main" identifier and "main" accession number.
int seqfannotate(SEQFILE *sfp, char *entry, int entrylen, char *newcomment, int flag)
This function now takes a SEQFILE structure, instead of a stdio FILE structure, as its first parameter. The format parameter has been removed, since the SEQFILE structure specifies what format the given entry must be in.
char *bioseq_info(char *dbspec, char *fieldname)
A special case has been added to this function, in that when the fieldname is "Root", the root directory of the database's BIOSEQ entry is now returned. Thus, no information field with the name "Root" can appear in a BIOSEQ entry. (Ok, it can appear there, but there's no way to access the information from it.)


Changes from Version 1.0 to Version 1.1

New Formats/Porting and Format Changes

Added PHYLIP Interleaved and Sequential file formats

Added the Clustalw file format

Added FASTA/TFASTA/SSEARCH/LFASTA/LALIGN/ALIGN program output format

Reimplemented the NBRF format, now that I found out where the documentation was.

Ported it to Windows NT/95
Successfully compiled it on Solaris
Successfully compiled it using g++

New Programs

Added fmtseq, the file format conversion program

Added keyword, a program to search for keyword/motif matches

Data Structure Changes

Added fields `mainid' and `mainacc' to the SEQINFO structure
So, now the identifiers in an entry are split up into these two fields plus `idlist'. The `mainid' field gets the main identifier, the `mainacc' field gets the main accession number, and `idlist' gets all of the other identifiers.

New Functions

char *seqfrawseq(SEQFILE *sfp, int *length_out, int newbuffer)
Returns the "raw" sequence given in the entry, which includes any alignment or structural notation characters in addition to the sequence itself. Typically, `seqfsequence' extracts only the alphabetic characters, whereas `seqfrawseq' extracts all characters except whitespace and digits. See "format.doc" for the full details.
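The typical difference between the two extraction rules can be sketched with two small filters. These are hypothetical stand-ins (`keep_alpha' and `keep_raw' are not SEQIO functions) that apply only the general rule stated above; the per-format details in "format.doc" are not modeled.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the typical `seqfsequence' rule:
 * keep only alphabetic characters.  Returns the number of characters
 * written to out (excluding the terminating NUL). */
size_t keep_alpha(const char *text, char *out, size_t outlen)
{
    size_t n = 0;

    if (out == NULL || outlen == 0)
        return 0;
    for (; *text != '\0' && n + 1 < outlen; text++) {
        if (isalpha((unsigned char)*text))
            out[n++] = *text;
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical stand-in for the typical `seqfrawseq' rule:
 * keep everything except whitespace and digits, so alignment and
 * notation characters such as '-', '.' and '*' survive. */
size_t keep_raw(const char *text, char *out, size_t outlen)
{
    size_t n = 0;

    if (out == NULL || outlen == 0)
        return 0;
    for (; *text != '\0' && n + 1 < outlen; text++) {
        unsigned char c = (unsigned char)*text;
        if (!isspace(c) && !isdigit(c))
            out[n++] = *text;
    }
    out[n] = '\0';
    return n;
}
```

For the two text lines "  1 ACGT-ACGT 8" and "  9 AC..GT*", the first filter yields "ACGTACGTACGT" and the second yields "ACGT-ACGTAC..GT*".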
char *seqfmainid(SEQFILE *sfp, int newbuffer)
char *seqfmainacc(SEQFILE *sfp, int newbuffer)
Access functions for the new information fields `mainid' and `mainacc'.
void seqfsetidpref(SEQFILE *sfp, char *idprefix)
void seqfsetdbname(SEQFILE *sfp, char *dbname)
void seqfsetalpha(SEQFILE *sfp, char *alphabet)
Sets the identifier prefix, database name and sequence alphabet for the sequences read in using the given SEQFILE structure.
int seqfisaformat(char *format)
Tests a format string to see if it's a supported file format.
int seqffmttype(char *format)
Returns a type information value about the given format (see "format.doc" for the details about the format types).
int seqfcanwrite(char *format)
Can the package output entries in that format?
int seqfcanannotate(char *format)
Can the package annotate entries in that format?
int bioseq_check(char *dbspec)
Does the database search specification refer to a known database? Is there a BIOSEQ entry for it?
int seqfsetpretty(SEQFILE *sfp, int value)
When outputting entries in the Plain, FASTA, NBRF or IG/Stanford formats, this specifies whether to add spaces to make the sequence look prettier or not.

By default, the output operations look at the sequence being output, and only add spaces when the sequence is DNA, RNA or Protein and when there are no non-alphabetic characters in the sequence.
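The default rule described above can be sketched roughly as follows. Both functions are hypothetical (`wants_spacing' and `add_spaces' are not part of the package), the caller is assumed to already know whether the alphabet is DNA, RNA or Protein, and the group width is a free parameter because the width the package actually uses is not stated here.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical check mirroring the default rule: add spaces only when
 * the alphabet is a known one (DNA, RNA or Protein, decided by the
 * caller) and the sequence contains no non-alphabetic characters.
 * Treating an empty sequence as "no spacing" is a choice of this sketch. */
int wants_spacing(const char *seq, int known_alphabet)
{
    if (!known_alphabet || *seq == '\0')
        return 0;
    for (; *seq != '\0'; seq++) {
        if (!isalpha((unsigned char)*seq))
            return 0;
    }
    return 1;
}

/* Copies seq into out, inserting one space after every `group'
 * characters.  Stops early if out is too small.  Returns the number of
 * characters written (excluding the terminating NUL). */
size_t add_spaces(const char *seq, size_t group, char *out, size_t outlen)
{
    size_t i, n = 0;

    if (out == NULL || outlen == 0)
        return 0;
    for (i = 0; seq[i] != '\0'; i++) {
        int need_space = (group > 0 && i > 0 && i % group == 0);

        /* leave room for the character(s) plus the terminating NUL */
        if (n + (need_space ? 2 : 1) >= outlen)
            break;
        if (need_space)
            out[n++] = ' ';
        out[n++] = seq[i];
    }
    out[n] = '\0';
    return n;
}
```

A caller would apply `add_spaces' only when `wants_spacing' returns nonzero, which leaves aligned sequences containing gap characters untouched.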

Minor Changes


James R. Knight, knight@cs.ucdavis.edu
July 9, 1996