DOMAINALIGN documentation
CONTENTS
1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES
1.0 SUMMARY
Generates DAF files (domain alignment files) of structure-based sequence alignments for nodes in a DCF file (domain classification file).
2.0 INPUTS & OUTPUTS
DOMAINALIGN reads a DCF file (domain classification file) and generates a structure-based sequence alignment annotated with domain classification data ('domain alignment file') for each user-defined node (e.g. family or superfamily) in the DCF file in turn. If the STAMP algorithm is used, structural superimpositions are also generated and saved to file (PDB format). The alignments are calculated by using stamp or TCOFFEE and these applications must be installed on the system that is running DOMAINALIGN (see 'Notes' below).
Clearly no alignment can be generated for nodes with a single entry (domain) only: sequences for such domains are (optionally) written to file (fasta format).
DOMAINALIGN requires a directory of domain PDB files; the path and extension of these must be set by the user (via the ACD file) and also specified in the stamp "pdb.directories" file (see 'Notes' below).
A log file of diagnostic messages is written. The identifiers (e.g. SCOP Sunid) of the nodes from the DCF file are used to name the output files. The user also specifies the input file, the paths for the two types of alignment files (output), the path for singlet sequence files (if output) and the name of the log file.
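The split between nodes that can be aligned and singlet nodes can be sketched as follows. This is a minimal illustration and not DOMAINALIGN's own code: the (domain identifier, node identifier) pairs, the function name split_nodes and the identifiers in the example are all assumptions made for this sketch.

```python
from collections import defaultdict


def split_nodes(entries):
    """Group domain identifiers by node and separate out singlet nodes.

    `entries` is an iterable of (domain_id, node_id) pairs. This layout is
    hypothetical and stands in for records read from a DCF file.
    """
    by_node = defaultdict(list)
    for domain_id, node_id in entries:
        by_node[node_id].append(domain_id)

    # Nodes with two or more domains can be aligned.
    alignable = {n: d for n, d in by_node.items() if len(d) > 1}
    # Nodes with exactly one domain are singlets: no alignment is possible.
    singlets = {n: d[0] for n, d in by_node.items() if len(d) == 1}
    return alignable, singlets


# Invented identifiers, for illustration only.
entries = [
    ("d1abca_", 10001),
    ("d2xyzb_", 10001),
    ("d3pqra_", 10002),
]
alignable, singlets = split_nodes(entries)
```

Here node 10001 would be passed to the alignment program, while the sequence of the single domain in node 10002 would (optionally) be written to file.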
3.0 INPUT FILE FORMAT
The format of the domain classification file is described in scopparse.c
4.0 OUTPUT FILE FORMAT
Structure-based sequence alignment
This file (Figure 1) is in EMBOSS "simple" multiple sequence alignment format.
This is similar to the output file generated by stamp when issued with the following three types of command:
(1) stamp -l ./stamps_file.dom -s -n 2 -slide 5 -prefix ./stamps_file -d ./stamps_file.set;
    sorttrans -f ./stamps_file.scan -s Sc 2.5 > ./stamps_file.sort;
    stamp -l ./stamps_file.sort -prefix ./stamps_file > ./stamps_file.log
(2) poststamp -f ./stamps_file.3 -min 0.5
(3) ver2hor -f ./stamps_file.3.post > ./stamps_file.out
The DOMAINALIGN output file (Figure 1) displays the sequence names,
positions and sequences. The names are the 7 character domain
identifier codes taken from the domain classification file. The
positions are the start and end residue positions of the appropriate
section of sequence. The sequence uses '-' as a gap character. The
domain classification records for the appropriate node from the DCF
file are given above the alignment. The STAMP 'Post similar' line is
given as a markup line underneath the sequence but no dssp assignments
are written. All lines other than sequence lines begin with '#' to
denote a comment.
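Because every non-sequence line begins with '#', a reader of the file can separate annotation from sequence data with a simple prefix test. The sketch below assumes only that rule; the function name split_daf_lines is hypothetical, and the example lines are invented and do not reproduce the real column layout of the file.

```python
def split_daf_lines(text):
    """Separate '#' comment/markup lines from sequence lines."""
    comments, sequences = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            comments.append(line)
        else:
            sequences.append(line)
    return comments, sequences


# Invented content: the real file layout may differ.
example = "\n".join([
    "# hypothetical classification record",
    "d1abca_ ACD-EFG",
    "d2xyzb_ ACDQEFG",
    "# hypothetical markup line",
])
comments, sequences = split_daf_lines(example)
```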
5.0 DATA FILES
DOMAINALIGN does not use any data files of its own, but stamp uses the
"pdb.directories" file, which specifies the permissible prefix, extension and path of
PDB files used by STAMP. On the RFCGR system, this file is
/packages/stamp/defs/pdb.directories and should look like:
test_data/ - .dent
/data/pdb - -
/data/pdb _ .ent
/data/pdb _ .pdb
/data/pdb pdb .ent
/data/pdbscop _ _
/data/pdbscop _ .ent
/data/pdbscop _ .pdb
/data/pdbscop pdb .ent
./ _ _
./ _ .ent
./ _ .ent.z
./ _ .ent.gz
./ _ .pdb
./ _ .pdb.Z
./ _ .pdb.gz
./ pdb .ent
./ pdb .ent.Z
./ pdb .ent.gz
/data/CASS1/pdb/coords/ _ .pdb
/data/CASS1/pdb/coords/ _ .pdb.Z
/data/CASS1/pdb/coords/ _ .pdb.gz
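To make the three-column layout concrete, the following sketch reads lines of this form as (path, prefix, extension) triples and lists the file names that would be tried for a given identifier. It is an illustration under stated assumptions, not stamp's own lookup code: it assumes that "_" stands for an empty prefix or extension and that the fields are simply concatenated around the identifier. The function names and the identifier d1abca_ are hypothetical.

```python
def parse_pdb_directories(text):
    """Return (directory, prefix, extension) triples from pdb.directories text."""
    triples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        triples.append(tuple(fields))
    return triples


def candidate_paths(triples, code):
    """List the file paths that would be tried for an identifier.

    Assumption: '_' means "no prefix" or "no extension".
    """
    paths = []
    for directory, prefix, extension in triples:
        prefix = "" if prefix == "_" else prefix
        extension = "" if extension == "_" else extension
        sep = "" if directory.endswith("/") else "/"
        paths.append(f"{directory}{sep}{prefix}{code}{extension}")
    return paths


text = """/data/pdb _ .ent
/data/pdb pdb .ent
./ _ .pdb
"""
triples = parse_pdb_directories(text)
paths = candidate_paths(triples, "d1abca_")
```

Under these assumptions, a domain PDB file is only found if its directory and extension match one of the listed lines, which is why the path and extension given to DOMAINALIGN must also appear in this file.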
6.0 USAGE
6.1 COMMAND LINE ARGUMENTS
Standard (Mandatory) qualifiers (* if not always prompted):
[-dcfinfile] infile This option specifies the name of DCF file
(domain classification file) (input). A
'domain classification file' contains
classification and other data for domains
from SCOP or CATH, in DCF format
(EMBL-like). The files are generated by
using SCOPPARSE and CATHPARSE. Domain
sequence information can be added to the
file by using DOMAINSEQS.
[-pdbdir] directory This option specifies the location of domain
PDB files (input). A 'domain PDB file'
contains coordinate data for a single domain
from SCOP or CATH, in PDB format. The files
are generated by using DOMAINER.
-node menu This option specifies the node for
redundancy removal. Redundancy can be
removed at any specified node in the SCOP or
CATH hierarchies. For example by selecting
'Class' entries belonging to the same Class
will be non-redundant.
-mode menu This option specifies the alignment
algorithm to use.
-[no]keepsinglets toggle This option specifies whether to write
sequences of singlet families to file. If
you specify this option, the sequence for
each singlet family are written to file
(output).
* -singletsoutdir outdir This option specifies the location of DHF
files (domain hits files) for singlet
sequences (output). The singlets are written
out as a 'domain hits file' - which
contains database hits (sequences) with
domain classification information, in FASTA
format.
[-dafoutdir] outdir This option specifies the location of DAF
files (domain alignment files) (output). A
'domain alignment file' contains a sequence
alignment of domains belonging to the same
SCOP or CATH family. The files are in
clustal format and are annotated with domain
family classification information. The
files generated by using SCOPALIGN will
contain a structure-based sequence alignment
of domains of known structure only. Such
alignments can be extended with sequence
relatives (of unknown structure) by using
SEQALIGN.
* -superoutdir outdir This option specifies the location of
structural superimposition files (output). A
file in PDB format of the structural
superimposition is generated for each family
if the STAMP algorithm is used.
-logfile outfile This option specifies the name of log file
(output). The log file contains messages
about any errors arising while domainalign
ran.
Additional (Optional) qualifiers: (none)
Advanced (Unprompted) qualifiers: (none)
Associated qualifiers:
"-logfile" associated qualifiers
-odirectory string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write standard output
-filter boolean Read standard input, write standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report deaths
Standard (Mandatory) qualifiers | Description | Allowed values | Default |
[-dcfinfile] (Parameter 1) | This option specifies the name of DCF file (domain classification file) (input). A 'domain classification file' contains classification and other data for domains from SCOP or CATH, in DCF format (EMBL-like). The files are generated by using SCOPPARSE and CATHPARSE. Domain sequence information can be added to the file by using DOMAINSEQS. | Input file | Required |
[-pdbdir] (Parameter 2) | This option specifies the location of domain PDB files (input). A 'domain PDB file' contains coordinate data for a single domain from SCOP or CATH, in PDB format. The files are generated by using DOMAINER. | Directory | ./ |
-node | This option specifies the node for redundancy removal. Redundancy can be removed at any specified node in the SCOP or CATH hierarchies. For example by selecting 'Class' entries belonging to the same Class will be non-redundant. | 1 (Class (SCOP)), 2 (Fold (SCOP)), 3 (Superfamily (SCOP)), 4 (Family (SCOP)), 5 (Class (CATH)), 6 (Architecture (CATH)), 7 (Topology (CATH)), 8 (Homologous Superfamily (CATH)), 9 (Family (CATH)) | 1 |
-mode | This option specifies the alignment algorithm to use. | | 1 |
-[no]keepsinglets | This option specifies whether to write sequences of singlet families to file. If you specify this option, the sequence for each singlet family are written to file (output). | Toggle value Yes/No | Yes |
-singletsoutdir | This option specifies the location of DHF files (domain hits files) for singlet sequences (output). The singlets are written out as a 'domain hits file' - which contains database hits (sequences) with domain classification information, in FASTA format. | Output directory | |
[-dafoutdir] (Parameter 3) | This option specifies the location of DAF files (domain alignment files) (output). A 'domain alignment file' contains a sequence alignment of domains belonging to the same SCOP or CATH family. The files are in clustal format and are annotated with domain family classification information. The files generated by using SCOPALIGN will contain a structure-based sequence alignment of domains of known structure only. Such alignments can be extended with sequence relatives (of unknown structure) by using SEQALIGN. | Output directory | ./ |
-superoutdir | This option specifies the location of structural superimposition files (output). A file in PDB format of the structural superimposition is generated for each family if the STAMP algorithm is used. | Output directory | ./ |
-logfile | This option specifies the name of log file (output). The log file contains messages about any errors arising while domainalign ran. | Output file | domainalign.log |
Additional (Optional) qualifiers | Description | Allowed values | Default |
(none) |
Advanced (Unprompted) qualifiers | Description | Allowed values | Default |
(none) |
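The qualifiers listed in 6.1 can be combined into a single non-interactive command line. The sketch below only assembles the argument list; it does not run anything. The function name, the file and directory names, and the choice to pass -superoutdir unconditionally are assumptions made for illustration. The -node and -mode values are passed through as menu numbers; this document does not say which algorithm each -mode value selects.

```python
import shlex


def build_domainalign_command(dcf_file, pdb_dir, daf_dir, node="4", mode="1",
                              keep_singlets=True, singlets_dir="./",
                              super_dir="./", log_file="domainalign.log"):
    """Assemble a domainalign argument list from the documented qualifiers."""
    args = [
        "domainalign",
        "-dcfinfile", dcf_file,
        "-pdbdir", pdb_dir,
        "-node", node,
        "-mode", mode,
        "-keepsinglets" if keep_singlets else "-nokeepsinglets",
    ]
    if keep_singlets:
        # -singletsoutdir is only relevant when singlets are kept.
        args += ["-singletsoutdir", singlets_dir]
    args += [
        "-dafoutdir", daf_dir,
        "-superoutdir", super_dir,
        "-logfile", log_file,
        "-auto",  # general qualifier: turn off prompts
    ]
    return args


# Hypothetical file and directory names.
cmd = build_domainalign_command("all.scop", "./domains", "./daf")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once stamp (or TCOFFEE) and the input files are in place.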
6.2 EXAMPLE SESSION
An example of interactive use of DOMAINALIGN is shown below.
7.0 KNOWN BUGS & WARNINGS
1. Use of stamp
DOMAINALIGN requires a modified version of stamp (see Notes below).
The modified stamp application must be installed on the system that is running DOMAINALIGN.
When running DOMAINALIGN at the RFCGR, to ensure the modified version of stamp is used,
type 'use stamp2' (which runs the script /packages/menu/USE/stamp2) before DOMAINALIGN is run.
2. Strange stamp behaviour
stamp will ignore (omit from the alignment and *not* replace with '-' or
any other symbol) ANY residues or groups in a PDB file that
(i) are not structured (i.e. do not appear in the ATOM records) or
(ii) lack a CA atom, regardless of whether it is a known amino acid or not.
This means that the position (column) in the alignment cannot reliably be
used as the basis for an index into arrays representing the full length
sequences.
stamp will however include in the alignment residues with a single atom
only, so long as it is the CA atom.
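To check in advance which residues would be omitted under rule (ii), the ATOM records of a PDB file can be scanned for residues that have no CA atom. The sketch below is not part of stamp or DOMAINALIGN; the function name is hypothetical and the coordinates in the example are invented. It relies only on the fixed column positions of the PDB ATOM record.

```python
def residues_without_ca(pdb_text):
    """Return residues present in ATOM records that have no CA atom.

    Each residue is identified by (chain, residue number, insertion code,
    residue name), read from the fixed columns of the PDB format.
    """
    atoms_by_residue = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        key = (line[21:22], line[22:26].strip(),
               line[26:27].strip(), line[17:20].strip())
        atoms_by_residue.setdefault(key, set()).add(atom_name)
    return [key for key, atoms in atoms_by_residue.items() if "CA" not in atoms]


# Invented records: residue 2 has no CA atom, residue 3 has only a CA atom.
pdb_text = "\n".join([
    "ATOM      1  N   ALA A   1      11.104  13.207   2.100",
    "ATOM      2  CA  ALA A   1      12.560  13.329   2.251",
    "ATOM      3  N   GLY A   2      13.100  14.500   2.900",
    "ATOM      4  CA  SER A   3      16.000  15.800   4.100",
])
missing = residues_without_ca(pdb_text)
```

In this example only residue 2 is reported; residue 3, which has a CA atom and nothing else, would still appear in the alignment.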
3. Handling of singlet nodes
No sequence alignment or structural superimposition files are generated for nodes that contain a single domain only. Sequences for such domains can be saved to file (see 2.0 INPUTS & OUTPUTS).
4. Alignment numbering
Residue numbering in the alignment is not implemented (zeros are given).
8.0 NOTES
1. Adaptation of STAMP for domain codes
DOMAINALIGN will only run with a version of stamp which has been modified
so that PDB id codes of length greater than 4 characters are acceptable.
This involves a trivial change to the stamp module getdomain.c (around line
number 155), where a 4 must be changed to a 7. The original line:
temp=getfile(domain[0].id,dirfile,4,OUTPUT);
becomes:
temp=getfile(domain[0].id,dirfile,7,OUTPUT);
2. Adaptation of STAMP for larger datasets
STAMP fails to align a large dataset of all the available V set Ig
domains. The ver2hor module generates the following error:
Transforming coordinates...
...done.
ver2hor -f ./domainalign-1022069396.11280.76.post > ./domainalign-1022069396.11280.out
error: something wrong with STAMP file
STAMP length is 370, Alignment length is 422
STAMP nseq is 155, Alignment nseq is 155
This is fixed by the following change in alignfit.h. The original line:
#define MAXtlen 200
becomes:
#define MAXtlen 2000
At the same time the following may be changed as a safety measure:
gstamp.c : #define MAX_SEQ_LEN 10000 (was 2000)
pdbseq.c : #define MAX_SEQ_LEN 10000 (was 3000)
defaults.h: #define MAX_SEQ_LEN 10000 (was 8000)
defaults.h: #define MAX_NSEQ 10000 (was 1000)
defaults.h: #define MAX_BLOC_SEQ 5000 (was 500)
dstamp.h : #define MAX_N_SEQ 10000 (was 1000)
ver2hor.h : #define MAX_N_SEQ 10000 (was 1000)
The modified code (for 1. and 2. above) is kept on the HGMP file system in /packages/stamp/src2.
WHEN RUNNING DOMAINALIGN AT THE HGMP IT IS ESSENTIAL THAT THE COMMAND
'use stamp2' (which runs the script /packages/menu/USE/stamp2) IS GIVEN
BEFORE DOMAINALIGN IS RUN.
This will ensure that the modified version of stamp is used.
3. pdb.directories file
stamp (and therefore DOMAINALIGN) uses a "pdb.directories" file: see 5.0 DATA FILES
4. Choice of alignment algorithm
Future versions of DOMAINALIGN will implement a larger choice of alignment algorithms.
5. Getting the best alignment
DOMAINALIGN will produce better alignments if the DCF file is reordered so that the representative structure of each node (e.g. family) is given first. This is achieved by using DOMAINREP.
6. Whitespace in alignment
STAMP can insert nonsensical whitespace into its alignments, e.g. instead of a residue character where that residue was missing electron density in the PDB file. DOMAINALIGN replaces each whitespace character within a STAMP alignment with an "X".
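The replacement described in note 6 amounts to a character substitution on each aligned sequence. A minimal sketch, assuming the alignment is held as a mapping from domain identifier to aligned sequence string (the function name and the data are hypothetical):

```python
def replace_alignment_spaces(aligned_sequences):
    """Replace each space inside an aligned sequence string with 'X'."""
    return {name: seq.replace(" ", "X")
            for name, seq in aligned_sequences.items()}


# Invented alignment rows; the space stands for a residue STAMP left blank.
aligned = {"d1abca_": "AC EFG", "d2xyzb_": "ACD-FG"}
cleaned = replace_alignment_spaces(aligned)
```

Gap characters ('-') are left untouched; only spaces are replaced, so the alignment length is preserved.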
8.1 GLOSSARY OF FILE TYPES
FILE TYPE | FORMAT | DESCRIPTION | CREATED BY | SEE ALSO |
Domain classification file (for SCOP) | DCF format (EMBL-like format for domain classification data). | Classification and other data for domains from SCOP. | SCOPPARSE | Domain sequence information can be added to the file by using DOMAINSEQS. |
Domain classification file (for CATH) | DCF format (EMBL-like format for domain classification data). | Classification and other data for domains from CATH. | CATHPARSE | Domain sequence information can be added to the file by using DOMAINSEQS. |
Domain PDB file | PDB format for domain coordinate data. | Coordinate data for a single domain from SCOP or CATH. | DOMAINER | N.A. |
Domain alignment file | DAF format (clustal format with domain classification information). | Contains a sequence alignment of domains belonging to the same SCOP or CATH family. The file is annotated with domain family classification information. | DOMAINALIGN (structure-based sequence alignment of domains of known structure). | DOMAINALIGN alignments can be extended with sequence relatives (of unknown structure) to the family in question by using SEQALIGN. |
9.0 DESCRIPTION
The generation of alignments for large datasets such as SCOP and CATH potentially requires a lot of time for preparation of datasets, writing of scripts, running individual jobs and so on, in addition to the compute time required for the alignments themselves. DOMAINALIGN automates this process: it reads a domain classification file and generates alignments for each user-specified node in turn.
10.0 ALGORITHM
More information on stamp can be found at
http://www.compbio.dundee.ac.uk/manuals/stamp.4.2
More information on TCOFFEE can be found at http://www.ch.embnet.org/software/TCoffee.html
11.0 RELATED APPLICATIONS
Program name | Description |
contactcount | Counts specific versus non-specific contacts in a directory of cleaned protein chain contact files |
contacts | Reads CCF files (clean coordinate files) and writes CON files (contact files) of intra-chain residue-residue contact data |
domainrep | Reorder DCF file (domain classification file) so that the representative structure of each user-specified node is given first |
domainreso | Removes low resolution domains from a DCF file (domain classification file) |
interface | Reads CCF files (clean coordinate files) and writes CON files (contact files) of inter-chain residue-residue contact data |
libgen | Generates various types of discriminating elements for each alignment in a directory |
psiphi | Calculates phi and psi torsion angles from cleaned EMBOSS-style protein co-ordinate file |
rocon | Reads a DHF file (domain hits file) of hits (sequences of unknown structural classification) and a DHF file of validation sequences (known classification) and writes a 'hits file' for the hits, which are classified and rank-ordered on the basis of score |
rocplot | Provides interpretation and graphical display of the performance of discriminating elements (e.g. profiles for protein families). rocplot reads file(s) of hits from discriminator-database search(es), performs ROC analysis on the hits, and writes graphs illustrating the diagnostic performance of the discriminating elements |
seqalign | Reads a DAF file (domain alignment file) and a DHF file (domain hits file) and writes a DAF file extended with the hits |
seqfraggle | Removes fragments from DHF files (domain hits files) or other files of sequences |
seqsearch | Generate database hits (sequences) for nodes in a DCF file (domain classification file) by using PSI-BLAST |
seqsort | Reads DHF files (domain hits files) of database hits (sequences) and removes hits of ambiguous classification |
seqwords | Generates DHF files (domain hits files) of database hits (sequences) for nodes in a DCF file (domain classification file) by keyword search of UniProt |
siggen | Generates a sparse protein signature from an alignment and residue contact data |
sigscan | Generates a DHF file (domain hits file) of hits (sequences) from scanning a signature against a sequence database |
12.0 DIAGNOSTIC ERROR MESSAGES
The following message may appear in the log file.
Replaced ' ' in STAMP alignment with 'X' (STAMP can insert nonsensical whitespace into its alignments, e.g. instead of a residue character where that residue was missing electron density in the PDB file. DOMAINALIGN replaces each whitespace character within a STAMP alignment with an "X").
13.0 AUTHORS
Ranjeeva Ranasinghe (rranasin@rfcgr.mrc.ac.uk)
Jon Ison (jison@rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK
14.0 REFERENCES
Please cite the authors and EMBOSS.
Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
Molecular Biology Open Software Suite" Trends in Genetics,
16:276-277.
See also http://emboss.sourceforge.net/
14.1 Other useful references
Russell, R. B. & Barton, G. J. (1992), Multiple Sequence Alignment from Tertiary Structure Comparison: Assignment of Global and Residue Confidence Levels, PROTEINS: Struct. Funct. Genet., 14, 309-323.
C. Notredame, D. Higgins, J. Heringa. T-Coffee: A novel method for multiple sequence alignments. Journal of Molecular Biology, 302, 205-217, (2000)
More information on stamp can be found at http://www.compbio.dundee.ac.uk/manuals/stamp.4.2/
More information on TCOFFEE can be found at http://www.ch.embnet.org/software/TCoffee.html