pdbtosp

 

Function

Convert raw swissprot:PDB equivalence file to EMBL-like format

Description

This program is part of a suite of EMBOSS applications that directly or indirectly make use of the protein structure databases pdb and scop. This program is part of an experimental analysis pipeline described in an accompanying document. We provide the software in the hope that it will be useful. The applications were designed for specific research purposes and may not be useful or reliable in contexts other than the described pipeline. The development of the suite was coordinated by Jon Ison to whom enquiries and bug reports should be sent (email jison@hgmp.mrc.ac.uk).

Some research applications require knowledge of the database sequence(s) that correspond to the sequence(s) given in a pdb file. A 'swissprot:pdb equivalence' file listing accession numbers and swissprot database identifiers for certain pdb codes is available, but it is not in a format consistent with the flat file formats used for protein structural data in emboss. pdbtosp parses the swissprot:pdb equivalence file in its raw format and converts it to an embl-like format.

pdbtosp parses the swissprot:pdb equivalence table available at http://www.expasy.ch/cgi-bin/lists?pdbtosp.txt and writes the data out to a file in embl-like format (Figure 2). The raw (input) file can be obtained by doing a "save as ... text format" from that web page. No changes are made to the data other than the format in which it is held. The input and output files are specified by the user.

Algorithm

pdbtosp relies on finding a line beginning with '____ _' in the input file; all lines up to and including this one are ignored. The data lines that follow, one per pdb code, are then parsed up to the first blank line.
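
The parsing rule above can be illustrated with a minimal Python sketch. It is not the pdbtosp source code: the function names parse_raw and to_embl_like are hypothetical, the ruler line is detected simply by a leading run of four underscores, and the regular expression assumes that every entry on a data line has the form "NAME (ACCESSION)". Data lines that wrap onto continuation lines, if the raw file has any, are not handled.

```python
import re

# Assumption: each swissprot entry on a data line looks like "NAME (ACCESSION)".
PAIR_RE = re.compile(r"(\w+)\s+\((\w+)\)")


def parse_raw(lines):
    """Return a list of (pdb_code, [(entry_name, accession), ...]) tuples."""
    records = []
    in_data = False
    for line in lines:
        line = line.rstrip("\n")
        if not in_data:
            # Ignore everything up to and including the underscore ruler line.
            if line.startswith("____"):
                in_data = True
            continue
        if not line.strip():
            break  # the first blank line ends the data block
        left, _, right = line.partition(":")
        pdb_code = left.split()[0]  # the revision date that follows is dropped
        records.append((pdb_code, PAIR_RE.findall(right)))
    return records


def to_embl_like(records):
    """Format parsed records as EN / NE / IN entries terminated by '//'."""
    out = []
    for pdb_code, pairs in records:
        out += [f"EN   {pdb_code}", "XX", f"NE   {len(pairs)}", "XX"]
        out += [f"IN   {name} ID; {acc} ACC;" for name, acc in pairs]
        out += ["XX", "//"]
    return "\n".join(out) + "\n"


sample = """PDB   Last revision
code  date           SWISS-PROT entry name(s)
____  ___________    __________________________________________
101M  (08-APR-98)  : MYG_PHYCA   (P02185)
102L  (31-OCT-93)  : LYCV_BPT4   (P00720)

text after the first blank line is ignored
"""

records = parse_raw(sample.splitlines())
print(to_embl_like(records), end="")
```

Keeping the parse and format steps separate mirrors the statement that only the format changes: the parsed values are written out untouched.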

Usage

Here is a sample session with pdbtosp


% pdbtosp 
Convert raw swissprot:PDB equivalence file to EMBL-like format
Name of raw swissprot:PDB equivalence file (input): pdbtosp.txt
Name of swissprot:PDB equivalence file (EMBL-like format output) [Epdbtosp.dat]: 


Command line arguments

   Standard (Mandatory) qualifiers:
  [-infile]            infile     This option specifies the name of raw
                                  swissprot:PDB equivalence file (input).
                                  HETPARSE parses this file, which is
                                  available at URL
                                  http://www.expasy.ch/cgi-bin/lists?pdbtosp.txt
  [-outfile]           outfile    This option specifies the name of
                                  swissprot:PDB equivalence file (EMBL-like
                                  format). This is the PDBTOSP output file.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths
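
For use inside a script, the qualifiers listed above can be combined into a non-interactive call. The following Python sketch is one way to do that, assuming pdbtosp is installed and on the PATH. The helper names build_command and run_pdbtosp are hypothetical; only the -infile, -outfile and -auto qualifiers from the listing above are used.

```python
import shutil
import subprocess


def build_command(infile, outfile="Epdbtosp.dat"):
    """Build the argument list; -auto turns off the interactive prompts."""
    return ["pdbtosp", "-infile", infile, "-outfile", outfile, "-auto"]


def run_pdbtosp(infile, outfile="Epdbtosp.dat"):
    """Run pdbtosp and return the output file name (requires EMBOSS)."""
    if shutil.which("pdbtosp") is None:
        raise FileNotFoundError("pdbtosp was not found on PATH")
    subprocess.run(build_command(infile, outfile), check=True)
    return outfile


# Only the command line is printed here; call run_pdbtosp() to execute it.
print(" ".join(build_command("pdbtosp.txt")))
```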


Standard (Mandatory) qualifiers

  [-infile] (Parameter 1)
      Description:     This option specifies the name of the raw swissprot:PDB equivalence file (input). pdbtosp parses this file, which is available at URL http://www.expasy.ch/cgi-bin/lists?pdbtosp.txt
      Allowed values:  Input file
      Default:         Required

  [-outfile] (Parameter 2)
      Description:     This option specifies the name of the swissprot:PDB equivalence file (EMBL-like format). This is the pdbtosp output file.
      Allowed values:  Output file
      Default:         Epdbtosp.dat

Additional (Optional) qualifiers: (none)

Advanced (Unprompted) qualifiers: (none)

Input file format

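pdbtosp reads the raw swissprot:PDB equivalence file as plain text, as saved from the web page. Only the data lines between the underscore ruler line and the first blank line are used.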

Input files for usage example

File: pdbtosp.txt

  ------------------------------------------------------------------------
   ExPASy Home page   Site Map    Search ExPASy   Contact us    SWISS-PROT

 Hosted by SIB       Mirror                                        USA[new]
 Switzerland         sites:      AustraliaCanada China Korea Taiwan
  ------------------------------------------------------------------------

----------------------------------------------------------------------------
        SWISS-PROT Protein Knowledgebase
        Swiss Institute of Bioinformatics (SIB); Geneva, Switzerland
        European Bioinformatics Institute (EBI); Hinxton, United Kingdom
----------------------------------------------------------------------------

Description: Index of Protein Data Bank (PDB) entries referenced in
             SWISS-PROT
Name:        PDBTOSP.TXT
Release:     40.9 of 31-Jan-2002

----------------------------------------------------------------------------

The PDB database is available at the following URL:

USA:  http://www.rcsb.org/pdb/
EBI:  http://www2.ebi.ac.uk/pdb/

 - Number of PDB entries referenced in SWISS-PROT: 9901
 - Number of SWISS-PROT entries with one or more pointers to PDB: 3260

PDB   Last revision
code  date           SWISS-PROT entry name(s)
____  ___________    __________________________________________
101M  (08-APR-98)  : MYG_PHYCA   (P02185)
102L  (31-OCT-93)  : LYCV_BPT4   (P00720)
102M  (08-APR-98)  : MYG_PHYCA   (P02185)
103L  (31-OCT-93)  : LYCV_BPT4   (P00720)
103M  (08-APR-98)  : MYG_PHYCA   (P02185)
9XIA  (15-JUL-92)  : XYLA_STRRU  (P24300)
9XIM  (15-JUL-93)  : XYLA_ACTMI  (P12851)

----------------------------------------------------------------------------
SWISS-PROT is copyright.  It is produced through a collaboration between the
Swiss Institute  of  Bioinformatics   and the EMBL Outstation - the European
Bioinformatics Institute. There are no restrictions on its use by non-profit
institutions as long as its  content is in no way modified. Usage by and for
commercial entities requires a license agreement.  For information about the
licensing  scheme  see: http://www.isb-sib.ch/announce/ or send  an email to
license@isb-sib.ch.
----------------------------------------------------------------------------

  ------------------------------------------------------------------------
   ExPASy Home page   Site Map    Search ExPASy   Contact us    SWISS-PROT

 Hosted by SIB       Mirror                                        USA[new]
 Switzerland         sites:      AustraliaCanada China Korea Taiwan
  ------------------------------------------------------------------------

Output file format

Output files for usage example

File: Epdbtosp.dat

EN   101M
XX
NE   1
XX
IN   MYG_PHYCA ID; P02185 ACC;
XX
//
EN   102L
XX
NE   1
XX
IN   LYCV_BPT4 ID; P00720 ACC;
XX
//
EN   102M
XX
NE   1
XX
IN   MYG_PHYCA ID; P02185 ACC;
XX
//
EN   103L
XX
NE   1
XX
IN   LYCV_BPT4 ID; P00720 ACC;
XX
//
EN   103M
XX
NE   1
XX
IN   MYG_PHYCA ID; P02185 ACC;
XX
//
EN   9XIA
XX
NE   1
XX
IN   XYLA_STRRU ID; P24300 ACC;
XX
//
EN   9XIM
XX
NE   1
XX
IN   XYLA_ACTMI ID; P12851 ACC;
XX
//

An excerpt from the swissprot:pdb equivalence file in embl-like format is shown below. The records used to describe an entry are as follows.

(1) EN - PDB identifier code. This is the 4-character PDB identifier code.

(2) NE - Number of entries. This is the number of accession numbers that were given for that pdb code in the equivalence file.

(3) IN - Code line. The swissprot database identifier code and accession number are given immediately before the tokens ID and ACC respectively. There is one IN line per swissprot entry.

Excerpt from embl-like format swissprot:pdb equivalence file

EN   3SDH
XX
NE   2
XX
IN   LEU3_THETH ID; P00351 ACC;
IN   LEU4_THEFF ID; P02351 ACC;
XX
//
EN   2SDH
XX
NE   1
XX
IN   LEU1_FDFTH ID; P11351 ACC;
XX
//     
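
As an illustration of the record layout described above, the following Python sketch reads the embl-like format back into a dictionary keyed by pdb code. It is a sketch only, not part of emboss: the function name read_embl_like is hypothetical, and the check that NE matches the number of IN lines is an added sanity test rather than documented behaviour.

```python
def read_embl_like(text):
    """Map each pdb code to its list of (identifier, accession) pairs."""
    table = {}
    code, pairs, expected = None, [], None
    for line in text.splitlines():
        tag, value = line[:2], line[2:].strip()
        if tag == "EN":
            code, pairs, expected = value, [], None
        elif tag == "NE":
            expected = int(value)
        elif tag == "IN":
            # "LEU3_THETH ID; P00351 ACC;" -> ("LEU3_THETH", "P00351")
            fields = value.replace(";", "").split()
            pairs.append((fields[0], fields[2]))
        elif tag == "//" and code is not None:
            if expected is not None and expected != len(pairs):
                raise ValueError(f"{code}: NE is {expected}, found {len(pairs)} IN lines")
            table[code] = pairs
            code = None
        # XX spacer lines carry no data and are skipped.
    return table


excerpt = """EN   3SDH
XX
NE   2
XX
IN   LEU3_THETH ID; P00351 ACC;
IN   LEU4_THEFF ID; P02351 ACC;
XX
//
EN   2SDH
XX
NE   1
XX
IN   LEU1_FDFTH ID; P11351 ACC;
XX
//
"""

table = read_embl_like(excerpt)
print(table["3SDH"])
```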

Data files

None.

Notes

pdbtosp is used to create the emboss data file Epdbtosp.dat that is included in the emboss distribution.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name    Description
aaindexextract  Extract data from AAINDEX
allversusall    Does an all-versus-all global alignment for each set of sequences in an input directory and writes files of sequence similarity values
cathparse       Reads raw CATH classification files and writes DCF file (domain classification file)
cutgextract     Extract data from CUTG
domainer        Reads CCF files (clean coordinate files) for proteins and writes CCF files for domains, taken from a DCF file (domain classification file)
domainnr        Removes redundant domains from a DCF file (domain classification file). The file must contain domain sequence information, which can be added by using DOMAINSEQS
domainseqs      Adds sequence records to a DCF file (domain classification file)
domainsse       Adds secondary structure records to a DCF file (domain classification file)
hetparse        Converts raw dictionary of heterogen groups to a file in EMBL-like format
pdbparse        Parses PDB files and writes CCF files (clean coordinate files) for proteins
pdbplus         Add residue solvent accessibility and secondary structure data to a CCF file (clean coordinate file) for a protein or domain
printsextract   Extract data from PRINTS
prosextract     Builds the PROSITE motif database for patmatmotifs to search
rebaseextract   Extract data from REBASE
scopparse       Reads raw SCOP classification files and writes a DCF file (domain classification file)
seqnr           Removes redundancy from DHF files (domain hits files) or other files of sequences
sites           Reads CCF files (clean coordinate files) and writes CON files (contact files) of residue-ligand contact data for domains in a DCF file (domain classification file)
ssematch        Searches a DCF file (domain classification file) for secondary structure matches
tfextract       Extract data from TRANSFAC

scopseqs uses the pdbtosp output file as input.

Author(s)

Jon Ison (jison@rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

History

Written (2003) - Jon Ison.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.