hetparse

 

Function

Converts raw dictionary of heterogen groups to a file in EMBL-like format

Description

This program is part of a suite of EMBOSS applications that directly or indirectly make use of the protein structure databases pdb and scop. This program is part of an experimental analysis pipeline described in an accompanying document. We provide the software in the hope that it will be useful. The applications were designed for specific research purposes and may not be useful or reliable in contexts other than the described pipeline. The development of the suite was coordinated by Jon Ison to whom enquiries and bug reports should be sent (email jison@hgmp.mrc.ac.uk).

Some research applications require knowledge of the types of heterogens (non-protein groups) that are represented in pdb files. A dictionary of heterogen groups containing various data for all of the heterogens found in pdb is available, but it is not in a format consistent with the flat file formats used for protein structural data in EMBOSS. hetparse parses the dictionary in its raw format and converts it to an EMBL-like format.

hetparse parses the dictionary of heterogen groups available at http://pdb.rutgers.edu/het_dictionary.txt and writes a file containing the group names, synonyms and 3-letter codes in EMBL-like format. Optionally, hetparse will search a directory of pdb files and count the number of files that each heterogen appears in. The path and extension for the pdb files and the names of the input and output files are user-specified.
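The optional counting step can be illustrated with a minimal Python sketch. This is not the hetparse implementation: the function name count_files_per_heterogen, the ".ent" file extension, and the rule that a heterogen "appears in" a file when the file has a HET record naming its code are all assumptions made for illustration. The matching rule hetparse actually uses is not documented here.

```python
from pathlib import Path


def count_files_per_heterogen(pdb_dir, codes, extension=".ent"):
    """Count, for each heterogen code, the files that name it in a HET record.

    Assumption: PDB HET records start with "HET" followed by blanks and
    carry the heterogen identifier in columns 8-10.
    """
    counts = {code: 0 for code in codes}
    for path in sorted(Path(pdb_dir).glob(f"*{extension}")):
        seen = set()
        with open(path, encoding="ascii", errors="replace") as handle:
            for line in handle:
                # "HET   " excludes HETNAM, HETSYN and HETATM records.
                if line.startswith("HET   "):
                    het_id = line[7:10].strip()
                    if het_id in counts:
                        seen.add(het_id)
        # Each file contributes at most once per heterogen.
        for het_id in seen:
            counts[het_id] += 1
    return counts
```

Each file is counted once per heterogen regardless of how many times the code occurs in it, which corresponds to the "number of files" reported in the NN record.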

Algorithm

Contact the author

Usage

Here is a sample session with hetparse


% hetparse 
Converts raw dictionary of heterogen groups to a file in EMBL-like format.
Name of input file (raw dictionary of heterogen groups): het.txt
Search a directory of PDB files with keywords? [N]: Y
Directory to search with keywords [./]: 
Name of EMBL-like format dictionary of heterogen groups. [Ehet.dat]: Ehet.dat


Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-infile]            infile     This option specifies the name of input file
                                  (raw dictionary of heterogen groups) to
                                  parse, which should be of the format
                                  specified at
                                  http://pdb.rutgers.edu/het_dictionary.txt
   -dogrep             toggle     This option specifies whether to search a
                                  directory of files (typically PDB files)
                                  with keywords. If set, HETPARSE will search
                                  the directory and will count the number of
                                  files that each heterogen appears in.
*  -dirlistpath        dirlist    This option specifies the directory to
                                  search with keywords.
  [-outfile]           outfile    This option specifies the name of EMBL-like
                                  format dictionary of heterogen groups.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


Standard (Mandatory) qualifiers:

  [-infile] (Parameter 1)
      This option specifies the name of input file (raw dictionary of heterogen groups) to parse, which should be of the format specified at http://pdb.rutgers.edu/het_dictionary.txt
      Allowed values: Input file
      Default: Required

  -dogrep
      This option specifies whether to search a directory of files (typically PDB files) with keywords. If set, HETPARSE will search the directory and will count the number of files that each heterogen appears in.
      Allowed values: Toggle value Yes/No
      Default: No

  -dirlistpath
      This option specifies the directory to search with keywords.
      Allowed values: Directory with files
      Default: ./

  [-outfile] (Parameter 2)
      This option specifies the name of EMBL-like format dictionary of heterogen groups.
      Allowed values: Output file
      Default: Ehet.dat

Additional (Optional) qualifiers: (none)
Advanced (Unprompted) qualifiers: (none)
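For use from a script, the qualifiers listed above can be assembled into a non-interactive command line. The following Python sketch only builds the argument list. The helper name build_hetparse_command is hypothetical, and the sketch assumes that the toggle -dogrep is switched on by giving the bare qualifier and that -auto suppresses the prompts, as described under the general qualifiers.

```python
import shlex


def build_hetparse_command(infile, outfile, pdb_dir=None):
    """Return an argument list for a non-interactive hetparse run.

    pdb_dir is optional: when given, the directory search is enabled
    with -dogrep and the directory is passed with -dirlistpath.
    """
    args = ["hetparse", "-infile", infile, "-outfile", outfile]
    if pdb_dir is not None:
        args += ["-dogrep", "-dirlistpath", pdb_dir]
    args.append("-auto")  # turn off prompts
    return args


cmd = build_hetparse_command("het.txt", "Ehet.dat", pdb_dir="./")
print(shlex.join(cmd))
# To execute it (requires hetparse on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a file or directory name contains spaces.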

Input file format

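hetparse reads the raw dictionary of heterogen groups, which should be of the format specified at http://pdb.rutgers.edu/het_dictionary.txt.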

Input files for usage example

File: het.txt

RESIDUE   061     58
CONECT      N1     2 N2   C5  
CONECT      N2     2 N1   N3  
CONECT      N3     2 N2   N4  
CONECT      N4     3 N3   C5   HN4 
CONECT      C5     3 N1   N4   C6  
CONECT      C6     3 C5   C7   C11 
CONECT      C7     3 C6   C8   C12 
CONECT      C8     3 C7   C9   H8  
CONECT      C9     3 C8   C10  H9  
CONECT      C10    3 C9   C11  H10 
CONECT      C11    3 C6   C10  H11 
CONECT      C12    3 C7   C13  C17 
CONECT      C13    3 C12  C14  H13 
CONECT      C14    3 C13  C15  H14 
CONECT      C15    3 C14  C16  C18 
CONECT      C16    3 C15  C17  H16 
CONECT      C17    3 C12  C16  H17 
CONECT      C18    4 C15  N19 1H18 2H18 
CONECT      N19    3 C18  C20  C33 
CONECT      C20    3 N19  C21  N25 
CONECT      C21    4 C20  C22 1H21 2H21 
CONECT      C22    4 C21  C23 1H22 2H22 
CONECT      C23    4 C22  C24 1H23 2H23 
CONECT      C24    4 C23 1H24 2H24 3H24 
CONECT      N25    2 C20  C26 
CONECT      C26    3 N25  C27  C32 
CONECT      C27    3 C26  C28  H27 
CONECT      C28    3 C27  C29  H28 
CONECT      C29    3 C28  O30  C31 
CONECT      O30    2 C29  HOU 
CONECT      C31    3 C29  C32  H31 
CONECT      C32    3 C26  C31  C33 
CONECT      C33    3 N19  C32  O34 
CONECT      O34    1 C33 
CONECT      HN4    1 N4  
CONECT      H8     1 C8  
CONECT      H9     1 C9  
CONECT      H10    1 C10 
CONECT      H11    1 C11 
CONECT      H13    1 C13 
CONECT      H14    1 C14 
CONECT      H16    1 C16 
CONECT      H17    1 C17 
CONECT     1H18    1 C18 
CONECT     2H18    1 C18 
CONECT     1H21    1 C21 
CONECT     2H21    1 C21 
CONECT     1H22    1 C22 
CONECT     2H22    1 C22 


  [Part of this file has been deleted for brevity]

CONECT     2H6     1 C6  
CONECT     1H8     1 C8  
CONECT     2H8     1 C8  
CONECT     1H9     1 C9  
CONECT     2H9     1 C9  
END
HET    104             28
HETSYN     104 TRIENTINE
HETNAM     104 N,N'-BIS(2-AMINOETHYL)-1,2-ETHANEDIAMINE
FORMUL      104    C6 H18 N4

RESIDUE   105     32
CONECT      B      3 O1   O2   C3  
CONECT      O1     2 B    H1  
CONECT      O2     2 B    H2  
CONECT      C3     4 B    N4  1H3  2H3  
CONECT      N4     3 C3   C5   H4  
CONECT      C5     3 N4   O6   C7  
CONECT      O6     1 C5  
CONECT      C7     3 C5   C8   C12 
CONECT      N11    2 O10  C12 
CONECT      O10    2 N11  C8  
CONECT      C8     3 C7   O10  C9  
CONECT      C12    3 C7   N11  C13 
CONECT      C9     4 C8  1H9  2H9  3H9  
CONECT      C13    3 C12  C14  C18 
CONECT      C14    3 C13  C15 CL1  
CONECT     CL1     1 C14 
CONECT      C15    3 C14  C16  H15 
CONECT      C16    3 C15  C17  H16 
CONECT      C17    3 C16  C18  H17 
CONECT      C18    3 C13  C17  H18 
CONECT      H1     1 O1  
CONECT      H2     1 O2  
CONECT     1H3     1 C3  
CONECT     2H3     1 C3  
CONECT      H4     1 N4  
CONECT     1H9     1 C9  
CONECT     2H9     1 C9  
CONECT     3H9     1 C9  
CONECT      H15    1 C15 
CONECT      H16    1 C16 
CONECT      H17    1 C17 
CONECT      H18    1 C18 
END
HET    105             32
HETSYN     105 CLOXACILLIN DERIVATIVE
HETNAM     105 N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACID
HETNAM   2 105 AMIDE] BORONIC ACID
FORMUL      105    C12 H12 N2 O4 B1 CL1

Excerpt from heterogen dictionary file (input)

RESIDUE   061     58
CONECT      N1     2 N2   C5
CONECT      N2     2 N1   N3
CONECT      N3     2 N2   N4
CONECT      N4     3 N3   C5   HN4
CONECT      C5     3 N1   N4   C6
CONECT      C6     3 C5   C7   C11
< data omitted for clarity >
END
HET    061             58
HETSYN     061 L-159,061
HETNAM     061 2-BUTYL-6-HYDROXY-3-[2'-(1H-TETRAZOL-5-YL)-BIPHENYL-4-
HETNAM   2 061 YLMETHYL]-3H-QUINAZOLIN-4-ONE
FORMUL      061    C26 H24 N6 O2
**
RESIDUE   072     90
CONECT      S1B    2 C1B  C2A
CONECT      C1A    3 C1B  O1A  N3A
CONECT      C1B    4 S1B  C1A  C1C  H1B
CONECT      O1A    1 C1A
CONECT      C1C    4 C1B  C1D 1H1C 2H1C
< data omitted for clarity >
END
HET    072             90
HETSYN     072 THIAZOLIDINONE; GW0072
HETNAM     072 (+/-)(2S,5S)-3-(4-(4-CARBOXYPHENYL)BUTYL)-2-HEPTYL-4-
HETNAM   2 072 OXO-5-THIAZOLIDINE
FORMUL      072    C37 H46 N2 O4 S1
**
RESIDUE   074     58
CONECT      C1     4 C2  1H1  2H1  3H1
CONECT      C2     4 C1   C3  1H2  2H2
CONECT      C3     4 C2   N1  1H3  2H3
CONECT      N1     3 C3   C4  1HN1
CONECT      C4     3 N1   O1   C5
CONECT      O1     1 C4
< data omitted for clarity >
END
HET    074             58
HETSYN     074 CA-074; [N-(L-3-TRANS-PROPYLCARBAMOYL-OXIRANE-2-
HETSYN   2 074 CARBONYL)-L-ISOLEUCYL-L-PROLINE]
HETNAM     074 [PROPYLAMINO-3-HYDROXY-BUTAN-1,4-DIONYL]-ISOLEUCYL-
HETNAM   2 074 PROLINE
FORMUL      074    C18 H31 N3 O6
< data omitted for clarity >
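The conversion from the raw dictionary to an EMBL-like record can be sketched in Python as follows. This is an illustration based only on the excerpts in this document, not the hetparse source. The function names are hypothetical, the column positions are inferred from the excerpt above, and continuation lines are joined without a separator because that is what the sample output suggests. The sketch does not attempt to reproduce how hetparse splits long synonyms over several SY lines.

```python
def parse_het_dictionary(lines):
    """Collect name (DE) and synonym (SY) text per heterogen code."""
    entries = {}
    for line in lines:
        record = line[:6]
        if record not in ("HETNAM", "HETSYN"):
            continue  # RESIDUE, CONECT, END, HET and FORMUL lines are skipped
        code = line[11:14].strip()   # 3-character heterogen code
        text = line[15:].strip()     # name or synonym text
        entry = entries.setdefault(code, {"DE": "", "SY": ""})
        key = "DE" if record == "HETNAM" else "SY"
        # Assumption: continuation lines are appended with no separator.
        entry[key] += text
    return entries


def format_record(code, entry, count=0):
    """Format one entry as an EMBL-like record; '.' marks a missing synonym."""
    return "\n".join([
        f"ID   {code}",
        f"DE   {entry['DE']}",
        f"SY   {entry['SY'] or '.'}",
        f"NN   {count}",
        "//",
    ])
```

Applied to the entry for heterogen 105 in the usage example, this yields the description with "ACID" and "AMIDE]" run together, as seen in the sample Ehet.dat.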

Output file format

Output files for usage example

File: Ehet.dat

ID   105
DE   N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACIDAMIDE] BORONIC ACID
SY   CLOXACILLIN DERIVATIVE
NN   0
//
ID   104
DE   N,N'-BIS(2-AMINOETHYL)-1,2-ETHANEDIAMINE
SY   TRIENTINE
NN   0
//
ID   103
DE   2',5'-DIDEOXY-ADENOSINE 3'-MONOPHOSPHATE
SY   .
NN   0
//
ID   102
DE   GAMMA-DEOXY-GAMMA-SULFO-GUANOSINE-5'-TRIPHOSPHATE
SY   .
NN   0
//
ID   101
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
SY   .
NN   0
//
ID   100
DE   1-(5-CHLOROINDOL-3-YL)-3-HYDROXY-3-(2H-TETRAZOL-5-YL)-PROPENONE
SY   .
NN   0
//
ID   074
DE   [PROPYLAMINO-3-HYDROXY-BUTAN-1,4-DIONYL]-ISOLEUCYL-PROLINE
SY   CA-074;
SY   [N-(L-3-TRANS-PROPYLCARBAMOYL-OXIRANE-2-CARBONYL)-L-ISOLEUCYL-L-PROLINE]
NN   0
//
ID   072
DE   
DE   (+/-)(2S,5S)-3-(4-(4-CARBOXYPHENYL)BUTYL)-2-HEPTYL-4-OXO-5-THIAZOLIDINE
SY   THIAZOLIDINONE; GW0072
NN   0
//
ID   061
DE   
DE   2-BUTYL-6-HYDROXY-3-[2'-(1H-TETRAZOL-5-YL)-BIPHENYL-4-YLMETHYL]-3H-QUINAZOLIN-4-ONE
SY   L-159,061
NN   0
//

The records used in the output file (below) are as follows:

(1) ID - 3-character abbreviation of heterogen
(2) DE - full description
(3) SY - synonym
(4) NN - no. of files which this heterogen appears in

Example of hetparse output file

ID  061
DE  2-BUTYL-6-HYDROXY-3-[2'-(1H-TETRAZOL-5-YL)-BIPHENYL-4-YLMETHYL]-3H-QUINAZ
DE  OLIN-4-ONE
SY  L-159,061
NN  2
//
ID  072
DE  (+/-)(2S,5S)-3-(4-(4-CARBOXYPHENYL)BUTYL)-2-HEPTYL-4-OXO-5-THIAZOLIDINE
SY  THIAZOLIDINONE; GW0072
NN  10
//
ID  074
DE  [PROPYLAMINO-3-HYDROXY-BUTAN-1,4-DIONYL]-ISOLEUCYL-PROLINE
SY  CA-074; [N-(L-3-TRANS-PROPYLCARBAMOYL-OXIRANE-2-CARBONYL)-L-ISOLEUCYL-L-P
SY  ROLINE]
NN  1
//      
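Programs that consume the output file need to read these records back. The Python sketch below is a minimal reader based on the record layout described above. The function name read_het_records and the returned dictionary layout are assumptions. Multiple DE lines are concatenated without a separator (the example wraps a description mid-word), each SY line is kept as a separate list item, and a synonym of "." is treated as "no synonym".

```python
def read_het_records(lines):
    """Parse EMBL-like heterogen records into a list of dictionaries."""
    def new_record():
        return {"ID": "", "DE": "", "SY": [], "NN": 0}

    records = []
    current = new_record()
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("//"):
            # End of record: store it and start a fresh one.
            records.append(current)
            current = new_record()
            continue
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag == "ID":
            current["ID"] = value
        elif tag == "DE":
            current["DE"] += value          # empty DE lines add nothing
        elif tag == "SY" and value != ".":
            current["SY"].append(value)
        elif tag == "NN":
            current["NN"] = int(value)
    return records
```

The ID is kept as a string so that codes with leading zeros, such as 061, are preserved.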

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name    Description
aaindexextract  Extract data from AAINDEX
allversusall    Does an all-versus-all global alignment for each set of sequences in an input directory and writes files of sequence similarity values
cathparse       Reads raw CATH classification files and writes DCF file (domain classification file)
cutgextract     Extract data from CUTG
domainer        Reads CCF files (clean coordinate files) for proteins and writes CCF files for domains, taken from a DCF file (domain classification file)
domainnr        Removes redundant domains from a DCF file (domain classification file). The file must contain domain sequence information, which can be added by using DOMAINSEQS
domainseqs      Adds sequence records to a DCF file (domain classification file)
domainsse       Adds secondary structure records to a DCF file (domain classification file)
pdbparse        Parses PDB files and writes CCF files (clean coordinate files) for proteins
pdbplus         Add residue solvent accessibility and secondary structure data to a CCF file (clean coordinate file) for a protein or domain
pdbtosp         Convert raw swissprot:PDB equivalence file to EMBL-like format
printsextract   Extract data from PRINTS
prosextract     Builds the PROSITE motif database for patmatmotifs to search
rebaseextract   Extract data from REBASE
scopparse       Reads raw SCOP classification files and writes a DCF file (domain classification file)
seqnr           Removes redundancy from DHF files (domain hits files) or other files of sequences
sites           Reads CCF files (clean coordinate files) and writes CON files (contact files) of residue-ligand contact data for domains in a DCF file (domain classification file)
ssematch        Searches a DCF file (domain classification file) for secondary structure matches
tfextract       Extract data from TRANSFAC

funky uses the hetparse output file as input.

Author(s)

Jon Ison (jison © rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

Waqas Awan (wawan © hgmp.mrc.ac.uk)
HGMP-RC, Genome Campus, Hinxton, Cambridge CB10 1SB, UK

History

Written (2003) - Jon Ison & Waqas Awan

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.