EMBOSS: tfscan


Program tfscan

Function

Scans DNA sequences for transcription factors

Description

The TRANSFAC Database is a database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. It covers the whole range from yeast to human.

The SITE data from TRANSFAC contains information on individual (putatively) regulatory protein binding sites. It has been divided into the following taxonomic groups: fungi, insect, plant, vertebrate, and other.

The program tfscan takes a sequence and the name of one of these taxonomic groups and does a fast match of the TRANSFAC sequences against the input sequence (optionally allowing mismatches).

The result is a list of the positions in the input sequence that match binding sites in the TRANSFAC SITE database.
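
The matching step can be pictured with a short Python sketch. It is an illustration only, not the algorithm tfscan itself uses: it slides each site along one strand, counts mismatching bases, and reports 1-based start and end positions in the style of the tfscan output. The function names and the toy sequence are assumptions made for this example; the two site strings are taken from the example output shown later on this page.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, int, int, str]  # (site identifier, start, end, matched text)


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def scan_sites(sequence: str, sites: Dict[str, str], max_mismatches: int = 0) -> List[Hit]:
    """Naive scan of one strand; coordinates are 1-based and inclusive."""
    seq = sequence.lower()
    hits: List[Hit] = []
    for site_id, pattern in sites.items():
        pat = pattern.lower()
        width = len(pat)
        if width == 0:
            continue  # an empty site would match everywhere
        for offset in range(len(seq) - width + 1):
            window = seq[offset:offset + width]
            if count_mismatches(window, pat) <= max_mismatches:
                hits.append((site_id, offset + 1, offset + width, window))
    return hits


# Hypothetical toy input; the real program reads sites from the TRANSFAC data files.
toy_sequence = "aaggtgggcctgtgcaa"
sites = {"HS$GPB_02": "ggtggg", "RAT$RPK_02": "tgtgc"}
exact_hits = scan_sites(toy_sequence, sites)
relaxed_hits = scan_sites(toy_sequence, sites, max_mismatches=1)
```

Allowing even one mismatch adds hits for the five-base site in this toy sequence, which shows how quickly relaxed matching inflates the result list.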

Because the binding sites are so small, there will be many spurious (false positive) matches.
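
A rough estimate shows why short sites match so often by chance. The sketch below assumes a random sequence with independent, equally frequent bases, exact matching, and a single strand; none of these assumptions comes from tfscan itself, and the function name is hypothetical. Under them, a site of length k is expected to occur (L - k + 1) / 4^k times in a sequence of length L.

```python
def expected_chance_matches(seq_length: int, site_length: int) -> float:
    """Expected exact occurrences of one fixed site in a uniform random sequence."""
    positions = seq_length - site_length + 1  # number of possible start positions
    if positions <= 0 or site_length <= 0:
        return 0.0
    return positions / 4 ** site_length


# Sequence length taken from the example run (HSFOS, 1 to 6210).
for k in (5, 6, 8, 13):
    print(k, expected_chance_matches(6210, k))
```

For a five-base site in the 6210-base example sequence the estimate is about six chance occurrences, so hits to sites of that length carry little evidence on their own.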

Usage

Here is a sample session with tfscan.

% tfscan
Input sequence(s): embl:hsfos
Transcription Factor Class
         F : fungi
         I : insect
         P : plant
         V : vertebrate
         O : other
Select class [V]: v
Number of mismatches [0]: 
Output file [hsfos.tfscan]: 

Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
   -menu               list       Select class
   -mismatch           integer    Number of mismatches
  [-outfile]           outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers                    Allowed values         Default
[-sequence]     Sequence database USA   Readable sequence(s)   Required
(Parameter 1)
-menu           Select class            F (fungi)              V
                                        I (insect)
                                        P (plant)
                                        V (vertebrate)
                                        O (other)
-mismatch       Number of mismatches    Integer 0 or more      0
[-outfile]      Output file name        Output file            <sequence>.tfscan
(Parameter 2)

Optional qualifiers                     Allowed values         Default
(none)

Advanced qualifiers                     Allowed values         Default
(none)
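
For scripted use, the qualifiers listed above can be assembled into a single command line. The following Python sketch only builds the argument list; the helper name build_tfscan_command is hypothetical, and the qualifier names and allowed values are the ones documented in this section.

```python
import shlex
from typing import List

VALID_CLASSES = {"F", "I", "P", "V", "O"}  # fungi, insect, plant, vertebrate, other


def build_tfscan_command(sequence: str, outfile: str,
                         menu: str = "V", mismatch: int = 0) -> List[str]:
    """Build an argument list for tfscan from the documented qualifiers."""
    menu = menu.upper()
    if menu not in VALID_CLASSES:
        raise ValueError(f"unknown class {menu!r}; expected one of {sorted(VALID_CLASSES)}")
    if mismatch < 0:
        raise ValueError("mismatch must be an integer of 0 or more")
    return ["tfscan",
            "-sequence", sequence,
            "-menu", menu,
            "-mismatch", str(mismatch),
            "-outfile", outfile]


command = build_tfscan_command("embl:hsfos", "hsfos.tfscan")
print(shlex.join(command))  # shell-quoted form, handy for logging
```

The list can be passed to subprocess.run on a system where tfscan and its data files are installed. Because tfscan is documented to exit with status 0 in all cases, a script should check the output file rather than the return code.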

Input file format

Any nucleic acid sequence, given as a USA (Uniform Sequence Address).

Output file format

Here is part of the output from the example run. Some of the sites in TRANSFAC are extremely short.


TFSCAN of HSFOS from 1 to 6210

HS$CFOS_20           R08485   384   396   agttcccgtcaat
DOG$ATP1A_01         R08484   3057  3063  gacatgg
HS$CEBPA_01          R08471   4535  4540  cacgtg
HS$GPB_05            R08210   3716  3721  gtatct
HS$GPB_02            R08207   5837  5842  ggtggg
HS$GPB_02            R08207   2399  2404  ggtggg
HS$GPB_02            R08207   2077  2082  ggtggg
MOUSE$TIMP1_02       R08168   2362  2368  caggaag
MOUSE$BMG_11         R08167   960   965   ggatag
RAT$RPK_02           R08166   6136  6140  tgtgc
RAT$RPK_02           R08166   5953  5957  tgtgc
RAT$RPK_02           R08166   4433  4437  tgtgc
RAT$RPK_02           R08166   4143  4147  tgtgc
RAT$RPK_02           R08166   3450  3454  tgtgc
RAT$RPK_02           R08166   3246  3250  tgtgc
RAT$RPK_02           R08166   3154  3158  tgtgc
RAT$RPK_02           R08166   1128  1132  tgtgc
HS$TIMP1_02          R08152   2361  2368  ccaggaag
HS$TIMP1_01          R08151   2642  2648  tgagtaa
HS$IL3_08            R05028   4376  4381  tgtggg
HS$IL3_08            R05028   3471  3476  tgtggg
HS$IL3_08            R05028   2584  2589  tgtggg
HS$IL3_08            R05028   2066  2071  tgtggg
HS$CATHD_01          R04883   1430  1435  ggcggg
HS$CATHD_01          R04883   1092  1097  ggcggg
HS$CATHD_01          R04883   569   574   ggcggg
RAT$IGFBP2_02        R04793   5123  5128  gggcgg
RAT$IGFBP2_02        R04793   1429  1434  gggcgg
RAT$IGFBP2_02        R04793   1091  1096  gggcgg
RAT$IGFBP2_02        R04793   607   612   gggcgg
RAT$IGFBP2_01        R04792   5123  5128  gggcgg
RAT$IGFBP2_01        R04792   1429  1434  gggcgg
RAT$IGFBP2_01        R04792   1091  1096  gggcgg
RAT$IGFBP2_01        R04792   607   612   gggcgg
HS$A14COL_01         R04791   5123  5128  gggcgg
HS$A14COL_01         R04791   1429  1434  gggcgg
HS$A14COL_01         R04791   1091  1096  gggcgg
HS$A14COL_01         R04791   607   612   gggcgg
HS$A24COL_03         R04790   5123  5128  gggcgg
etc......

The output consists of a title line then 5 columns separated by whitespace.

The first column is the identifier of the entry.

The second column is the Accession Number of the entry.

The third and fourth columns are the start and end positions of the match in your input sequence.

The fifth column is the sequence of the region where a match has been found.
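
The five-column layout described above is easy to read back into a program. The Python sketch below is one possible parser, not part of EMBOSS; the names TfscanHit and parse_tfscan are hypothetical. It keeps only lines that have exactly five whitespace-separated fields with numeric positions, so the title line and blank lines are skipped.

```python
from typing import Iterable, List, NamedTuple


class TfscanHit(NamedTuple):
    identifier: str   # column 1: entry identifier
    accession: str    # column 2: accession number
    start: int        # column 3: start of the match in the input sequence
    end: int          # column 4: end of the match in the input sequence
    match: str        # column 5: sequence of the matched region


def parse_tfscan(lines: Iterable[str]) -> List[TfscanHit]:
    """Parse tfscan output lines, ignoring anything that is not a hit line."""
    hits: List[TfscanHit] = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5 or not (fields[2].isdigit() and fields[3].isdigit()):
            continue  # title line, blank line, or other non-hit text
        hits.append(TfscanHit(fields[0], fields[1],
                              int(fields[2]), int(fields[3]), fields[4]))
    return hits


# Lines copied from the example output above.
sample = """TFSCAN of HSFOS from 1 to 6210

HS$CFOS_20           R08485   384   396   agttcccgtcaat
RAT$RPK_02           R08166   6136  6140  tgtgc
"""
hits = parse_tfscan(sample.splitlines())
```

Once parsed, the hits can be filtered, for example by discarding matches shorter than a chosen length to reduce the number of spurious results.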

For further details on an entry from the TRANSFAC database, see:
http://transfac.gbf.de/cgi-bin/qt/search.pl

Data files

tfscan reads the TRANSFAC SITE data held in the EMBOSS data files.

Your EMBOSS administrator will have to run the EMBOSS program tfextract in order to set these files up from the TRANSFAC distribution files.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".

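The directories are searched in the following order:

   . (your current directory)
   .embossdata (under your current directory)
   ~/ (your home directory)
   ~/.embossdata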
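
A minimal Python sketch of such a lookup, assuming the four user locations are tried in the order they are mentioned above (the current directory, its ".embossdata" subdirectory, the home directory, then "~/.embossdata"). The function name is hypothetical, and any fall-back to the directory named by EMBOSS_DATA is deliberately left out.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def find_user_data_file(name: str,
                        cwd: Optional[PathLike] = None,
                        home: Optional[PathLike] = None) -> Optional[Path]:
    """Return the first user-supplied copy of a data file, or None if there is none."""
    cwd_path = Path(cwd) if cwd is not None else Path.cwd()
    home_path = Path(home) if home is not None else Path.home()
    # Assumed search order: project-specific locations first, then per-user ones.
    candidates = [
        cwd_path,
        cwd_path / ".embossdata",
        home_path,
        home_path / ".embossdata",
    ]
    for directory in candidates:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


# 'Exxx.dat' is the placeholder file name used in the embossdata example above.
found = find_user_data_file("Exxx.dat")
print(found if found is not None else "no user copy found")
```

Putting the project-specific locations first means a modified file in the working directory overrides a copy kept in the home directory.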

Notes

None.

References

Warnings

Your EMBOSS administrator will have to run the EMBOSS program tfextract in order to set up the data files from the TRANSFAC distribution files.

Diagnostic Error Messages

"EMBOSS An error in tfscan.c at line 82:
Either EMBOSS_DATA undefined or TFEXTRACT needs running"

This means that you should contact your EMBOSS administrator and ask them to run the tfextract program to set up the TRANSFAC data for EMBOSS.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Your EMBOSS administrator will have to run the EMBOSS program tfextract in order to set up the data files from the TRANSFAC distribution files.

Author(s)

This application was written by Alan Bleasby (ableasby@hgmp.mrc.ac.uk)

History

Written Summer 2000 - Alan Bleasby

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments