EMBOSS: vectorstrip


Program vectorstrip

Function

Strips out DNA between a pair of vector sequences

Description

vectorstrip strips vector sequence from the ends of sequences of interest. For example, if a fragment has been cloned into a vector and then sequenced, the sequence may contain vector data, e.g. from the cloning polylinker, at its 5' and 3' ends. vectorstrip removes these contaminating regions and outputs trimmed sequence ready for input into another application.

vectorstrip is suitable for use with low quality sequence data as it can allow for mismatches between the sequence and the vector patterns provided. You can specify the maximum level of mismatch expected.

Vector data can either be provided in a file or interactively. If presented in a file, vectorstrip will search all input sequences with all vectors listed in that file. The intention is that the user can maintain a single file for use with vectorstrip, containing all the linker sequences commonly used in the laboratory.

The two patterns for each vector are searched separately against the sequence. Once the search is completed, each of the hits of the 5' sequence is paired with each of the hits of the 3' sequence and the resulting subsequences are output. For example, if the 5' sequence matches the sequence at (a) positions 30-60 and (b) positions 70-100, and the 3' sequence matches at positions 150-175, then two subsequences will be output: from 61-149, and from 101-149. The lower the quality of the sequence, the more likely multiple hits become if nonzero mismatches are accepted.
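
The pairing step described above can be sketched in a few lines of Python. This is only an illustration of the coordinate arithmetic, not the program's own code: the function name pair_hits and the rule of skipping pairs that leave no insert are assumptions made for the example.

```python
from typing import List, Tuple

Hit = Tuple[int, int]  # (start, end), 1-based and inclusive


def pair_hits(five_prime: List[Hit], three_prime: List[Hit]) -> List[Hit]:
    """Pair every 5' hit with every 3' hit and return the regions between them."""
    regions = []
    for _, end5 in five_prime:
        for start3, _ in three_prime:
            start, end = end5 + 1, start3 - 1
            if start <= end:  # assumption: ignore pairs with nothing in between
                regions.append((start, end))
    return regions


# The worked example from the text: two 5' hits, one 3' hit.
print(pair_hits([(30, 60), (70, 100)], [(150, 175)]))
# [(61, 149), (101, 149)]
```

The same arithmetic reproduces the regions reported in usage example 3 below, where one 5' hit (43-61) and two 3' hits (183-199 and 228-244) give the regions 62-182 and 62-227.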

Default behaviour is to report only the best matches between the vector patterns and the sequence. This means that if you specify a maximum mismatch level of 10%, but the vector patterns match the sequence with zero mismatches, the search will stop and the program will output only these "best" matches. If there are no perfect matches, the program will try searching again allowing 1 mismatch, then 2, and so on until either the patterns match the sequence or the maximum specified mismatch level is exceeded. You can tell vectorstrip to show all possible matches up to your specified maximum level, as illustrated in the examples below.
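
The escalating search can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's implementation: the names find_hits and best_hits are hypothetical, mismatches are counted position by position over a sliding window, the percentage is converted to a whole number of mismatches by rounding down, and each pattern is escalated on its own.

```python
from typing import List, Tuple

Match = Tuple[int, int, int]  # (start, end, mismatches), 1-based and inclusive


def find_hits(seq: str, pattern: str, max_mismatches: int) -> List[Match]:
    """Return every window of seq that differs from pattern in at most max_mismatches positions."""
    seq, pattern = seq.upper(), pattern.upper()
    hits = []
    for i in range(len(seq) - len(pattern) + 1):
        window = seq[i:i + len(pattern)]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append((i + 1, i + len(pattern), mismatches))
    return hits


def best_hits(seq: str, pattern: str, max_percent: int) -> List[Match]:
    """Try 0 mismatches, then 1, then 2, ... and stop at the first level that matches."""
    limit = len(pattern) * max_percent // 100  # assumption: round down
    for allowed in range(limit + 1):
        hits = find_hits(seq, pattern, allowed)
        if hits:
            return hits
    return []


seq = "GATTACANNNGATTTCA"
print(best_hits(seq, "GATTACA", 30))  # only the perfect match
print(find_hits(seq, "GATTACA", 2))   # all matches up to the limit
```

In this sketch an N in the sequence counts as a mismatch, which is consistent with usage example 4 below, where each pattern overlaps two Ns and is reported with 2 mismatches.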

Usage

Here are several examples of running vectorstrip. The vectorfile and the sequence files used in these examples are given below. In each case, the same fragment has been cloned into the XhoI site of the polylinker of each vector. The cloned fragment is shown in lower case and the vector sequence in upper case so that the trimming can be readily seen.
  1. Running vectorstrip on a list of sequences with default parameters
    unix % vectorstrip @seqs.list
    Strips out DNA between a pair of vector sequences
    Are your vector sequences in a file? [Y]: 
    Name of vectorfile: vectors
    Max allowed % mismatch [10]: 
    Show only the best hits (minimise mismatches)? [Y]: 
    Output file [pbluescript.vectorstrip]:stdout 
    Output sequence [pbluescript.fasta]: 
    
    Sequence: pBlueScript    Vector: pTYB1  No match
    
    
    Sequence: pBlueScript    Vector: pBS_KS+
    5' sequence matches:
            From 67 to 83 with 0 mismatches
    3' sequence matches:
            From 205 to 219 with 0 mismatches
    Sequences output to file:
            from 84 to 204
                    tcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagca
                    cacagtgacatgagagacacagatatagagacagatagacgatagacaga
                    cagcatatatagacagatagc
            sequence trimmed from 5' end:
                    GGAAACAGCTAATGACCATGATTACGCCAAGCGCGCAATTAACCCTCACT
                    AAAGGGAACAAAAGCTGGGTACCGGGCCCCCCC
            sequence trimmed from 3' end:
                    TCGAGGTCGACGGTATCGATAAGCTTGATATCG
    
    
    Sequence: pBlueScript    Vector: pLITMUS        No match
    
    Sequence: litmus.seq     Vector: pTYB1  No match
    
    Sequence: litmus.seq     Vector: pBS_KS+        No match
    
    
    Sequence: litmus.seq     Vector: pLITMUS
    5' sequence matches:
            From 43 to 61 with 0 mismatches
    3' sequence matches:
            From 183 to 199 with 0 mismatches
    Sequences output to file:
            from 62 to 182
                    tcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagca
                    cacagtgacatgagagacacagatatagagacagatagacgatagacaga
                    cagcatatatagacagatagc
            sequence trimmed from 5' end:
                    TCTAGAACCGGTGACGTCTCCCATGGTGAAGCTTGGATCCACGATATCCT
                    GCAGGAATTCC
    
    
    Sequence: pTYB1.seq      Vector: pTYB1
    5' sequence matches:
            From 40 to 58 with 0 mismatches
    3' sequence matches:
            From 180 to 196 with 0 mismatches
    Sequences output to file:
            from 59 to 179
                    tcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagca
                    cacagtgacatgagagacacagatatagagacagatagacgatagacaga
                    cagcatatatagacagatagc
            sequence trimmed from 5' end:
                    CTTTAAGAAGGAGATATACATATGGCTAGCTCGCGAGTCGACGGCGGCCG
                    CGAATTCC
            sequence trimmed from 3' end:
                    TCGAGGGCTCTTCCTGCTTTGCCAAGGGTACCAATGTTTTAATGGCGGAT
    
    
    Sequence: pTYB1.seq      Vector: pBS_KS+        No match
    
    Sequence: pTYB1.seq      Vector: pLITMUS        No match
    
    
    unix % more pbluescript.fasta 
    >pBlueScript_from_84_to_204 KS+
    tcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagcacacagtgaca
    tgagagacacagatatagagacagatagacgatagacagacagcatatatagacagatag
    c
    >litmus.seq_from_62_to_182
    tcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagcacacagtgaca
    tgagagacacagatatagagacagatagacgatagacagacagcatatatagacagatag
    c
    >pTYB1.seq_from_59_to_179
    tcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagcacacagtgaca
    tgagagacacagatatagagacagatagacgatagacagacagcatatatagacagatag
    c
    
  2. Running vectorstrip allowing maximum 30% mismatch, and asking only for best hits
    unix % vectorstrip litmus.seq
    Strips out DNA between a pair of vector sequences
    Are your vector sequences in a file? [Y]: 
    Name of vectorfile: vectors
    Max allowed % mismatch [10]: 30
    Show only the best hits (minimise mismatches)? [Y]: 
    Output file [litmus.vectorstrip]: stdout
    Output sequence [litmus.fasta]: 
    
    Sequence: litmus.seq     Vector: pTYB1  No match
    
    Sequence: litmus.seq     Vector: pBS_KS+        No match
    
    
    Sequence: litmus.seq     Vector: pLITMUS
    5' sequence matches:
            From 43 to 61 with 0 mismatches
    3' sequence matches:
            From 183 to 199 with 0 mismatches
    Sequences output to file:
            from 62 to 182
                    tcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagca
                    cacagtgacatgagagacacagatatagagacagatagacgatagacaga
                    cagcatatatagacagatagc
            sequence trimmed from 5' end:
                    TCTAGAACCGGTGACGTCTCCCATGGTGAAGCTTGGATCCACGATATCCT
                    GCAGGAATTCC
            sequence trimmed from 3' end:
                    TCGAGACCGTACGTGCGCGCGAATGCATCCAGATCTTCCCTCTAGTCAAG
                    GCCTTAAGTGAGTCGTATTACGGA
    
  3. Running vectorstrip allowing maximum 30% mismatch, and asking for all hits
    unix % vectorstrip litmus.seq
    Strips out DNA between a pair of vector sequences
    Are your vector sequences in a file? [Y]: 
    Name of vectorfile: vectors
    Max allowed % mismatch [10]: 30
    Show only the best hits (minimise mismatches)? [Y]: N
    Output file [litmus.vectorstrip]: stdout
    Output sequence [litmus.fasta]: 
    
    Sequence: litmus.seq     Vector: pTYB1  No match
    
    Sequence: litmus.seq     Vector: pBS_KS+        No match
    
    
    Sequence: litmus.seq     Vector: pLITMUS
    5' sequence matches:
            From 43 to 61 with 0 mismatches
    3' sequence matches:
            From 183 to 199 with 0 mismatches
            From 228 to 244 with 5 mismatches
    Sequences output to file:
            from 62 to 182
                    tcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagca
                    cacagtgacatgagagacacagatatagagacagatagacgatagacaga
                    cagcatatatagacagatagc
            sequence trimmed from 5' end:
                    TCTAGAACCGGTGACGTCTCCCATGGTGAAGCTTGGATCCACGATATCCT
                    GCAGGAATTCC
            sequence trimmed from 3' end:
                    TCGAGACCGTACGTGCGCGCGAATGCATCCAGATCTTCCCTCTAGTCAAG
                    GCCTTAAGTGAGTCGTATTACGGA
    
            from 62 to 227
                    tcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagca
                    cacagtgacatgagagacacagatatagagacagatagacgatagacaga
                    cagcatatatagacagatagcTCGAGACCGTACGTGCGCGCGAATGCATC
                    CAGATCTTCCCTCTAG
            sequence trimmed from 5' end:
                    TCTAGAACCGGTGACGTCTCCCATGGTGAAGCTTGGATCCACGATATCCT
                    GCAGGAATTCC
            sequence trimmed from 3' end:
                    TCAAGGCCTTAAGTGAGTCGTATTACGGA
    
  4. Running vectorstrip against a sequence containing Ns
    unix % vectorstrip pTYB1_N.seq
    Strips out DNA between a pair of vector sequences
    Are your vector sequences in a file? [Y]: 
    Name of vectorfile: vectors
    Max allowed % mismatch [10]: 30
    Show only the best hits (minimise mismatches)? [Y]: 
    Output file [ptyb1.vectorstrip]: stdout
    Output sequence [ptyb1.fasta]: 
    
    
    Sequence: pTYB1.seq      Vector: pTYB1
    5' sequence matches:
            From 40 to 58 with 2 mismatches
    3' sequence matches:
            From 180 to 196 with 2 mismatches
    Sequences output to file:
            from 59 to 179
                    tcnagagccgtatngcgatatngcgcacatgcgntggacacagangagca
                    cacagtnacatgagagncacagatntagagacagatngacgataganaga
                    cagcatanatagacanatagc
            sequence trimmed from 5' end:
                    CTTTAAGNAGGAGANATACANATGGCNAGCTCGCGANTCGACGGCGGNCG
                    CGAATNCC
            sequence trimmed from 3' end:
                    TCGNGGGCTCTTCCNGCTTTGCCANGGGTACCAANGTTTTAATGGCNGAT
    
    
    Sequence: pTYB1.seq      Vector: pBS_KS+        No match
    
    Sequence: pTYB1.seq      Vector: pLITMUS        No match
    
    

Command line arguments

   Mandatory qualifiers (* if not always prompted):
  [-sequence]          seqall     Sequence database USA
  [-[no]vectorfile]    bool       Are your vector sequences in a file?
* [-vectors]           infile     Name of vectorfile
*  -linkera            string     5' sequence
*  -linkerb            string     3' sequence
   -mismatch           integer    Max allowed % mismatch
   -[no]besthits       bool       Show only the best hits (minimise
                                  mismatches)?
  [-outf]              outfile    Output file name
  [-outseq]            seqoutall  Output sequence(s) USA

   Optional qualifiers: (none)
   Advanced qualifiers: (none)

Mandatory qualifiers                                                              Allowed values          Default
[-sequence] (Parameter 1)        Sequence database USA                            Readable sequence(s)    Required
[-[no]vectorfile] (Parameter 2)  Are your vector sequences in a file?             Yes/No                  Yes
[-vectors] (Parameter 3)         Name of vectorfile                               Input file              Required
-linkera                         5' sequence                                      Any string is accepted  An empty string is accepted
-linkerb                         3' sequence                                      Any string is accepted  An empty string is accepted
-mismatch                        Max allowed % mismatch                           Any integer value       10
-[no]besthits                    Show only the best hits (minimise mismatches)?   Yes/No                  Yes
[-outf] (Parameter 4)            Output file name                                 Output file             <sequence>.vectorstrip
[-outseq] (Parameter 5)          Output sequence(s) USA                           Writeable sequence(s)   <sequence>.format

Optional qualifiers                                                               Allowed values          Default
(none)

Advanced qualifiers                                                               Allowed values          Default
(none)
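
For scripted use, the qualifiers listed above can be assembled into a single command line. The Python helper below is a sketch only: the function build_command is hypothetical, it uses only the qualifier names from the table above, and it has not been checked against a real installation, so whether vectorstrip still prompts for anything when all of these are supplied is not confirmed here.

```python
from typing import List, Optional


def build_command(sequence: str, outf: str, outseq: str,
                  vectors: Optional[str] = None,
                  linkera: str = "", linkerb: str = "",
                  mismatch: int = 10, besthits: bool = True) -> List[str]:
    """Build a vectorstrip argument list from the documented qualifiers."""
    cmd = ["vectorstrip", "-sequence", sequence]
    if vectors is not None:
        # Vector patterns come from a vectorfile.
        cmd += ["-vectorfile", "-vectors", vectors]
    else:
        # Vector patterns are given directly as 5' and 3' strings.
        cmd += ["-novectorfile", "-linkera", linkera, "-linkerb", linkerb]
    cmd += ["-mismatch", str(mismatch)]
    cmd.append("-besthits" if besthits else "-nobesthits")
    cmd += ["-outf", outf, "-outseq", outseq]
    return cmd


# Roughly the settings of usage example 3: 30% mismatch, all hits.
cmd = build_command("litmus.seq", "stdout", "litmus.fasta",
                    vectors="vectors", mismatch=30, besthits=False)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run from the Python standard library if the program is installed and on the search path.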

Input file format

Each line should contain the name of the vector, the 5' pattern and the 3' pattern.
Lines beginning with # are treated as comments and ignored.
If only one vector sequence is given on a line, it will be assumed that this is the 5' pattern.
If a vector name is given but no pattern data, the vector will not be used.

For example, the file used in the examples above is:

pTYB1	GACGGCGGCCGCGAATTCC	TCGAGGGCTCTTCCTGC
pBS_KS+	GGGTACCGGGCCCCCCC	TCGAGGTCGACGGTA		
pLITMUS	GATATCCTGCAGGAATTCC	TCGAGACCGTACGTGCG
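
The rules above can be illustrated with a small Python reader. This is a sketch, not the parser used by vectorstrip: the function name parse_vectors is hypothetical, and it assumes fields are separated by any run of spaces or tabs.

```python
from typing import Dict, Tuple


def parse_vectors(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each vector name to its (5' pattern, 3' pattern)."""
    vectors = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # comment line
        fields = line.split()  # assumption: whitespace-separated fields
        if len(fields) < 2:
            continue  # blank line, or a name with no pattern data
        name, five = fields[0], fields[1]
        # A single pattern is taken to be the 5' pattern.
        three = fields[2] if len(fields) > 2 else ""
        vectors[name] = (five, three)
    return vectors


example = """# linkers used in the lab
pTYB1\tGACGGCGGCCGCGAATTCC\tTCGAGGGCTCTTCCTGC
fiveonly\tGGGTACCGGGCCCCCCC
nameonly
"""
print(parse_vectors(example))
```

Here the entries fiveonly and nameonly are made up to exercise the single-pattern and name-only rules; only pTYB1 comes from the example file above.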

The sequence files used in the examples above are:

  1. seqs.list
    bluescript.seq
    litmus.seq
    pTYB1.seq
    
  2. bluescript.seq
    > pBlueScript KS+
    GGAAACAGCTAATGACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAAC
    AAAAGCTGGGTACCGGGCCCCCCCtcgagagccgtattgcgatatagcgcacatgcgtt
    ggacacagatgagcacacagtgacatgagagacacagatatagagacagatagacgata
    gacagacagcatatatagacagatagcTCGAGGTCGACGGTATCGATAAGCTTGATATCG
    
  3. litmus.seq
    > litmus.seq 
    TCTAGAACCGGTGACGTCTCCCATGGTGAAGCTTGGATCCACGATATCCTGCAGGAATT
    CCtcgagagccgtattgcgatatagcgcacatgcgttggacacagatgagcacacagtg
    acatgagagacacagatatagagacagatagacgatagacagacagcatatatagacag
    atagcTCGAGACCGTACGTGCGCGCGAATGCATCCAGATCTTCCCTCTAGTCAAGGCCT
    TAAGTGAGTCGTATTACGGA
    
  4. pTYB1.seq
    >pTYB1.seq
    CTTTAAGAAGGAGATATACATATGGCTAGCTCGCGAGTCGACGGCGGCCGCGAATTCCtc
    gagagccgtattgcgatatagcgcacatgcgttggacacagatgagcacacagtgacatg
    agagacacagatatagagacagatagacgatagacagacagcatatatagacagatagcT
    CGAGGGCTCTTCCTGCTTTGCCAAGGGTACCAATGTTTTAATGGCGGAT
    
  5. pTYB1_N.seq
    >pTYB1.seq lower quality
    CTTTAAGNAGGAGANATACANATGGCNAGCTCGCGANTCGACGGCGGNCGCGAATNCCtc
    nagagccgtatngcgatatngcgcacatgcgntggacacagangagcacacagtnacatg
    agagncacagatntagagacagatngacgataganagacagcatanatagacanatagcT
    CGNGGGCTCTTCCNGCTTTGCCANGGGTACCAANGTTTTAATGGCNGAT
    
    
In each case the vector sequence is given in upper case and the cloned sequence in lower case.

Output file format

Two types of output file are produced:
  1. The sequence file(s), which contain the trimmed subsequence(s) produced by vectorstrip, either all in one file or in separate files if the command line flag -ossingle is used.
  2. The results summary file, e.g.
    Sequence: pTYB1.seq      Vector: pTYB1
    5' sequence matches:
            From 40 to 58 with 2 mismatches
    3' sequence matches:
            From 180 to 196 with 2 mismatches
    Sequences output to file:
            from 59 to 179
                    tcnagagccgtatngcgatatngcgcacatgcgntggacacagangagca
                    cacagtnacatgagagncacagatntagagacagatngacgataganaga
                    cagcatanatagacanatagc
            sequence trimmed from 5' end:
                    CTTTAAGNAGGAGANATACANATGGCNAGCTCGCGANTCGACGGCGGNCG
                    CGAATNCC
            sequence trimmed from 3' end:
                    TCGNGGGCTCTTCCNGCTTTGCCANGGGTACCAANGTTTTAATGGCNGAT
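
The trimmed sequences in usage example 1 are named after the input sequence and the reported region, as in pTYB1.seq_from_59_to_179. The Python sketch below shows how such a record could be produced from a sequence and a region. It is an illustration only: the function fasta_record is hypothetical, the 60-character line width is taken from the pbluescript.fasta listing above, and any description text after the name is left out.

```python
def fasta_record(name: str, sequence: str, start: int, end: int, width: int = 60) -> str:
    """Format the 1-based inclusive region start..end of sequence as one FASTA record."""
    insert = sequence[start - 1:end]
    lines = [f">{name}_from_{start}_to_{end}"]
    # Wrap the trimmed sequence at a fixed line width.
    lines += [insert[i:i + width] for i in range(0, len(insert), width)]
    return "\n".join(lines) + "\n"


# Toy input: 10 bases of "vector", 130 of "insert", 5 of "vector".
toy = "A" * 10 + "c" * 130 + "G" * 5
print(fasta_record("toy.seq", toy, 11, 140), end="")
```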
    
    

Data files

Notes

References

Warnings

Diagnostic Error Messages

  1. No suitable vectors found - exiting
    indicates that the 5' and 3' patterns for the vectors were blank - usually this is as a result of an empty vectorfile.
  2. Illegal pattern
    indicates that one of the vector patterns could not be compiled and therefore cannot be searched.
  3. 5' and 3' sequence matches are identical; inconclusive
    indicates that the 5' and 3' patterns provided were identical, and that they only match the sequence once. Thus the program cannot determine which part of the sequence is vector and which is insert.

Exit status

It always exits with status 0.

Known bugs

See also

Program name  Description
cutseq        Removes a specified section from a sequence
descseq       Alter the name or description of a sequence
extractseq    Extract regions from a sequence
maskfeat      Mask off features of a sequence
maskseq       Mask off regions of a sequence
megamerger    Merge two large overlapping nucleic acid sequences
merger        Merge two overlapping nucleic acid sequences
newseq        Type in a short new sequence
noreturn      Removes carriage return from ASCII files
nthseq        Writes one sequence from a multiple set of sequences
pasteseq      Insert one sequence into another
revseq        Reverse and complement a sequence
splitter      Split a sequence into (overlapping) smaller sequences
trimseq       Trim ambiguous bits off the ends of sequences

Author(s)

This application was written by Val Curwen (vcurwen@hgmp.mrc.ac.uk)

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments