fseqbootall

 

Function

Bootstrapped sequences algorithm

Description

Reads in a data set and produces multiple data sets from it by bootstrap resampling. Since most programs in the current version of the package allow processing of multiple data sets, this can be used together with the consensus tree program CONSENSE to do bootstrap (or delete-half-jackknife) analyses with most of the methods in this package. This program also allows the Archie/Faith technique of permutation of species within characters. It can also rewrite a data set to convert it between the PHYLIP Interleaved and Sequential forms, and into a preliminary version of a new XML sequence alignment format which is under development.
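As an illustration of the permutation of species within characters mentioned above, here is a minimal Python sketch. It is not the program's own code: the function name permute_within_characters and the use of Python's random module are assumptions made for this example. Each character (column) has its states shuffled independently among the species, so the state counts in every column are preserved while the association between characters is broken.

```python
import random

def permute_within_characters(seqs, rng):
    """Shuffle the states of each character (column) among the species.

    seqs: list of equal-length strings, one per species.
    rng:  a random.Random instance (an assumption of this sketch).
    """
    nsites = len(seqs[0])
    columns = []
    for site in range(nsites):
        column = [s[site] for s in seqs]
        rng.shuffle(column)  # permute species within this character
        columns.append(column)
    # Reassemble one row per species from the permuted columns.
    return ["".join(col[i] for col in columns) for i in range(len(seqs))]

# The five sequences from the usage example in this document.
data = ["AACAAC", "AACCCC", "ACCAAC", "CCACCA", "CCAAAC"]
permuted = permute_within_characters(data, random.Random(3))
for row in permuted:
    print(row)
```

The seed passed to random.Random only makes the sketch repeatable; it will not reproduce the random numbers used by fseqbootall itself.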

Algorithm

SEQBOOT is a general bootstrapping and data set translation tool. It is intended to allow you to generate multiple data sets that are resampled versions of the input data set. Since almost all programs in the package can analyze these multiple data sets, this allows almost anything in this package to be bootstrapped, jackknifed, or permuted. SEQBOOT can handle molecular sequences, binary characters, restriction sites, or gene frequencies. It can also convert data sets between Sequential and Interleaved format, and into the NEXUS format or into a new XML sequence alignment format.
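The idea of a resampled data set can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SEQBOOT's implementation: the function names bootstrap_sites, resample and site_weights are invented for the example. Sites are drawn with replacement, and the drawn indices are sorted so that the replicate keeps sites in their original order, which is how the sample output later in this document appears to be arranged. The per-site counts are the kind of information the "weights" output choice refers to.

```python
import random
from collections import Counter

def bootstrap_sites(nsites, rng):
    """Draw nsites site indices with replacement, returned in sorted order."""
    return sorted(rng.randrange(nsites) for _ in range(nsites))

def resample(seqs, picks):
    """Build a replicate data set from the chosen site indices."""
    return ["".join(s[i] for i in picks) for s in seqs]

def site_weights(nsites, picks):
    """Number of times each original site was drawn (0 if never drawn)."""
    counts = Counter(picks)
    return [counts[i] for i in range(nsites)]

# The five sequences from the usage example in this document.
data = ["AACAAC", "AACCCC", "ACCAAC", "CCACCA", "CCAAAC"]
rng = random.Random(3)  # seeded only to make the sketch repeatable
picks = bootstrap_sites(6, rng)
replicate = resample(data, picks)
weights = site_weights(6, picks)
print(picks, weights)
print("\n".join(replicate))
```

Writing the weights instead of the replicate carries the same information in less space, since the replicate can be rebuilt by repeating each site as many times as its weight.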

To carry out a bootstrap (or jackknife, or permutation test) with some method in the package, you may need to use three programs. First, you need to run SEQBOOT to take the original data set and produce a large number of bootstrapped or jackknifed data sets (somewhere between 100 and 1000 is usually adequate). Then you need to find the phylogeny estimate for each of these, using the particular method of interest. For example, if you were using DNAPARS you would first run SEQBOOT and make a file with 100 bootstrapped data sets. Then you would give this file the proper name so that it becomes the input file for DNAPARS. Running DNAPARS with the M (Multiple Data Sets) menu choice and telling it to expect 100 data sets, you would generate a big output file as well as a treefile with the trees from the 100 data sets. This treefile could be renamed so that it would serve as the input for CONSENSE. When CONSENSE is run, the majority rule consensus tree will result, showing the outcome of the analysis.

This may sound tedious, but the run of CONSENSE is fast, and that of SEQBOOT is fairly fast, so that it will not actually take any longer than a run of a single bootstrap program with the same original data and the same number of replicates. This is not very hard and allows bootstrapping or jackknifing on many of the methods in this package. The same steps are necessary with all of them. When things are done this way, some of the intermediate files (the tree file from the DNAPARS run, for example) can be used to summarize the results of the bootstrap in ways other than the majority rule consensus method.

If you are using the Distance Matrix programs, you will have to add one extra step to this, calculating distance matrices from each of the replicate data sets, using DNADIST or GENDIST. So (for example) you would run SEQBOOT, then run DNADIST using the output of SEQBOOT as its input, then run (say) NEIGHBOR using the output of DNADIST as its input, and then run CONSENSE using the tree file from NEIGHBOR as its input.

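The resampling methods available, selected with the -test qualifier, are:

b (Bootstrap)
j (Jackknife)
c (Permute species for each character)
o (Permute character order)
s (Permute within species)

The same qualifier also offers r (Rewrite data), which converts the data set to another format rather than resampling it.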

Usage

Here is a sample session with fseqbootall


% fseqbootall -seed 3 
Bootstrapped sequences algorithm
Input sequence set: seqboot.dat
Output file [seqboot.fseqbootall]: 


 bootstrap: true
jackknife: false
 permute: false
 lockhart: false
 ild: false
 justwts: false 

completed replicate number   10
completed replicate number   20
completed replicate number   30
completed replicate number   40
completed replicate number   50
completed replicate number   60
completed replicate number   70
completed replicate number   80
completed replicate number   90
completed replicate number  100

Output written to file "seqboot.fseqbootall"

Done.



Command line arguments

   Standard (Mandatory) qualifiers:
  [-infile]            seqset     Sequence set USA
  [-outfile]           outfile    Output file name

   Additional (Optional) qualifiers (* if not always prompted):
   -categories         properties File of input categories
   -mixfile            properties File of mixtures
   -ancfile            properties File of ancestors
   -weights            properties Weights file
   -factorfile         properties Factors file
   -datatype           menu       Choose the datatype
   -test               menu       Choose test
*  -regular            toggle     Altered sampling fraction
*  -fracsample         float      Samples as percentage of sites
*  -rewriteformat      menu       Output format
*  -seqtype            menu       Output format
*  -morphseqtype       menu       Output format
*  -blocksize          integer    Block size for bootstraping
*  -reps               integer    How many replicates
*  -justweights        menu       Write out datasets or just weights
*  -enzymes            boolean    Is the number of enzymes present in input
                                  file
*  -all                boolean    All alleles present at each locus
*  -seed               integer    Random number seed between 1 and 32767 (must
                                  be odd)
   -printdata          boolean    Print out the data at start of run
*  -[no]dotdiff        boolean    Use dot-differencing
   -[no]progress       boolean    Print indications of progress of run

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-infile" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


Standard (Mandatory) qualifiers:
Qualifier        Description                           Allowed values                          Default
[-infile]        Sequence set USA                      Readable set of sequences               Required
(Parameter 1)
[-outfile]       Output file name                      Output file                             <sequence>.fseqbootall
(Parameter 2)

Additional (Optional) qualifiers:
Qualifier        Description                           Allowed values                          Default
-categories      File of input categories              Property value(s)
-mixfile         File of mixtures                      Property value(s)
-ancfile         File of ancestors                     Property value(s)
-weights         Weights file                          Property value(s)
-factorfile      Factors file                          Property value(s)
-datatype        Choose the datatype                   s (Molecular sequences)                 s
                                                       m (Discrete Morphology)
                                                       r (Restriction Sites)
                                                       g (Gene Frequencies)
-test            Choose test                           b (Bootstrap)                           b
                                                       j (Jackknife)
                                                       c (Permute species for each character)
                                                       o (Permute character order)
                                                       s (Permute within species)
                                                       r (Rewrite data)
-regular         Altered sampling fraction             Toggle value Yes/No                     No
-fracsample      Samples as percentage of sites        Number from 0.100 to 100.000            100.0
-rewriteformat   Output format                         p (PHYLIP)                              p
                                                       n (NEXUS)
                                                       x (XML)
-seqtype         Output format                         d (dna)                                 d
                                                       p (protein)
                                                       r (rna)
-morphseqtype    Output format                         p (PHYLIP)                              p
                                                       n (NEXUS)
-blocksize       Block size for bootstrapping          Integer 1 or more                       1
-reps            How many replicates                   Integer 1 or more                       100
-justweights     Write out datasets or just weights    d (Datasets)                            d
                                                       w (Weights)
-enzymes         Is the number of enzymes present in   Boolean value Yes/No                    No
                 input file
-all             All alleles present at each locus     Boolean value Yes/No                    No
-seed            Random number seed between 1 and      Integer from 1 to 32767                 1
                 32767 (must be odd)
-printdata       Print out the data at start of run    Boolean value Yes/No                    No
-[no]dotdiff     Use dot-differencing                  Boolean value Yes/No                    Yes
-[no]progress    Print indications of progress of run  Boolean value Yes/No                    Yes

Advanced (Unprompted) qualifiers: (none)
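When fseqbootall is driven from a script, the qualifiers listed above can be assembled into an argument list. The following Python sketch is an illustration only: the helper name build_fseqbootall_command is hypothetical, it uses only qualifiers documented on this page (-infile, -outfile, -test, -reps, -seed, -auto), and whether the program accepts every combination of them has not been checked here. The check on the seed follows the documented rule that it must be an odd integer between 1 and 32767.

```python
VALID_TESTS = {"b", "j", "c", "o", "s", "r"}  # codes of the -test menu

def build_fseqbootall_command(infile, outfile, test="b", reps=100, seed=1):
    """Return an argument list for fseqbootall (hypothetical helper)."""
    if test not in VALID_TESTS:
        raise ValueError("unknown -test value: %r" % test)
    if reps < 1:
        raise ValueError("-reps must be 1 or more")
    if not 1 <= seed <= 32767 or seed % 2 == 0:
        raise ValueError("-seed must be odd and between 1 and 32767")
    return [
        "fseqbootall",
        "-infile", infile,
        "-outfile", outfile,
        "-test", test,
        "-reps", str(reps),
        "-seed", str(seed),
        "-auto",  # general qualifier: turn off prompts
    ]

cmd = build_fseqbootall_command("seqboot.dat", "seqboot.fseqbootall", seed=3)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) on a machine where the program is installed. Building a list rather than a single string avoids shell quoting problems with file names.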

Input file format

fseqbootall reads any normal sequence USAs. The data files read by SEQBOOT are the standard ones for the various kinds of data. For molecular sequences the sequences may be either interleaved or sequential, and similarly for restriction sites. Restriction sites data may either have or not have the third argument, the number of restriction enzymes used. Discrete morphological characters are always assumed to be in sequential format. Gene frequencies data start with the number of species and the number of loci, followed by a line with the number of alleles at each locus. The data for each locus may either have one entry for each allele, or omit one allele at each locus. The details of the formats are given in the main documentation file, and in the documentation files for the groups of programs.

Input files for usage example

File: seqboot.dat

    5    6
Alpha     AACAAC
Beta      AACCCC
Gamma     ACCAAC
Delta     CCACCA
Epsilon   CCAAAC
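
The layout of this small file can be read with a short Python sketch. It is a simplified illustration, not a full PHYLIP reader: it assumes sequential format with each species on a single line and the name occupying the first 10 columns, as in the example above. The function name parse_simple_phylip is invented for the example.

```python
def parse_simple_phylip(text):
    """Parse a one-line-per-species PHYLIP data set into (name, sequence) pairs."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    nspecies, nsites = (int(x) for x in lines[0].split()[:2])
    records = []
    for line in lines[1:1 + nspecies]:
        name = line[:10].strip()          # names fill the first 10 columns
        seq = "".join(line[10:].split())  # drop any spaces inside the sequence
        if len(seq) != nsites:
            raise ValueError("%s: expected %d sites, found %d" % (name, nsites, len(seq)))
        records.append((name, seq))
    if len(records) != nspecies:
        raise ValueError("expected %d species, found %d" % (nspecies, len(records)))
    return records

SAMPLE = "\n".join([
    "    5    6",
    "Alpha     AACAAC",
    "Beta      AACCCC",
    "Gamma     ACCAAC",
    "Delta     CCACCA",
    "Epsilon   CCAAAC",
])

records = parse_simple_phylip(SAMPLE)
print(records)
```

Interleaved files, and sequences that continue over several lines, would need more handling than this sketch provides.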

Output file format

fseqbootall output will contain the data sets generated by the resampling process. Note that, when Gene Frequencies data is used or when Discrete Morphological characters with the Factors option are used, the number of characters in each data set may vary. It may also vary if there is an odd number of characters or sites and the Delete-Half-Jackknife resampling method is used, for then there will be a 50% chance of choosing (n+1)/2 characters and a 50% chance of choosing (n-1)/2 characters.
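
The rule for an odd number of sites under delete-half jackknifing can be sketched in Python as follows. This is an illustration of the rule as described above, not the program's code; the function name delete_half_jackknife is an assumption of the example. Sites are chosen without replacement, and for odd n the number kept is (n+1)/2 or (n-1)/2 with equal probability.

```python
import random

def delete_half_jackknife(nsites, rng):
    """Return the sorted indices of the sites kept in one jackknife replicate."""
    keep = nsites // 2  # n/2 for even n, (n-1)/2 for odd n
    if nsites % 2 == 1 and rng.random() < 0.5:
        keep += 1  # half of the time keep (n+1)/2 instead
    return sorted(rng.sample(range(nsites), keep))

kept = delete_half_jackknife(6, random.Random(3))
print(kept)
```

With six sites, as in the usage example, every replicate keeps exactly three sites; with seven sites a replicate would keep either three or four.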

The Factors option causes the characters to be resampled together. If (say) three adjacent characters all have the same factors characters, so that they all are understood to be recoding one multistate character, they will be resampled together as a group.
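
The grouping behaviour described for the Factors option can be sketched in Python. This is an illustration under assumptions: the factors are represented here as a plain string with one symbol per character, the function names factor_groups and bootstrap_groups are invented, and the number of groups drawn is taken to equal the number of groups. Adjacent characters sharing a symbol form one group, and whole groups are drawn with replacement, which is why the number of characters in a replicate can vary.

```python
import random
from itertools import groupby

def factor_groups(factors):
    """Split character indices into runs of adjacent identical factor symbols."""
    groups, start = [], 0
    for _, run in groupby(factors):
        length = len(list(run))
        groups.append(list(range(start, start + length)))
        start += length
    return groups

def bootstrap_groups(groups, rng):
    """Draw len(groups) groups with replacement; return the character indices."""
    picks = sorted(rng.randrange(len(groups)) for _ in groups)
    return [index for g in picks for index in groups[g]]

groups = factor_groups("AAABCC")  # hypothetical factors for six characters
sampled = bootstrap_groups(groups, random.Random(3))
print(groups)
print(sampled)
```

Here characters 0 to 2 always appear together, as do characters 4 and 5, so a replicate may contain more or fewer than six characters.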

The order of species in the data sets in the output file will vary randomly. This is a precaution to help the programs that analyze these data avoid any result which is sensitive to the input order of species from showing up repeatedly and thus appearing to have evidence in its favor.

The numerical options 1 and 2 in the original PHYLIP menu, which appear to correspond here to the -printdata and -[no]progress qualifiers, also affect the output. If option 1 is chosen (it is off by default), the program will print the original input data set to the output file before the resampled data sets. I cannot actually see why anyone would want to do this. Option 2 toggles the feature (on by default) that prints, up to 20 times during the resampling process, a notification that the program has completed a certain number of data sets. Thus if 100 resampled data sets are being produced, a line is printed every 5 data sets saying which data set has just been completed. This option should be turned off if the program is running in the background and silence is desirable. At the end of execution the program will always (whatever the setting of option 2) print a couple of lines saying that output has been written to the output file.

Output files for usage example

File: seqboot.fseqbootall

    5     6
Alpha     AAACCA
Beta      AAACCC
Gamma     ACCCCA
Delta     CCCAAC
Epsilon   CCCAAA
    5     6
Alpha     AAACAA
Beta      AAACCC
Gamma     ACCCAA
Delta     CCCACC
Epsilon   CCCAAA
    5     6
Alpha     AAAAAC
Beta      AAACCC
Gamma     AACAAC
Delta     CCCCCA
Epsilon   CCCAAC
    5     6
Alpha     CCCCCA
Beta      CCCCCC
Gamma     CCCCCA
Delta     AAAAAC
Epsilon   AAAAAA
    5     6
Alpha     AAAACC
Beta      AAACCC
Gamma     AACACC
Delta     CCCCAA
Epsilon   CCCACC
    5     6
Alpha     AAAACC
Beta      ACCCCC
Gamma     AAAACC
Delta     CCCCAA
Epsilon   CAAACC
    5     6
Alpha     AACCAA
Beta      AACCCC
Gamma     ACCCAA
Delta     CCAACC
Epsilon   CCAAAA
    5     6
Alpha     AAAACC
Beta      ACCCCC
Gamma     AAAACC
Delta     CCCCAA
Epsilon   CAAACC
    5     6
Alpha     AACACC


  [Part of this file has been deleted for brevity]

Gamma     ACAAAA
Delta     CCCCCC
Epsilon   CCAAAA
    5     6
Alpha     AACAAC
Beta      AACCCC
Gamma     AACAAC
Delta     CCACCA
Epsilon   CCAAAC
    5     6
Alpha     AACAAA
Beta      AACCCC
Gamma     CCCAAA
Delta     CCACCC
Epsilon   CCAAAA
    5     6
Alpha     ACAAAA
Beta      ACCCCC
Gamma     CCAAAA
Delta     CACCCC
Epsilon   CAAAAA
    5     6
Alpha     CAAAAA
Beta      CCCCCC
Gamma     CAAAAA
Delta     ACCCCC
Epsilon   AAAAAA
    5     6
Alpha     CAACCC
Beta      CCCCCC
Gamma     CAACCC
Delta     ACCAAA
Epsilon   AAACCC
    5     6
Alpha     ACAACC
Beta      ACCCCC
Gamma     ACAACC
Delta     CACCAA
Epsilon   CAAACC
    5     6
Alpha     AAAAAA
Beta      AAAAAC
Gamma     ACCCCA
Delta     CCCCCC
Epsilon   CCCCCA
    5     6
Alpha     AACAAC
Beta      AACCCC
Gamma     CCCAAC
Delta     CCACCA
Epsilon   CCAAAC
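
A downstream script may need to split such an output file back into its individual data sets. The following Python sketch shows one way to do that. It assumes the layout shown above (a header line with the number of species and sites, then one line per species with the name in the first 10 columns); the function name split_replicates is invented for the example, and the sample text is the last two data sets from the listing.

```python
def split_replicates(text):
    """Split concatenated PHYLIP data sets into lists of (name, sequence) pairs."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    replicates, pos = [], 0
    while pos < len(lines):
        nspecies, nsites = (int(x) for x in lines[pos].split()[:2])
        block = lines[pos + 1:pos + 1 + nspecies]
        records = [(ln[:10].strip(), ln[10:].strip()) for ln in block]
        if len(records) != nspecies or any(len(seq) != nsites for _, seq in records):
            raise ValueError("malformed data set starting at line %d" % (pos + 1))
        replicates.append(records)
        pos += 1 + nspecies
    return replicates

SAMPLE = "\n".join([
    "    5     6",
    "Alpha     AAAAAA",
    "Beta      AAAAAC",
    "Gamma     ACCCCA",
    "Delta     CCCCCC",
    "Epsilon   CCCCCA",
    "    5     6",
    "Alpha     AACAAC",
    "Beta      AACCCC",
    "Gamma     CCCAAC",
    "Delta     CCACCA",
    "Epsilon   CCAAAC",
])

replicates = split_replicates(SAMPLE)
print(len(replicates), replicates[1][2])
```

Because each header states how many species follow, the sketch does not rely on blank lines or other separators between data sets.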

Data files

None

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name  Description
ednacomp      DNA compatibility algorithm
ednadist      Nucleic acid sequence Distance Matrix program
ednainvar     Nucleic acid sequence Invariants method
ednaml        Phylogenies from nucleic acid Maximum Likelihood
ednamlk       Phylogenies from nucleic acid Maximum Likelihood with clock
ednapars      DNA parsimony algorithm
ednapenny     Penny algorithm for DNA
eprotdist     Protein distance algorithm
eprotpars     Protein parsimony algorithm
erestml       Restriction site Maximum Likelihood method
eseqboot      Bootstrapped sequences algorithm
fdiscboot     Bootstrapped discrete sites algorithm
fdnacomp      DNA compatibility algorithm
fdnadist      Nucleic acid sequence Distance Matrix program
fdnainvar     Nucleic acid sequence Invariants method
fdnaml        Estimates nucleotide phylogeny by maximum likelihood
fdnamlk       Estimates nucleotide phylogeny by maximum likelihood
fdnamove      Interactive DNA parsimony
fdnapars      DNA parsimony algorithm
fdnapenny     Penny algorithm for DNA
fdolmove      Interactive Dollo or Polymorphism Parsimony
ffreqboot     Bootstrapped genetic frequencies algorithm
fproml        Protein phylogeny by maximum likelihood
fpromlk       Protein phylogeny by maximum likelihood
fprotdist     Protein distance algorithm
fprotpars     Protein parsimony algorithm
frestboot     Bootstrapped restriction sites algorithm
frestdist     Distance matrix from restriction sites or fragments
frestml       Restriction site maximum Likelihood method
fseqboot      Bootstrapped sequences algorithm

Author(s)

This program is an EMBOSS conversion of a program written by Joe Felsenstein as part of his PHYLIP package.

Although we take every care to ensure that the results of the EMBOSS version are identical to those from the original package, we recommend that you check your inputs give the same results in both versions before publication.

Please report all bugs in the EMBOSS version to the EMBOSS bug team, not to the original author.

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None