EMBOSS: diffseq
diffseq should be of value when looking for SNPs, differences between strains of an organism and anything else that requires the differences between sequences to be highlighted.
The sequences can be very long. The program does a match of all sequence words of size 10 (by default). It then reduces this to the minimum set of overlapping matches by sorting the matches in order of size (largest size first) and then for each such match it removes any smaller matches that overlap. The result is a set of the longest ungapped alignments between the two sequences that do not overlap with each other. The mismatched regions between these matches are reported.
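The steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the EMBOSS implementation: the function names are hypothetical, "overlap" is assumed to mean sharing positions in either sequence, and the kept matches are assumed to occur in the same order in both sequences.

```python
from collections import defaultdict


def ungapped_matches(a, b, wordsize=10):
    """Find maximal exact (ungapped) matches seeded by shared words.

    Returns (a_start, b_start, length) tuples with 0-based starts.
    """
    index = defaultdict(list)  # word -> start positions in b
    for j in range(len(b) - wordsize + 1):
        index[b[j:j + wordsize]].append(j)

    matches = []
    for i in range(len(a) - wordsize + 1):
        for j in index.get(a[i:i + wordsize], ()):
            # A seed preceded by a matching base belongs to a run that was
            # already recorded from its leftmost word, so skip it.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = wordsize
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches


def _disjoint(start1, len1, start2, len2):
    """True if the two half-open intervals share no position."""
    return start1 + len1 <= start2 or start2 + len2 <= start1


def reduce_overlaps(matches):
    """Visit matches largest first; drop any that overlaps a kept match.

    Assumption: an overlap in either sequence counts.
    """
    kept = []
    for i, j, n in sorted(matches, key=lambda m: -m[2]):
        if all(_disjoint(i, n, ki, kn) and _disjoint(j, n, kj, kn)
               for ki, kj, kn in kept):
            kept.append((i, j, n))
    return sorted(kept)


def mismatched_regions(a, b, kept):
    """Return the (a_gap, b_gap) substrings between consecutive kept matches.

    Assumption: kept matches appear in the same order in both sequences.
    """
    return [(a[i1 + n1:i2], b[j1 + n1:j2])
            for (i1, j1, n1), (i2, j2, _) in zip(kept, kept[1:])]


# Two toy sequences that differ by one base between two shared blocks.
left, right = "ACGGTCATTG", "CCTAGAAGTC"
seq_a = left + "A" + right
seq_b = left + "G" + right
kept = reduce_overlaps(ungapped_matches(seq_a, seq_b, wordsize=5))
print(mismatched_regions(seq_a, seq_b, kept))  # [('A', 'G')]
```

The toy word size of 5 is used only to keep the example short; the program's default is 10.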
It should be possible to find differences between sequences that are megabytes long.
    % diffseq embl:ap000504 embl:af129756
    Find differences (SNPs) between nearly identical sequences
    Word size [10]:
    Output file [ap000504.diffseq]:
    Mandatory qualifiers:
      [-asequence]   sequence   Sequence USA
      [-bsequence]   sequence   Sequence USA
       -wordsize     integer    Word size
       -outfile      report     Output report file

    Optional qualifiers:
       -afeatout     featout    File for output of first sequence's normal tab delimited gff's
       -bfeatout     featout    File for output of second sequence's normal tab delimited gff's
       -columns      bool       The default format for the output report file is to have several lines per difference giving the sequence positions, sequences and features. If this option is set true then the output report file's format is changed to a set of columns and no feature information is given.

    Advanced qualifiers: (none)

    General qualifiers:
       -help         bool       report command line options. More information on associated and general qualifiers can be found with -help -verbose
Mandatory qualifiers | Description | Allowed values | Default |
---|---|---|---|
[-asequence] (Parameter 1) | Sequence USA | Readable sequence | Required |
[-bsequence] (Parameter 2) | Sequence USA | Readable sequence | Required |
-wordsize | Word size | Integer 2 or more | 10 |
-outfile | Output report file | Report file | |
Optional qualifiers | Description | Allowed values | Default |
-afeatout | File for output of first sequence's normal tab delimited gff's | Writeable feature table | $(asequence.name).diffgff |
-bfeatout | File for output of second sequence's normal tab delimited gff's | Writeable feature table | $(bsequence.name).diffgff |
-columns | The default format for the output report file is to have several lines per difference giving the sequence positions, sequences and features. If this option is set true then the output report file's format is changed to a set of columns and no feature information is given. | Yes/No | No |
Advanced qualifiers | Description | Allowed values | Default |
(none) | | | |
An example follows:
    # Report of diffseq of: AF129756 and AP000504

    AF129756 overlap starts at 6036
    AP000504 overlap starts at 1

    AF129756 6882-6882 Length: 1
    Sequence: t
    Sequence: a
    AP000504 847-847 Length: 1

    AF129756 7830-7830 Length: 1
    Sequence: a
    Sequence: g
    AP000504 1795-1795 Length: 1

    AF129756 8307 Length: 0
    Feature: repeat_region 7920-8351 rpt_family="MSTB"
    Sequence:
    Sequence: t
    AP000504 2273-2273 Length: 1

    AF129756 8500-8500 Length: 1
    Feature: repeat_region 8391-8686 rpt_family="AluSg"
    Sequence: a
    Sequence: g
    AP000504 2466-2466 Length: 1

    AF129756 8688 Length: 0
    Feature: repeat_region 8687-8731 rpt_family="(CA)n"
    Sequence:
    Sequence: tgtg
    AP000504 2655-2658 Length: 4

    AF129756 10945-10962 Length: 18
    Feature: repeat_region 10910-10972 rpt_family="(CA)n"
    Sequence: gtgtgtgtgtgtgtgtgt
    Sequence:
    AP000504 4914 Length: 0

    AF129756 10999-11001 Length: 3
    Feature: repeat_region 10991-11020 rpt_family="AT_rich"
    Sequence: tat
    Sequence: aaa
    AP000504 4951-4953 Length: 3

    AF129756 12647 Length: 0
    Feature: repeat_region 12628-12930 rpt_family="AluSq"
    Sequence:
    Sequence: t
    AP000504 6600-6600 Length: 1

    AF129756 12915-12915 Length: 1
    Feature: repeat_region 12628-12930 rpt_family="AluSq"
    Sequence: a
    Sequence: g
    AP000504 6868-6868 Length: 1

    AF129756 14264 Length: 0
    Sequence:
    Sequence: tgtg
    AP000504 8218-8221 Length: 4

    AF129756 15139-15139 Length: 1
    Sequence: a
    Sequence: g
    AP000504 9096-9096 Length: 1

    AF129756 17192-17192 Length: 1
    Feature: repeat_region 16933-17231 rpt_family="AluSx"
    Sequence: a
    Sequence: c
    AP000504 11149-11149 Length: 1

    [many lines removed for brevity]

    AF129756 overlap ends at 106028
    AP000504 overlap ends at 100000

    No. of SNPs = 86
    No. of transitions = 58
    No. of transversions = 28
The first line is the title giving the names of the sequences used.
The next two non-blank lines state the positions in each sequence where the detected overlap between them starts.
There then follows a set of reports of the mismatches between the sequences.
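Each report consists of 4 or more lines. As the example shows, the first sequence comes first: a line giving its name, the position of the mismatch and its length, then any 'Feature:' lines for features overlapping the mismatch, then a 'Sequence:' line giving the mismatched bases. A mismatch of length 0 is shown with a single position rather than a range.
This is followed by the equivalent information for the second sequence, but in the reverse order, namely the 'Sequence:' line, any 'Feature:' lines and the line giving the position of the mismatch in the second sequence.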
At the end of the report are two non-blank lines giving the positions in each sequence where the detected overlap between them ends.
The last three lines of the report give the count of SNPs (defined as a change of one nucleotide to one other nucleotide; deletions, insertions and multi-base changes are not counted).
The counts of transitions (pyrimidine to pyrimidine or purine to purine) and transversions (pyrimidine to purine or purine to pyrimidine) are also given.
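The SNP, transition and transversion definitions above can be illustrated with a short Python sketch. The function name `classify_snp` is hypothetical and this is not how diffseq itself is implemented; the sketch simply applies the definitions to pairs of mismatched sequences taken from the example report.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}
BASES = PURINES | PYRIMIDINES


def classify_snp(seq_a, seq_b):
    """Classify one mismatch as 'transition', 'transversion' or None.

    None is returned for anything that is not a one-base to one-base
    change (insertions, deletions, multi-base changes).
    """
    x, y = seq_a.lower(), seq_b.lower()
    if len(x) != 1 or len(y) != 1 or x == y:
        return None
    if x not in BASES or y not in BASES:
        return None
    # Same chemical class on both sides means a transition.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


# Mismatch pairs copied from the example report above.
pairs = [("t", "a"), ("a", "g"), ("", "t"), ("tat", "aaa"), ("a", "c")]
kinds = [classify_snp(a, b) for a, b in pairs]
snps = [k for k in kinds if k is not None]
print(len(snps), snps.count("transition"), snps.count("transversion"))
```

Of the five pairs, only three are single-base substitutions, so the insertion and the three-base change are left out of the SNP count.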
It should be noted that not all features are reported.
The 'source' feature found in all EMBL/Genbank feature table entries is not reported as this covers all of the sequence and so overlaps with any difference found in that sequence and so is uninformative and irritating. It has therefore been removed from the output report.
The translation information of CDS features is often extremely long and does not add useful information to the report. It has therefore been removed from the output report.
If no regions of alignment are found, the following output is given:
    # Report of diffseq of: Fred and ECOMPA

    No regions of alignment found.
If the -columns qualifier is given then the output is given in a columnar format.
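The columns are separated by one or more spaces or TAB characters, in the order: start, end, length and sequence of the mismatch in the first sequence, followed by the same four columns for the second sequence.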
Note: no features are reported in this format in order to keep the columns lining up nicely.
For example:
    # Report of diffseq of: AF129756 and AP000504
    # AF129756 overlap starts at 6036
    # AP000504 overlap starts at 1
    # (AF129756) start end length sequence (AP000504) start end length sequence
    6882 6882 1 't' 847 847 1 'a'
    7830 7830 1 'a' 1795 1795 1 'g'
    8307 8307 0 '' 2273 2273 1 't'
    8500 8500 1 'a' 2466 2466 1 'g'
    8688 8688 0 '' 2655 2658 4 'tgtg'
    10945 10962 18 'gtgtgtgtgtgtgtgtgt' 4914 4914 0 ''
    10999 11001 3 'tat' 4951 4953 3 'aaa'
    12647 12647 0 '' 6600 6600 1 't'
    12915 12915 1 'a' 6868 6868 1 'g'
    14264 14264 0 '' 8218 8221 4 'tgtg'
    15139 15139 1 'a' 9096 9096 1 'g'
    17192 17192 1 'a' 11149 11149 1 'c'
    19761 19761 1 'c' 13718 13718 1 'a'
    20291 20291 1 't' 14248 14248 1 'c'
    20462 20462 1 'c' 14419 14419 1 'g'
    25686 25686 1 'c' 19643 19643 1 't'
    26192 26192 1 't' 20149 20149 1 'c'
    27227 27227 1 't' 21183 21183 0 ''
    27359 27359 0 '' 21316 21319 4 'ctga'
    27837 27837 1 't' 21797 21797 1 'c'
    29328 29328 1 'a' 23288 23288 1 't'
    29458 29458 1 'c' 23418 23418 1 'a'
    29629 29629 1 'c' 23589 23589 1 't'
    29646 29646 1 'a' 23606 23606 1 'g'
    30838 30838 1 't' 24798 24798 1 'c'
    31349 31349 1 't' 25309 25309 1 'c'
    31901 31901 1 't' 25861 25861 1 'g'
    34078 34078 0 '' 28039 28040 2 'ac'
    36682 36682 1 't' 30644 30644 1 'c'
    38225 38226 2 'gt' 32186 32186 0 ''
    38379 38379 1 'g' 32339 32339 1 'c'
    38537 38537 1 'c' 32497 32497 1 't'
    39114 39114 1 'c' 33074 33074 1 't'
    39816 39816 1 'a' 33776 33776 1 'g'
    40807 40807 1 'c' 34767 34767 1 'a'
    40977 40977 1 'a' 34936 34936 0 ''
    41204 41204 1 'g' 35163 35163 1 'a'
    42548 42548 1 'a' 36507 36507 1 'g'
    43800 43800 0 '' 37760 37762 3 'aaa'
    44717 44717 0 '' 38680 38683 4 'aata'
    45315 45315 1 'a' 39280 39280 0 ''
    48382 48382 1 'g' 42347 42347 1 'a'
    48671 48671 0 '' 42637 42638 2 'gt'
    50635 50635 1 'c' 44602 44602 1 't'
    50809 50809 1 't' 44776 44776 1 'g'
    51286 51286 1 'a' 45253 45253 1 'g'
    51645 51645 1 't' 45611 45611 0 ''
    52388 52388 1 't' 46354 46354 1 'c'
    52646 52646 1 'g' 46612 46612 1 'a'
    53596 53596 1 'g' 47562 47562 1 'a'
    53621 53621 1 'c' 47587 47587 1 't'
    54883 54883 1 'a' 48849 48849 1 'g'
    55377 55377 1 'a' 49343 49343 1 'g'
    55571 55571 1 't' 49537 49537 1 'c'
    55611 55611 1 'c' 49577 49577 1 't'
    55655 55661 7 'aaaaaaa' 49620 49620 0 ''
    56357 56357 1 'g' 50316 50316 1 'a'
    58115 58115 1 'c' 52074 52074 1 'a'
    59922 59922 1 't' 53881 53881 1 'c'
    60092 60092 1 'c' 54051 54051 1 'g'
    63114 63114 1 't' 57073 57073 1 'a'
    64267 64271 5 'gtttt' 58225 58225 0 ''
    64731 64731 1 't' 58685 58685 1 'c'
    66604 66604 1 'c' 60558 60558 1 't'
    67254 67254 0 '' 61209 61209 1 'c'
    69002 69002 0 '' 62958 62959 2 'tt'
    69445 69445 1 'a' 63402 63402 1 'g'
    [many lines deleted for brevity]
    # AF129756 overlap ends at 106028
    # AP000504 overlap ends at 100000
    # No. of SNPs = 86
    # No. of transitions = 58
    # No. of transversions = 28
If no regions of alignment are found, the following output is given:
    # Report of diffseq of: Fred and ECOMPA
    # No regions of alignment found.
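Because the columnar format is whitespace-separated, it is straightforward to read back into a script. The following Python sketch is one possible way to do so. It assumes the layout shown in the example above (eight columns, sequences wrapped in single quotes, comment lines starting with `#`); the names `Diff` and `parse_columns` are hypothetical and not part of EMBOSS.

```python
from typing import NamedTuple


class Diff(NamedTuple):
    a_start: int
    a_end: int
    a_length: int
    a_seq: str
    b_start: int
    b_end: int
    b_length: int
    b_seq: str


def parse_columns(text):
    """Parse diffseq -columns output into a list of Diff records.

    Comment lines ('#') and lines without exactly eight fields are skipped.
    """
    diffs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8 or fields[0].startswith("#"):
            continue
        a_start, a_end, a_len = (int(f) for f in fields[0:3])
        b_start, b_end, b_len = (int(f) for f in fields[4:7])
        diffs.append(Diff(a_start, a_end, a_len, fields[3].strip("'"),
                          b_start, b_end, b_len, fields[7].strip("'")))
    return diffs


# A few rows copied from the example report above.
sample = """\
# Report of diffseq of: AF129756 and AP000504
# AF129756 overlap starts at 6036
# AP000504 overlap starts at 1
6882 6882 1 't' 847 847 1 'a'
8307 8307 0 '' 2273 2273 1 't'
10945 10962 18 'gtgtgtgtgtgtgtgtgt' 4914 4914 0 ''
# No regions of alignment found.
"""

diffs = parse_columns(sample)
# Single-base substitutions have length 1 on both sides.
single_base = [d for d in diffs if d.a_length == 1 and d.b_length == 1]
print(len(diffs), len(single_base))
```

An empty sequence (`''`) comes back as an empty string, which makes insertions and deletions easy to tell apart from substitutions.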
If you run out of memory, use a larger word size.
Using a larger word size increases the distance within which neighbouring mismatches are reported as one event. Thus a word size of 50 will report two SNPs that are within 50 bases of each other as one mismatch.
A graphical dotplot of the matches used in this program can be displayed using the program dotpath.