Class Bio::FastaFormat
In: lib/bio/db/fasta.rb
Parent: DB

Handles a FASTA-formatted entry, such as:

  >id and/or some comments                    <== comment line
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC        <== sequence lines
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC
  ATGCATGCATGC

The leading ’>’ can be omitted, and a trailing ’>’ is removed automatically.

Examples

  f_str = <<END_OF_STRING
  >sce:YBR160W  CDC28, SRM5; cyclin-dependent protein kinase catalytic subunit [EC:2.7.1.-] [SP:CC28_YEAST]
  MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEG
  VPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYME
  GIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNL
  KLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGC
  IFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFP
  QWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES
  >sce:YBR274W  CHK1; probable serine/threonine-protein kinase [EC:2.7.1.-] [SP:KB9S_YEAST]
  MSLSQVSPLPHIKDVVLGDTVGQGAFACVKNAHLQMDPSIILAVKFIHVP
  TCKKMGLSDKDITKEVVLQSKCSKHPNVLRLIDCNVSKEYMWIILEMADG
  GDLFDKIEPDVGVDSDVAQFYFQQLVSAINYLHVECGVAHRDIKPENILL
  DKNGNLKLADFGLASQFRRKDGTLRVSMDQRGSPPYMAPEVLYSEEGYYA
  DRTDIWSIGILLFVLLTGQTPWELPSLENEDFVFFIENDGNLNWGPWSKI
  EFTHLNLLRKILQPDPNKRVTLKALKLHPWVLRRASFSGDDGLCNDPELL
  AKKLFSHLKVSLSNENYLKFTQDTNSNNRYISTQPIGNELAELEHDSMHF
  QTVSNTQRAFTSYDSNTNYNSGTGMTQEAKWTQFISYDIAALQFHSDEND
  CNELVKRHLQFNPNKLTKFYTLQPMDVLLPILEKALNLSQIRVKPDLFAN
  FERLCELLGYDNVFPLIINIKTKSNGGYQLCGSISIIKIEEELKSVGFER
  KTGDPLEWRRLFKKISTICRDIILIPN
  END_OF_STRING

  f = Bio::FastaFormat.new(f_str)
  puts "### FastaFormat"
  puts "# entry"
  puts f.entry
  puts "# entry_id"
  p f.entry_id
  puts "# definition"
  p f.definition
  puts "# data"
  p f.data
  puts "# seq"
  p f.seq
  puts "# seq.class"
  p f.seq.class
  puts "# length"
  p f.length
  puts "# aaseq"
  p f.aaseq
  puts "# aaseq.class"
  p f.aaseq.class
  puts "# aaseq.composition"
  p f.aaseq.composition
  puts "# aalen"
  p f.aalen

Methods

aalen   aaseq   acc_version   accession   accessions   blast   comment   entry   entry_id   fasta   gi   identifiers   length   locus   nalen   naseq   new   query   seq   to_biosequence   to_s   to_seq  

Constants

DELIMITER = RS = "\n>"   Entry delimiter in flatfile text.
DELIMITER_OVERRUN = 1   (Integer) excess read size included in DELIMITER.
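
As a rough illustration of where a one-character overrun can come from, the sketch below splits a two-entry string with Ruby's standard StringIO#gets and the same "\n>" separator. It is only an assumption about how a reader might use DELIMITER, not a description of how Bio::FlatFile is implemented; read_chunks and the sample text are hypothetical.

```ruby
require 'stringio'

# Hypothetical helper: split text into chunks with the same "\n>" separator
# as Bio::FastaFormat::DELIMITER (repeated here so no gem is needed).
def read_chunks(text, delimiter = "\n>")
  io = StringIO.new(text)
  chunks = []
  # gets(sep) returns everything up to and including the separator.
  while (chunk = io.gets(delimiter))
    chunks << chunk
  end
  chunks
end

chunks = read_chunks(">id1 first\nATGC\n>id2 second\nGGCC\n")
chunks.each { |c| p c }
```

The first chunk ends with the ’>’ that belongs to the next entry (the one-character overrun), and the second chunk starts without one. This is consistent with the class accepting an entry whose leading ’>’ is missing and dropping a trailing ’>’.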

Attributes

data  [RW]  The sequence lines as text.
definition  [RW]  The comment line of the FASTA formatted data.
entry_overrun  [R] 

Public Class methods

new(str)

Stores the comment and sequence information from one entry of a FASTA-formatted string. If the argument contains more than one entry, only the first entry is used.

[Source]

     # File lib/bio/db/fasta.rb, line 119
     def initialize(str)
       @definition = str[/.*/].sub(/^>/, '').strip       # 1st line
       @data = str.sub(/.*/, '')                         # the rest
       @data.sub!(/^>.*/m, '')   # remove trailing entries for sure
       @entry_overrun = $&
     end
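
The parsing steps above can be tried outside the library with a small stand-alone sketch. split_first_entry is a hypothetical helper that mirrors the regular expressions in the source and returns the three values instead of setting instance variables; it is not part of BioRuby.

```ruby
# Hypothetical stand-alone version of the parsing steps in initialize.
# Returns [definition, data, overrun] instead of setting instance variables.
def split_first_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip  # first line, without '>'
  data = str.sub(/.*/, '')                    # everything after the first line
  overrun = data[/^>.*/m]                     # next entry onward, or nil
  data = data.sub(/^>.*/m, '')                # keep only the first entry's lines
  [definition, data, overrun]
end

definition, data, overrun =
  split_first_entry(">id1 first\nATGC\nATGC\n>id2 second\nGGCC\n")
p definition
p data
p overrun
```

With a two-entry input, everything from the second ’>’ onward ends up in the overrun value, so only the first entry's sequence lines remain in data.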

Public Instance methods

aalen()

Returns the length of the Bio::Sequence::AA sequence.

[Source]

     # File lib/bio/db/fasta.rb, line 209
     def aalen
       self.aaseq.length
     end

aaseq()

Returns the sequence as a Bio::Sequence::AA object.

[Source]

     # File lib/bio/db/fasta.rb, line 204
     def aaseq
       Sequence::AA.new(seq)
     end

acc_version()

Returns the accession number with version.

[Source]

     # File lib/bio/db/fasta.rb, line 265
     def acc_version
       identifiers.acc_version
     end

accession()

Returns an accession number.

[Source]

     # File lib/bio/db/fasta.rb, line 253
     def accession
       identifiers.accession
     end

accessions()

Parses the FASTA defline (using the identifiers method) and returns the accession numbers as an array of strings.

[Source]

     # File lib/bio/db/fasta.rb, line 260
     def accessions
       identifiers.accessions
     end

blast(factory)

Alias for query

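comment()

Returns the comments found in the sequence data (lines beginning with "#") as a Hash keyed by the sequence offset at which each comment occurs. Returns nil if the data does not begin with a comment line.

[Source]

     # File lib/bio/db/fasta.rb, line 183
     def comment
       seq
       @comment
     end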

entry()

Returns the stored entry as FASTA-formatted text (same as to_s).

[Source]

     # File lib/bio/db/fasta.rb, line 127
     def entry
       @entry = ">#{@definition}\n#{@data.strip}\n"
     end

entry_id()

Parses the FASTA defline (using the identifiers method) and returns a possibly unique identifier as a string.

[Source]

     # File lib/bio/db/fasta.rb, line 239
     def entry_id
       identifiers.entry_id
     end

fasta(factory)

Alias for query

gi()

Parses the FASTA defline (using the identifiers method) and returns the GI, locus, accession, or accession with version number. If an entry has more than one such ID, only the first is returned. Returns a string or nil.

[Source]

     # File lib/bio/db/fasta.rb, line 248
     def gi
       identifiers.gi
     end

identifiers()

Parses the FASTA defline and extracts the IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or ":"-separated IDs. Returns a Bio::FastaDefline instance.

[Source]

     # File lib/bio/db/fasta.rb, line 229
     def identifiers
       unless defined?(@ids) then
         @ids = FastaDefline.new(@definition)
       end
       @ids
     end

length()

Returns the sequence length.

[Source]

     # File lib/bio/db/fasta.rb, line 189
     def length
       seq.length
     end

locus()

Returns the locus.

[Source]

     # File lib/bio/db/fasta.rb, line 270
     def locus
       identifiers.locus
     end

nalen()

Returns the length of the Bio::Sequence::NA sequence.

[Source]

     # File lib/bio/db/fasta.rb, line 199
     def nalen
       self.naseq.length
     end

naseq()

Returns the sequence as a Bio::Sequence::NA object.

[Source]

     # File lib/bio/db/fasta.rb, line 194
     def naseq
       Sequence::NA.new(seq)
     end

query(factory)

Executes a FASTA/BLAST search using a Bio::Fasta or Bio::Blast factory object.

  #!/usr/bin/env ruby
  require 'bio'

  factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
  flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
  flatfile.each do |entry|
    p entry.definition
    result = entry.fasta(factory)
    result.each do |hit|
      print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
      p hit.lap_at
    end
  end

[Source]

     # File lib/bio/db/fasta.rb, line 150
     def query(factory)
       factory.query(entry)
     end
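
Because query only calls factory.query with the FASTA text, any object that responds to query can stand in for a real factory, which is handy for trying the delegation without a local FASTA or BLAST installation. The sketch below assumes nothing from the bio gem: SketchEntry and RecordingFactory are hypothetical stand-ins written for this illustration.

```ruby
# Hypothetical stand-in for a FASTA entry: only the methods the sketch needs.
class SketchEntry
  def initialize(definition, data)
    @definition = definition
    @data = data
  end

  # Same formatting expression as the entry method of this class.
  def entry
    ">#{@definition}\n#{@data.strip}\n"
  end

  # Hand the FASTA text to whatever factory object is supplied.
  def query(factory)
    factory.query(entry)
  end
  alias fasta query
  alias blast query
end

# Hypothetical factory that records what it was asked to search for.
class RecordingFactory
  attr_reader :received

  def query(text)
    @received = text
    :fake_report # a real factory would return a report object
  end
end

factory = RecordingFactory.new
result = SketchEntry.new('id1 demo', "\nATGC\nATGC\n").fasta(factory)
p result
p factory.received
```

The factory receives the whole entry, comment line included, so the value returned by query is whatever the factory chooses to return.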

seq()

Returns the sequence lines joined into a single String (a Bio::Sequence::Generic object), with whitespace and digits removed.

[Source]

     # File lib/bio/db/fasta.rb, line 157
     def seq
       unless defined?(@seq)
         unless /\A\s*^\#/ =~ @data then
           @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
         else
           a = @data.split(/(^\#.*$)/)
           i = 0
           cmnt = {}
           s = []
           a.each do |x|
             if /^# ?(.*)$/ =~ x then
               cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
             else
               x.tr!(" \t\r\n0-9", '') # lazy clean up
               i += x.length
               s << x
             end
           end
           @comment = cmnt
           @seq = Bio::Sequence::Generic.new(s.join(''))
         end
       end
       @seq
     end
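
To see what the clean-up and comment handling above produce, here is a stand-alone sketch that follows the same steps on plain strings. clean_fasta_data is a hypothetical helper, not a BioRuby method, and it returns plain String and Hash values rather than a Bio::Sequence::Generic.

```ruby
# Hypothetical stand-alone version of the clean-up in seq.
# Returns [sequence, comments]; comments maps a sequence offset to comment
# text, or is nil when the data does not start with a '#' line.
def clean_fasta_data(data)
  strip_chars = " \t\r\n0-9"
  # Fast path: no leading comment line, so just drop whitespace and digits.
  return [data.tr(strip_chars, ''), nil] unless /\A\s*^\#/ =~ data

  comments = {}
  offset = 0
  pieces = []
  # The capture group keeps the '#' lines in the split result.
  data.split(/(^\#.*$)/).each do |part|
    if /^# ?(.*)$/ =~ part
      text = Regexp.last_match(1)
      # Comments at the same offset are joined with a newline.
      comments[offset] = comments[offset] ? "#{comments[offset]}\n#{text}" : text
    else
      cleaned = part.tr(strip_chars, '')
      offset += cleaned.length
      pieces << cleaned
    end
  end
  [pieces.join, comments]
end

p clean_fasta_data("ATGC 10\nAT GC\n")
p clean_fasta_data("\n# note A\nATGC 10\nATGC\n# note B\nGG\n")
```

Comment lines are only looked for when the data begins with one; otherwise the whole text goes through the single tr call.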

to_biosequence()

Returns the sequence as a Bio::Sequence object.

Note: For efficiency, modifying the returned Bio::Sequence object might also change the sequence or definition in this FastaFormat object (though not always).

[Source]

     # File lib/bio/db/fasta.rb, line 220
     def to_biosequence
       Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
     end

to_s()

Alias for entry

to_seq()

Alias for to_biosequence
