Class | Bio::Fastq |
In: |
lib/bio/db/fastq.rb
|
Parent: | Object |
Bio::Fastq is a parser for FASTQ format.
FormatNames | = | { "fastq-sanger" => FormatData::FASTQ_SANGER, "fastq-solexa" => FormatData::FASTQ_SOLEXA, "fastq-illumina" => FormatData::FASTQ_ILLUMINA | Available format names. | |
Formats | = | { :fastq_sanger => FormatData::FASTQ_SANGER, :fastq_solexa => FormatData::FASTQ_SOLEXA, :fastq_illumina => FormatData::FASTQ_ILLUMINA | Available format name symbols. | |
DefaultFormatName | = | 'fastq-sanger'.freeze | Default format name | |
FLATFILE_SPLITTER | = | Bio::FlatFile::Splitter::LineOriented | Splitter for Bio::FlatFile |
Creates a new Fastq object from formatted text string.
The format of quality scores should be specified later by using format= method.
Arguments:
# File lib/bio/db/fastq.rb, line 383 383: def initialize(str = nil) 384: return unless str 385: sc = StringScanner.new(str) 386: while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/) 387: unless add_header_line(line) then 388: sc.unscan 389: break 390: end 391: end 392: while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/) 393: unless add_line(line) then 394: sc.unscan 395: break 396: end 397: end 398: @entry_overrun = sc.rest 399: end
Adds a header line if the header data is not yet given and the given line is suitable for header. Returns self if adding header line is succeeded. Otherwise, returns false (the line is not added).
# File lib/bio/db/fastq.rb, line 324 324: def add_header_line(line) 325: @header ||= "" 326: if line[0,1] == "@" then 327: false 328: else 329: @header.concat line 330: self 331: end 332: end
Adds a line to the entry if the given line is regarded as a part of the current entry.
# File lib/bio/db/fastq.rb, line 339 339: def add_line(line) 340: line = line.chomp 341: if !defined? @definition then 342: if line[0, 1] == "@" then 343: @definition = line[1..-1] 344: else 345: @definition = line 346: @parse_errors ||= [] 347: @parse_errors.push Error::No_atmark.new 348: end 349: return self 350: end 351: if defined? @definition2 then 352: @quality_string ||= '' 353: if line[0, 1] == "@" and 354: @quality_string.size >= @sequence_string.size then 355: return false 356: else 357: @quality_string.concat line 358: return self 359: end 360: else 361: @sequence_string ||= '' 362: if line[0, 1] == '+' then 363: @definition2 = line[1..-1] 364: else 365: @sequence_string.concat line 366: end 367: return self 368: end 369: raise "Bug: should not reach here!" 370: end
Identifier of the entry. Normally, the first word of the ID line.
# File lib/bio/db/fastq.rb, line 446 446: def entry_id 447: unless defined? @entry_id then 448: eid = @definition.strip.split(/\s+/)[0] || @definition 449: @entry_id = eid 450: end 451: @entry_id 452: end
Estimated probability of error for each base.
Returns: | (Array containing Float) error probability values |
# File lib/bio/db/fastq.rb, line 529 529: def error_probabilities 530: unless defined? @error_probabilities then 531: self.format ||= self.class::DefaultFormatName 532: a = @format.q2p(self.quality_scores) 533: @error_probabilities = a 534: end 535: @error_probabilities 536: end
Specify the format. If the format is not found, raises RuntimeError.
Available formats are:
"fastq-sanger" or :fastq_sanger "fastq-solexa" or :fastq_solexa "fastq-illumina" or :fastq_illumina
Arguments:
Returns: | (String) format name |
# File lib/bio/db/fastq.rb, line 476 476: def format=(name) 477: if name then 478: f = FormatNames[name] || Formats[name] 479: if f then 480: reset_state 481: @format = f.instance 482: self.format 483: else 484: raise "unknown format" 485: end 486: else 487: reset_state 488: nil 489: end 490: end
Masks low quality sequence regions. For each sequence position, if the quality score is smaller than the threshold, the sequence in the position is replaced with mask_char.
Note: This method does not care quality_score_type.
Arguments:
Returns: | Bio::Sequence object |
# File lib/bio/db/fastq.rb, line 668 668: def mask(threshold, mask_char = 'n') 669: to_biosequence.mask_with_quality_score(threshold, mask_char) 670: end
returns Bio::Sequence::NA
# File lib/bio/db/fastq.rb, line 425 425: def naseq 426: unless defined? @naseq then 427: @naseq = Bio::Sequence::NA.new(@sequence_string) 428: end 429: @naseq 430: end
The meaning of the quality scores. It may be one of :phred, :solexa, or nil.
# File lib/bio/db/fastq.rb, line 504 504: def quality_score_type 505: self.format ||= self.class::DefaultFormatName 506: @format.quality_score_type 507: end
Quality score for each base. For "fastq-sanger" or "fastq-illumina", it is PHRED score. For "fastq-solexa", it is Solexa score.
Returns: | (Array containing Integer) quality score values |
# File lib/bio/db/fastq.rb, line 515 515: def quality_scores 516: unless defined? @quality_scores then 517: self.format ||= self.class::DefaultFormatName 518: s = @format.str2scores(@quality_string) 519: @quality_scores = s 520: end 521: @quality_scores 522: end
returns Bio::Sequence::Generic
# File lib/bio/db/fastq.rb, line 438 438: def seq 439: unless defined? @seq then 440: @seq = Bio::Sequence::Generic.new(@sequence_string) 441: end 442: @seq 443: end
Returns sequence as a Bio::Sequence object.
Note: If you modify the returned Bio::Sequence object, the sequence or definition in this Fastq object might also be changed (but not always be changed) because of efficiency.
# File lib/bio/db/fastq.rb, line 653 653: def to_biosequence 654: Bio::Sequence.adapter(self, Bio::Sequence::Adapter::Fastq) 655: end
Returns Fastq formatted string constructed from instance variables. The string will always be consisted of four lines without wrapping of the sequence and quality string, and the third-line is always only contains "+". This may be different from initial entry.
Note that use of the method may be inefficient and may lose performance because new string object is created every time it is called. For showing an entry as-is, consider using Bio::FlatFile#entry_raw. For output with various options, use Bio::Sequence#output(:fastq).
# File lib/bio/db/fastq.rb, line 420 420: def to_s 421: "@#{@definition}\n#{@sequence_string}\n+\n#{@quality_string}\n" 422: end
Format validation.
If an array is given as the argument, when errors are found, error objects are pushed to the array. Currently, following errors may be added to the array. (All errors are under the Bio::Fastq namespace, for example, Bio::Fastq::Error::Diff_ids).
Error::Diff_ids — the identifier in the two lines are different Error::Long_qual — length of quality is longer than the sequence Error::Short_qual — length of quality is shorter than the sequence Error::No_qual — no quality characters found Error::No_seq — no sequence found Error::Qual_char — invalid character in the quality Error::Seq_char — invalid character in the sequence Error::Qual_range — quality score value out of range Error::No_ids — sequence identifier not found Error::No_atmark — the first identifier does not begin with "@" Error::Skipped_unformatted_lines — the parser skipped unformatted lines that could not be recognized as FASTQ format
Arguments:
Returns: | true:no error, false: containing error. |
# File lib/bio/db/fastq.rb, line 562 562: def validate_format(errors = nil) 563: err = [] 564: 565: # if header exists, the format might be broken. 566: if defined? @header and @header and !@header.strip.empty? then 567: err.push Error::Skipped_unformatted_lines.new 568: end 569: 570: # if parse errors exist, adding them 571: if defined? @parse_errors and @parse_errors then 572: err.concat @parse_errors 573: end 574: 575: # check if identifier exists, and identifier matches 576: if !defined?(@definition) or !@definition then 577: err.push Error::No_ids.new 578: elsif defined?(@definition2) and 579: !@definition2.to_s.empty? and 580: @definition != @definition2 then 581: err.push Error::Diff_ids.new 582: end 583: 584: # check if sequence exists 585: has_seq = true 586: if !defined?(@sequence_string) or !@sequence_string then 587: err.push Error::No_seq.new 588: has_seq = false 589: end 590: 591: # check if quality exists 592: has_qual = true 593: if !defined?(@quality_string) or !@quality_string then 594: err.push Error::No_qual.new 595: has_qual = false 596: end 597: 598: # sequence and quality length check 599: if has_seq and has_qual then 600: slen = @sequence_string.length 601: qlen = @quality_string.length 602: if slen > qlen then 603: err.push Error::Short_qual.new 604: elsif qlen > slen then 605: err.push Error::Long_qual.new 606: end 607: end 608: 609: # sequence character check 610: if has_seq then 611: sc = StringScanner.new(@sequence_string) 612: while sc.scan_until(/[ \x00-\x1f\x7f-\xff]/n) 613: err.push Error::Seq_char.new(sc.pos - sc.matched_size) 614: end 615: end 616: 617: # sequence character check 618: if has_qual then 619: fmt = if defined?(@format) and @format then 620: @format.name 621: else 622: nil 623: end 624: re = case fmt 625: when 'fastq-sanger' 626: /[^\x21-\x7e]/n 627: when 'fastq-solexa' 628: /[^\x3b-\x7e]/n 629: when 'fastq-illumina' 630: /[^\x40-\x7e]/n 631: else 632: /[ \x00-\x1f\x7f-\xff]/n 633: end 634: sc = StringScanner.new(@quality_string) 635: while sc.scan_until(re) 636: err.push Error::Qual_char.new(sc.pos - sc.matched_size) 637: end 638: end 639: 640: # if "errors" is given, set errors 641: errors.concat err if errors 642: # returns true if no error; otherwise, returns false 643: err.empty? ? true : false 644: end