Class Bio::KEGG::Taxonomy
In: lib/bio/db/kegg/taxonomy.rb
Parent: Object

Description

Parses the KEGG ‘taxonomy’ file, which describes the taxonomic classification of organisms.

References

The KEGG ‘taxonomy’ file is available at

Methods

Attributes

leaves  [R] 
path  [R] 
root  [RW] 
tree  [R] 

Public Class methods

     # File lib/bio/db/kegg/taxonomy.rb, line 26
       def initialize(filename, orgs = [])
         # Stores the taxonomic tree as a linked list (implemented in Hash), so
         # every node needs to have a unique name (key) to work correctly
         @tree = Hash.new

         # Also stores the taxonomic tree as a list of arrays (full path)
         @path = Array.new

         # Also stores all leaf nodes (organism codes) of every intermediate node
         @leaves = Hash.new

         # tentative name for the root node (use accessor to change)
         @root = 'Genes'

         hier = Array.new
         level = 0
         label = nil

         # File.foreach closes the file when the iteration finishes
         File.foreach(filename) do |line|
           next if line.strip.empty?

           # line for taxonomic hierarchy (indent according to the number of # marks)
           if line[/^#/]
             level = line[/^#+/].length
             # [A-Za-z] rather than [A-z], which also matches [ \ ] ^ _ and `
             label = line[/[A-Za-z].*/]
             hier[level] = sanitize(label)

           # line for organism name (unify different strains of a species)
           else
             tax, org, name, desc = line.chomp.split("\t")
             if orgs.nil? || orgs.empty? || orgs.include?(org)
               species, strain, = name.split('_')
               # (0) Grouping of the strains of the same species.
               #  If the name of species is the same as the previous line,
               #  add the species to the same species group.
               #   ex. Gamma/enterobacteria has a large number of organisms,
               #       so sub grouping of strains is needed for E.coli strains etc.
               #
               # However, if the species name is already used, we need to avoid
               # a collision of species names, as the current implementation stores
               # the tree as a Hash, which may cause an infinite loop.
               #
               # (1) If species name == the intermediate node of other lineage
               #  Add '_sp' to the species name to avoid the conflict (1-1), and if
               #  'species_sp' is already taken, use 'species_strain' instead (1-2).
               #   ex. Bacteria/Proteobacteria/Beta/T.denitrificans/tbd
               #       Bacteria/Proteobacteria/Epsilon/T.denitrificans_ATCC33889/tdn
               #    -> Bacteria/Proteobacteria/Beta/T.denitrificans/tbd
               #       Bacteria/Proteobacteria/Epsilon/T.denitrificans_sp/tdn
               #
               # (2) If species name == the intermediate node of the same lineage
               #  Add '_sp' to the species name to avoid the conflict.
               #   ex. Bacteria/Cyanobacteria/Cyanobacteria_CYA/cya
               #       Bacteria/Cyanobacteria/Cyanobacteria_CYB/cya
               #       Bacteria/Proteobacteria/Magnetococcus/Magnetococcus_MC1/mgm
               #    -> Bacteria/Cyanobacteria/Cyanobacteria_sp/cya
               #       Bacteria/Cyanobacteria/Cyanobacteria_sp/cya
               #       Bacteria/Proteobacteria/Magnetococcus/Magnetococcus_sp/mgm
               sp_group = "#{species}_sp"
               if @tree[species]
                 if hier[level+1] == species
                   # case (0)
                 else
                   # case (1-1)
                   species = sp_group
                   # case (1-2)
                   if @tree[sp_group] && hier[level+1] != species
                     species = name
                   end
                 end
               else
                 if hier[level] == species
                   # case (2)
                   species = sp_group
                 end
               end
               # 'hier' is an array of the taxonomic tree + species and strain name.
               #  ex. [nil, Eukaryotes, Fungi, Ascomycetes, Saccharomycetes] +
               #      [S_cerevisiae, sce]
               hier[level+1] = species       # sanitize(species)
               hier[level+2] = org
               ary = hier[1, level+2]
               warn ary.inspect if $DEBUG
               add_to_tree(ary)
               add_to_leaves(ary)
               add_to_path(ary)
             end
           end
         end
       end
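
As a rough illustration of the input the constructor walks through, the sketch below parses a few made-up lines of the same shape: the number of ‘#’ marks gives the nesting depth, and organism lines are tab-separated with the organism code in the second column and the name in the third. It does not use Bio::KEGG::Taxonomy, and it leaves out the label sanitizing and the species-name collision handling; the sample lines, the T-numbers and the parse_paths helper are hypothetical.

```ruby
# Hypothetical sample data in the shape the constructor parses.
SAMPLE = <<~TEXT
  # Eukaryotes
  ## Fungi
  T00001\tsce\tS.cerevisiae\tSaccharomyces cerevisiae
  ## Plants
  T00002\tath\tA.thaliana\tArabidopsis thaliana
TEXT

# Hypothetical helper: returns one full path per organism line.
def parse_paths(text)
  hier = []
  level = 0
  paths = []
  text.each_line do |line|
    next if line.strip.empty?

    if line.start_with?('#')
      # hierarchy line: depth is the number of leading '#' marks
      level = line[/^#+/].length
      hier[level] = line[/[A-Za-z].*/]
    else
      # organism line: id, organism code, name, description
      _tax, org, name, _desc = line.chomp.split("\t")
      hier[level + 1] = name.split('_').first
      hier[level + 2] = org
      paths << hier[1, level + 2]
    end
  end
  paths
end

parse_paths(SAMPLE).each { |path| puts path.join(' / ') }
```

Each resulting path is the kind of array that the constructor hands to add_to_tree, add_to_leaves and add_to_path.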

Public Instance methods

Adds a new path [node, subnode, subsubnode, …, leaf] under the root node and records the leaf node in an Array for every node on the path.

     # File lib/bio/db/kegg/taxonomy.rb, line 140
       def add_to_leaves(ary)
         leaf = ary.last
         ary.each do |node|
           @leaves[node] ||= Array.new
           @leaves[node] << leaf
         end
       end
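
To make the resulting structure concrete, here is a small standalone sketch of the same bookkeeping on two made-up paths. The local variable names and the sample paths are assumptions for illustration only.

```ruby
# Hypothetical paths; the last element of each is the leaf (organism code).
paths = [
  %w[Eukaryotes Fungi sce],
  %w[Eukaryotes Plants ath]
]

leaves = {}
paths.each do |path|
  leaf = path.last
  # every node on the path, including the leaf itself, collects the leaf
  path.each { |node| (leaves[node] ||= []) << leaf }
end

p leaves['Eukaryotes']
p leaves['Fungi']
```

Because the loop runs over the whole path, a leaf also ends up with an entry that lists only itself.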

Adds a new path [node, subnode, subsubnode, …, leaf] under the root node and stores the path itself in an Array.

     # File lib/bio/db/kegg/taxonomy.rb, line 150
       def add_to_path(ary)
         @path << ary
       end

Adds a new path [node, subnode, subsubnode, …, leaf] under the root node; every intermediate node stores its child nodes as a Hash.

     # File lib/bio/db/kegg/taxonomy.rb, line 129
       def add_to_tree(ary)
         parent = @root
         ary.each do |node|
           @tree[parent] ||= Hash.new
           @tree[parent][node] = nil
           parent = node
         end
       end
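
The flat Hash layout is easier to see on a toy example. The sketch below repeats the same insertion logic outside the class; the add_path helper, the ROOT constant and the sample paths are hypothetical names used only here.

```ruby
ROOT = 'Genes'

# Hypothetical helper mirroring the insertion logic on a plain Hash.
def add_path(tree, path, root = ROOT)
  parent = root
  path.each do |node|
    # each parent name maps to a Hash whose keys are its children
    (tree[parent] ||= {})[node] = nil
    parent = node
  end
  tree
end

tree = {}
add_path(tree, %w[Eukaryotes Fungi sce])
add_path(tree, %w[Eukaryotes Plants ath])

tree.each { |parent, children| puts "#{parent} -> #{children.keys.join(', ')}" }
```

Leaf nodes never appear as top-level keys, which is how the traversal methods tell leaves from intermediate nodes. Since node names are the keys, two different nodes with the same name would be merged, which is why the constructor renames colliding species.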

Compacts intermediate nodes of the resulting taxonomic tree.

 - If a child node has only one child (a grandchild) and that grandchild
   has children of its own, the grandchild is removed and its children
   are attached directly to the child node.
 ex.
   Plants / Monocotyledons / grass family / osa
   --> Plants / Monocotyledons / osa

     # File lib/bio/db/kegg/taxonomy.rb, line 161
       def compact(node = root)
         # if the node has children
         if subnodes = @tree[node]
           # obtain grandchildren for each child
           subnodes.keys.each do |subnode|
             if subsubnodes = @tree[subnode]
               # if the number of grandchild nodes is 1
               if subsubnodes.keys.size == 1
                 # obtain the name of the grandchild node
                 subsubnode = subsubnodes.keys.first
                 # obtain the children of the grandchild node
                 if subsubsubnodes = @tree[subsubnode]
                   # make the children of the grandchild node children of the child node
                   @tree[subnode] = subsubsubnodes
                   # delete grandchild node
                   @tree[subnode].delete(subsubnode)
                   warn "--- compact: #{subsubnode} is replaced by #{subsubsubnodes}" if $DEBUG
                   # re-examine this child, since the new grandchild may also need
                   # compacting (a bare `retry` outside rescue is a syntax error
                   # in Ruby 1.9 and later, so `redo` is used here)
                   redo
                 end
               end
             end
             # repeat recursively
             compact(subnode)
           end
         end
       end
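
The effect of compaction can be tried on a plain Hash without the class. The following is a minimal sketch, not the library's implementation: compact_tree! is a hypothetical helper that uses an explicit loop in place of restarting the iteration, and the sample tree reproduces the example above.

```ruby
# Hypothetical standalone version of the compaction step.
def compact_tree!(tree, node)
  children = tree[node]
  return tree unless children

  children.each_key do |child|
    # Skip over a lone grandchild for as long as it has children of its own.
    loop do
      grandchildren = tree[child]
      break unless grandchildren && grandchildren.size == 1

      below = tree[grandchildren.keys.first]
      break unless below

      tree[child] = below
    end
    compact_tree!(tree, child)
  end
  tree
end

tree = {
  'Plants' => { 'Monocotyledons' => nil },
  'Monocotyledons' => { 'grass family' => nil },
  'grass family' => { 'osa' => nil }
}
compact_tree!(tree, 'Plants')
p tree['Monocotyledons']
```

In this sketch the entry for the skipped node stays in the Hash but is no longer reachable from the root.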

Traverses the taxonomic tree depth-first, starting from the given (root or intermediate) node.

     # File lib/bio/db/kegg/taxonomy.rb, line 224
       def dfs(parent, &block)
         if children = @tree[parent]
           yield parent, children
           children.keys.each do |child|
             dfs(child, &block)
           end
         end
       end

Similar to dfs, but also passes the current nesting level to the block.

     # File lib/bio/db/kegg/taxonomy.rb, line 235
       def dfs_with_level(parent, &block)
         @level ||= 0
         if children = @tree[parent]
           yield parent, children, @level
           @level += 1
           children.keys.each do |child|
             dfs_with_level(child, &block)
           end
           @level -= 1
         end
       end
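
For comparison, the same traversal can be written on a plain Hash with the depth passed as an argument instead of being kept in an instance variable. This is only a sketch of the idea; each_node_with_level and the sample tree are hypothetical.

```ruby
# Hypothetical standalone traversal: yields each node that has children,
# together with its children and its depth below the starting node.
def each_node_with_level(tree, parent, level = 0, &block)
  children = tree[parent]
  return unless children

  yield parent, children, level
  children.each_key do |child|
    each_node_with_level(tree, child, level + 1, &block)
  end
end

tree = {
  'Genes' => { 'Eukaryotes' => nil },
  'Eukaryotes' => { 'Fungi' => nil, 'Plants' => nil },
  'Fungi' => { 'sce' => nil },
  'Plants' => { 'ath' => nil }
}

visited = []
each_node_with_level(tree, 'Genes') do |node, children, level|
  visited << [node, level]
  puts "#{'  ' * level}#{node} (#{children.size})"
end
```

As in dfs and dfs_with_level, leaf nodes are not yielded on their own, because they have no entry in the Hash; they only appear among the children of their parent.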

     # File lib/bio/db/kegg/taxonomy.rb, line 123
       def organisms(group)
         @leaves[group]
       end

Reduces leaf nodes of the resulting taxonomic tree.

 - If a parent node has only one child and that child is a leaf node,
   the parent node is replaced with the leaf node.
 ex.
  Plants / Monocotyledons / osa
  --> Plants / osa

     # File lib/bio/db/kegg/taxonomy.rb, line 196
       def reduce(node = root)
         # if the node has children
         if subnodes = @tree[node]
           # obtain grandchildren for each child
           subnodes.keys.each do |subnode|
             if subsubnodes = @tree[subnode]
               # if the number of grandchild nodes is 1
               if subsubnodes.keys.size == 1
                 # obtain the name of the grandchild node
                 subsubnode = subsubnodes.keys.first
                 # if the grandchild node is a leaf node
                 unless @tree[subsubnode]
                   # make the grandchild node a child node
                   @tree[node].update(subsubnodes)
                   # delete child node
                   @tree[node].delete(subnode)
                   warn "--- reduce: #{subnode} is replaced by #{subsubnode}" if $DEBUG
                 end
               end
             end
             # repeat recursively
             reduce(subnode)
           end
         end
       end
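
The reduction step can likewise be tried on a plain Hash. The sketch below follows the same logic outside the class; reduce_tree! is a hypothetical helper and the sample tree reproduces the example above.

```ruby
# Hypothetical standalone version of the reduction step.
def reduce_tree!(tree, node)
  children = tree[node]
  return tree unless children

  # iterate over a snapshot of the keys, because children is modified below
  children.keys.each do |child|
    grandchildren = tree[child]
    if grandchildren && grandchildren.size == 1 && !tree[grandchildren.keys.first]
      # the only grandchild is a leaf: lift it up and drop the child
      children.update(grandchildren)
      children.delete(child)
    end
    reduce_tree!(tree, child)
  end
  tree
end

tree = {
  'Plants' => { 'Monocotyledons' => nil },
  'Monocotyledons' => { 'osa' => nil }
}
reduce_tree!(tree, 'Plants')
p tree['Plants']
```

A child with two or more leaves is left alone, so only single-organism groups are collapsed.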

Converts the taxonomic tree structure to simple ASCII art.

     # File lib/bio/db/kegg/taxonomy.rb, line 248
       def to_s
         result = "#{@root}\n"
         @tree[@root].keys.each do |node|
           result += ascii_tree(node, "  ")
         end
         return result
       end
