Class Ai4r::Data::DataSet
In: lib/ai4r/data/data_set.rb
Parent: Object

A data set is a collection of N data items. Each data item is described by a set of attributes, represented as an array. Optionally, you can assign a label to the attributes, using the data_labels property.

Attributes

data_items  [R] 
data_labels  [R] 

Public Class methods

Creates a new DataSet, empty by default. Optionally, you can provide the initial data items and data labels.

e.g. DataSet.new(:data_items => data_items, :data_labels => labels)

If you provide data items but no data labels, the data set uses the default data label values (see set_data_labels).

Public Instance methods

Adds a data item to the data set.

Retrieve a new DataSet, with the item(s) selected by the provided index. You can specify an index range, too.
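The selection rule can be sketched in plain Ruby. This is an illustration only, not the library's code: select_items is a hypothetical helper, and it assumes the data items are stored as an array of rows.

```ruby
# Hypothetical helper (not part of ai4r): picks rows by index or by range
# and always returns an array of rows, so the result can seed a new data set.
def select_items(data_items, index)
  selected = data_items[index]
  # A single Integer index yields one row; wrap it so the result is a list of rows.
  index.is_a?(Range) ? selected : [selected]
end

items = [
  ['New York', '<30',     'M', 'Y'],
  ['Chicago',  '<30',     'F', 'Y'],
  ['Chicago',  '[30-50)', 'M', 'N']
]

one  = select_items(items, 1)     # one row, still wrapped in an array
some = select_items(items, 0..1)  # the first two rows
```

Wrapping the single-index case keeps the result shape identical for both kinds of argument.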

Returns the domain of an attribute. The parameter can be an attribute label or a 0-based index. The return value depends on the attribute type:

  • Set instance containing all possible values for nominal attributes
  • Array with min and max values for numeric attributes (i.e. [min, max])

    build_domain("city")
    > #<Set: {"New York", "Chicago"}>

    build_domain("age")
    > [5, 85]

    build_domain(2) # In this example, the third attribute is gender
    > #<Set: {"M", "F"}>
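A minimal sketch of how such a domain could be computed, in plain Ruby. The helpers attribute_index and domain_of are hypothetical and are not the library's implementation. The sketch assumes an attribute counts as numeric when every value in its column is a Numeric.

```ruby
require 'set'

# Hypothetical helper: resolves a label or a 0-based index to a column position.
def attribute_index(labels, attr)
  attr.is_a?(Integer) ? attr : labels.index(attr)
end

# Hypothetical helper: [min, max] for a numeric column,
# otherwise a Set of the distinct values.
def domain_of(items, labels, attr)
  position = attribute_index(labels, attr)
  column = items.map { |row| row[position] }
  if column.all? { |value| value.is_a?(Numeric) }
    [column.min, column.max]
  else
    Set.new(column)
  end
end

labels = ['city', 'age', 'gender']
items = [
  ['New York', 5,  'M'],
  ['Chicago',  85, 'F'],
  ['Chicago',  40, 'M']
]

city_domain   = domain_of(items, labels, 'city')  # Set of the two city names
age_domain    = domain_of(items, labels, 'age')   # [min, max] pair
gender_domain = domain_of(items, labels, 2)       # looked up by index
```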

Returns an array with the domain of each attribute:

  • Set instance containing all possible values for nominal attributes
  • Array with min and max values for numeric attributes (i.e. [min, max])

Return example:

    [#<Set: {"New York", "Chicago"}>,
     #<Set: {"<30", "[30-50)", "[50-80]", ">80"}>,
     #<Set: {"M", "F"}>,
     [5, 85],
     #<Set: {"Y", "N"}>]

Raises an exception if the data set contains no data items.

Returns the index of a given attribute (0-based). For example, if "gender" is the third attribute, then:

  get_index("gender")
  => 2

Returns an array with the mean value of each numeric attribute and the most frequent value of each non-numeric attribute.
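The per-attribute rule can be sketched as follows. The helpers mean_or_mode and means_or_modes are hypothetical names, not the library's code, and the sketch assumes a column is numeric when all of its values are Numeric.

```ruby
# Hypothetical helper: mean of a numeric column, most frequent value otherwise.
def mean_or_mode(column)
  if column.all? { |value| value.is_a?(Numeric) }
    column.sum.to_f / column.length
  else
    # Count occurrences of each value, then pick the one seen most often.
    counts = Hash.new(0)
    column.each { |value| counts[value] += 1 }
    counts.max_by { |_value, count| count }.first
  end
end

# Hypothetical helper: applies mean_or_mode to every attribute (column).
def means_or_modes(items)
  items.transpose.map { |column| mean_or_mode(column) }
end

items = [
  ['New York', 20, 'M'],
  ['Chicago',  30, 'M'],
  ['Chicago',  40, 'F']
]

summary = means_or_modes(items)
```

When two values are equally frequent, which one this sketch returns is not specified.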

Loads data items from a CSV file.

Loads data items from a CSV file. The first row is used as data labels.

Returns the number of attributes, including the class attribute.

Opens a CSV file and reads it line by line. For each line, the given block is called with the row as its argument. Safe for both Ruby 1.8 and 1.9.

Same as load_csv, but it will try to convert cell contents to numbers.
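The numeric conversion can be sketched with Ruby's standard csv library. The helpers to_number_if_possible and parse_csv_text are hypothetical, and the patterns used to recognise numbers are an assumption, not the library's actual rule. The sketch parses a string rather than a file to stay self-contained.

```ruby
require 'csv'

# Hypothetical helper: converts a cell to Integer or Float when it looks like
# a plain decimal number, and returns it unchanged otherwise.
def to_number_if_possible(cell)
  case cell
  when /\A-?\d+\z/      then cell.to_i
  when /\A-?\d+\.\d+\z/ then cell.to_f
  else cell
  end
end

# Hypothetical helper: the first row becomes the labels,
# the remaining rows become data items with numeric cells converted.
def parse_csv_text(text)
  rows = CSV.parse(text)
  labels = rows.shift
  items = rows.map { |row| row.map { |cell| to_number_if_possible(cell) } }
  [labels, items]
end

csv_text = "city,age,gender\nNew York,5,M\nChicago,85.5,F\n"
labels, items = parse_csv_text(csv_text)
```

Cells that do not match either pattern, including empty cells, are left as they are.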

Set the data items. M data items with N attributes must have the following format:

    [   [ATT1_VAL1, ATT2_VAL1, ATT3_VAL1, ... , ATTN_VAL1,  CLASS_VAL1],
        [ATT1_VAL2, ATT2_VAL2, ATT3_VAL2, ... , ATTN_VAL2,  CLASS_VAL2],
        ...
        [ATT1_VALM, ATT2_VALM, ATT3_VALM, ... , ATTN_VALM,  CLASS_VALM],
    ]

e.g.

    [   ['New York', '<30',     'M', 'Y'],
        ['Chicago',  '<30',     'M', 'Y'],
        ['Chicago',  '<30',     'F', 'Y'],
        ['New York', '<30',     'M', 'Y'],
        ['New York', '<30',     'M', 'Y'],
        ['Chicago',  '[30-50)', 'M', 'Y'],
        ['New York', '[30-50)', 'F', 'N'],
        ['Chicago',  '[30-50)', 'F', 'Y'],
        ['New York', '[30-50)', 'F', 'N'],
        ['Chicago',  '[50-80]', 'M', 'N'],
        ['New York', '[50-80]', 'F', 'N'],
        ['New York', '[50-80]', 'M', 'N'],
        ['Chicago',  '[50-80]', 'M', 'N'],
        ['New York', '[50-80]', 'F', 'N'],
        ['Chicago',  '>80',     'F', 'Y']
    ]

This method returns the data set (self), allowing method chaining.

Set data labels. Data labels must have the following format:

    [ 'city', 'age_range', 'gender', 'marketing_target'  ]

If you do not provide labels for your data, the following labels will be created by default:

    [ 'attribute_1', 'attribute_2', 'attribute_3', 'class_value'  ]
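The default naming scheme above can be reproduced with a short sketch. The helper default_labels is hypothetical and not the library's code. It assumes a non-empty list of data items whose last column is the class value.

```ruby
# Hypothetical helper: numbers every attribute except the last one,
# which is treated as the class value.
def default_labels(data_items)
  attribute_count = data_items.first.length - 1
  (1..attribute_count).map { |i| "attribute_#{i}" } + ['class_value']
end

items = [
  ['New York', '<30', 'M', 'Y'],
  ['Chicago',  '>80', 'F', 'Y']
]

labels = default_labels(items)
```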

Protected Instance methods
