Loading and saving data
Tab-delimited format
Orange prefers to open data files in its native, tab-delimited format. This format lets us specify feature types and optional flags along with the feature names, which can often result in shorter loading times. This additional data is provided in the form of a 3-line header: the first line contains the variable names, the second line their types, and the third line optional flags.
Example of the iris dataset in tab-delimited format (iris.tab):
sepal length    sepal width    petal length    petal width    iris
c               c              c               c              d
                                                              class
5.1             3.5            1.4             0.2            Iris-setosa
4.9             3.0            1.4             0.2            Iris-setosa
4.7             3.2            1.3             0.2            Iris-setosa
4.6             3.1            1.5             0.2            Iris-setosa
Feature types
- discrete (or d) - imported as Orange.feature.Discrete
- continuous (or c) - imported as Orange.feature.Continuous
- text (or string, s) - imported as Orange.feature.String
- basket - used for storing sparse data. More on basket formats in a dedicated section.
Optional flags
- ignore (or i) - feature will not be imported
- class (or c) - feature will be imported as class variable. Only one feature can be marked as class.
- multiclass - feature is one of multiple classes. Data can have both multiple classes and an ordinary class.
- meta (or m) - feature will be imported as a meta attribute.
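The header conventions above can be illustrated with a minimal sketch in plain Python. It does not use Orange and is not how Orange itself reads files; the function name read_tab_header and the two lookup tables are hypothetical, introduced only for this illustration. The sketch reads the three header lines, expands the abbreviations listed above to their long names, and checks that at most one feature is flagged as class.

```python
import csv
import io

# Abbreviations from the lists above, mapped to their long names.
# These tables are illustrative, not part of Orange.
TYPE_NAMES = {"d": "discrete", "c": "continuous", "s": "text", "string": "text"}
FLAG_NAMES = {"i": "ignore", "c": "class", "m": "meta"}


def read_tab_header(stream):
    """Return a list of (name, type, flag) tuples from a 3-line tab header."""
    reader = csv.reader(stream, delimiter="\t")
    names = next(reader)
    types = next(reader)
    flags = next(reader)
    # Trailing empty cells may be missing, so pad the shorter rows.
    types += [""] * (len(names) - len(types))
    flags += [""] * (len(names) - len(flags))
    columns = []
    for name, type_, flag in zip(names, types, flags):
        columns.append((name, TYPE_NAMES.get(type_, type_), FLAG_NAMES.get(flag, flag)))
    # Only one feature can be marked as class.
    if sum(flag == "class" for _, _, flag in columns) > 1:
        raise ValueError("only one feature can be marked as class")
    return columns


# The first lines of iris.tab, with explicit tab separators.
sample = (
    "sepal length\tsepal width\tpetal length\tpetal width\tiris\n"
    "c\tc\tc\tc\td\n"
    "\t\t\t\tclass\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
)
header = read_tab_header(io.StringIO(sample))
for column in header:
    print(column)
```

Each tuple pairs a feature name with its expanded type and flag; an empty flag means that the feature is an ordinary attribute.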
Baskets
Baskets can be used for storing sparse data in tab-delimited files. They were specifically designed for text mining needs. If text mining and sparse data are not your concern, you can skip this section.
Baskets are given as a list of space-separated <name>=<value> atoms. A continuous meta attribute named <name> will be created and added to the domain as optional if it is not already there. A meta value for that variable will be added to the example. If the value is 1, you can omit the =<value> part.
Only continuous meta attributes can be put in a basket; other types are not supported.
A tab delimited file with a basket can look like this:
K       Ca      b_foo      Ba    y
c       c       basket     c     c
        meta               i     class
0.06    8.75    a b a c    0     1
0.48            b=2 d      0     1
0.39    7.78               0     1
0.57    8.22    c=13       0     1
These are the examples read from such a file:
[0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
[0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
[0.39, 1], {"Ca":7.78}
[0.57, 1], {"Ca":8.22, "c":13.000}
It is recommended to have the basket as the last column, especially if it contains a lot of data.
Note a few things. The basket column’s name, b_foo, is not used. In the first example, the value of a is 2 since it appears twice. The ordinary meta attribute, Ca, appears in all examples, even in those where its value is undefined. Meta attributes from the basket appear only where they are defined. This is due to the different nature of these meta attributes: Ca is required while the others are optional.
>>> d.domain.getmetas()
{-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
>>> d.domain.getmetas(False)
{-22: FloatVariable 'Ca'}
>>> d.domain.getmetas(True)
{-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
To fully understand all this, you should read the documentation on meta attributes in Domain and on the basket file format (a simple format that is limited to baskets only).
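The way a single basket cell turns into meta values can be sketched in plain Python. This is only an illustration of the rules described in this section, not Orange's own parser; the function name parse_basket is hypothetical. It assumes that a repeated name accumulates its values, which matches the value of a being 2 in the first example.

```python
from collections import defaultdict


def parse_basket(cell):
    """Turn a space-separated list of <name>=<value> atoms into a dict.

    An atom without "=<value>" counts as 1, and repeated names are summed
    (an assumption consistent with the example in this section).
    """
    values = defaultdict(float)
    for atom in cell.split():
        name, sep, value = atom.partition("=")
        values[name] += float(value) if sep else 1.0
    return dict(values)


# The basket column (b_foo) of the four rows in the example file.
cells = ["a b a c", "b=2 d", "", "c=13"]
parsed = [parse_basket(cell) for cell in cells]
for values in parsed:
    print(values)
```

An empty cell yields an empty dictionary, which mirrors the third example: only the ordinary meta attribute Ca is present there.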
Basket Format
Basket files (.basket) are suitable for representing sparse data. Each example is represented by a line in the file. The line is written as a comma-separated list of name-value pairs. Here’s an example of such a file:
nobody, expects, the, Spanish, Inquisition=5
our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
to, the, Pope, and, nice, red, uniforms, oh damn
The file contains four examples. The first example has five attributes defined: “nobody”, “expects”, “the”, “Spanish” and “Inquisition”. The first four have the default value of 1.0 and the last has a value of 5.0.
Unlike in other formats supported by Orange, the attributes that appear in the domain are not defined in a header or in a separate file.
If an attribute appears more than once, its values are added. For instance, the value of attribute “surprise” in the second example is 6.0 and the value of “fear” is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0, and the latter appears twice with a value of 1.0.
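The summing rule can be checked with a short sketch in plain Python. It is not Orange's reader; the function names parse_basket_line and read_basket are hypothetical. The sketch splits each line on commas, treats an item without a value as 1.0, adds up repeated attributes, and collects the attribute names seen across all lines, since no header declares them.

```python
def parse_basket_line(line):
    """Parse one line of a .basket file into a dict of attribute values."""
    values = {}
    for item in line.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        name = name.strip()
        # A missing "=<value>" means 1.0; repeated attributes are added.
        values[name] = values.get(name, 0.0) + (float(value) if sep else 1.0)
    return values


def read_basket(text):
    """Return (examples, names): one dict per line and the set of all names."""
    examples = [parse_basket_line(line) for line in text.splitlines() if line.strip()]
    names = set()
    for example in examples:
        names.update(example)
    return examples, names


text = """nobody, expects, the, Spanish, Inquisition=5
our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
to, the, Pope, and, nice, red, uniforms, oh damn
"""
examples, names = read_basket(text)
print(examples[1]["surprise"], examples[1]["fear"])
```

Attributes that do not occur in a line are simply absent from that line's dictionary, which is the plain-Python analogue of an optional meta attribute taking no memory.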
All attributes are loaded as optional meta-attributes, so zero values don’t take any memory (unless they are given explicitly, but initialized to zero). See also the section on meta attributes in the reference for domain descriptors.
Note that, at the time of writing, only association rules can directly use examples presented in the basket format.
Other supported data formats
Orange can import data from CSV or tab-delimited files in which the first line contains attribute names and the remaining lines contain data. For such files, Orange tries to guess the feature types and treats the right-most column as the class variable. If feature types are known in advance, Orange's native tab-delimited format should be used instead.
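What guessing feature types might look like can be sketched in plain Python. The heuristic below is an assumption made for illustration and is not taken from Orange: a column is treated as continuous if every non-missing value parses as a number, and as discrete otherwise. The function names guess_type and describe_columns are hypothetical, as is treating an empty cell or "?" as missing.

```python
import csv
import io


def guess_type(values):
    """Guess 'continuous' if every non-missing value parses as a float."""
    known = [value for value in values if value not in ("", "?")]
    try:
        for value in known:
            float(value)
    except ValueError:
        return "discrete"
    return "continuous"


def describe_columns(stream, delimiter=","):
    """Return (name, guessed type, role) for each column of a headed file."""
    rows = list(csv.reader(stream, delimiter=delimiter))
    names, data = rows[0], rows[1:]
    columns = []
    for index, name in enumerate(names):
        values = [row[index] for row in data]
        # The right-most column is treated as the class variable.
        role = "class" if index == len(names) - 1 else "feature"
        columns.append((name, guess_type(values), role))
    return columns


sample = (
    "sepal length,sepal width,iris\n"
    "5.1,3.5,Iris-setosa\n"
    "4.9,3.0,Iris-setosa\n"
)
columns = describe_columns(io.StringIO(sample))
for column in columns:
    print(column)
```

A guess of this kind can go wrong, for example with a discrete feature coded as numbers, which is one reason to declare types explicitly in the tab-delimited header when they are known.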