Filtering (filter)

Filters select subsets of instances. They are most typically used to select data instances from a table, for example to drop all instances that have no class value:

filtered = Orange.data.filter.HasClassValue(data)

Despite this typical use, filters operate on individual instances, not on the entire data table: they can be called with an instance and return True or False to accept or reject it. Most examples below use them this way for the sake of demonstration.

An alternative way to apply a filter is to call Orange.data.Table.filter on the data table.

All filters are derived from the base class Filter.

class Orange.data.filter.Filter

Abstract base class for filters.

negate

Inverts the selection. Defaults to False.

domain

Domain to which data instances are converted before checking.

__call__(instance)

Check whether the instance matches the filter’s criterion and return either True or False.

__call__(data)

Return a new data table containing the instances that match the criterion.
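The two calling conventions and the negate flag can be modeled in plain Python. This is only a sketch of the behavior described above, not Orange's implementation: it does not import Orange, and SketchFilter, the dict-based rows and the use of None for unknown values are all assumptions made for illustration.

```python
# Plain-Python model of the filter calling convention; does not use Orange.
class SketchFilter(object):
    """Hypothetical stand-in for a filter: a predicate plus a negate flag."""

    def __init__(self, predicate, negate=False):
        self.predicate = predicate
        self.negate = negate

    def __call__(self, arg):
        # A list plays the role of a data table: return the matching rows.
        if isinstance(arg, list):
            return [row for row in arg if self(row)]
        # Anything else is treated as a single instance: return True or False.
        return bool(self.predicate(arg)) != self.negate


# Rows are dicts in this sketch; None marks an unknown value.
rows = [{"age": "young", "y": "soft"}, {"age": "young", "y": None}]
has_class = SketchFilter(lambda row: row["y"] is not None)
no_class = SketchFilter(lambda row: row["y"] is not None, negate=True)

kept = has_class(rows)      # rows with a known class value
dropped = no_class(rows)    # the complementary selection
```

Calling has_class(rows[0]) returns a boolean, while has_class(rows) returns a new list, mirroring the two forms of __call__ listed above.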

Filtering missing data

class Orange.data.filter.IsDefined

Selects instances for which all feature values are defined.

check

A list of booleans specifying which features to check. Each element corresponds to a feature in the domain. By default, check is None, meaning that all features are checked. The list is initialized to a list of True values when the filter’s domain is set, unless the list already exists. The list can be indexed by ordinary integers (for example, check[0]); if domain is set, feature names or descriptors can also be used as indices.

data = Orange.data.Table("lenses")
data2 = data[:5]
data2[0]["age"] = "?"
data2[1].setclass("?")
print "First five instances"
for ex in data2:
    print ex

print "\nInstances without unknown values"
f = Orange.data.filter.IsDefined(domain = data.domain)
for ex in f(data2):
    print ex

print "\nInstances without unknown values, ignoring 'age'"
f.check["age"] = 0
for ex in f(data2):
    print ex

print "\nInstances with unknown values (ignoring age)"
for ex in f(data2, negate=1):
    print ex
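The effect of the check list can also be shown without Orange. The sketch below is a plain-Python model under stated assumptions: rows are lists, None marks an unknown value, and is_defined is a hypothetical helper rather than part of the library.

```python
# Plain-Python model of the `check` list; does not use Orange.
def is_defined(row, check=None):
    """Return True if every checked position of `row` holds a known value.

    `row` is a list of values with None marking an unknown value (an
    assumption of this sketch). `check` is a list of booleans, one per
    position; None means that all positions are checked.
    """
    if check is None:
        check = [True] * len(row)
    return all(value is not None
               for value, checked in zip(row, check) if checked)


rows = [
    [None, "myope", "no"],      # first value unknown
    ["young", "myope", "yes"],  # fully defined
]
complete = [row for row in rows if is_defined(row)]
ignoring_first = [row for row in rows if is_defined(row, [False, True, True])]
```

With all positions checked only the second row survives; once the first position is excluded from the check, both rows are kept.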

class Orange.data.filter.HasClassValue

Selects instances with a defined class value. Setting negate inverts the selection and chooses instances with an unknown class.

data = Orange.data.Table("lenses")
print "\nInstances with defined class (HasClassValue)"
for ex in Orange.data.filter.HasClassValue(data2):
    print ex

print "\nInstances with undefined class (HasClassValue)"
for ex in Orange.data.filter.HasClassValue(data2, negate=1):
    print ex

class Orange.data.filter.HasMeta

Filters out instances that do not have a meta attribute with the given id.

id

The id of the meta attribute to look for.

This filter is especially useful with instances from basket files, which have optional meta attributes. If they come, for example, from a text mining domain, we can use it to get the documents that contain a specific word:

data = Orange.data.Table("inquisition")
surprised = Orange.data.filter.HasMeta(data, id=data.domain.index("surprise"))

Random filter

class Orange.data.filter.Random

Accepts an instance with a given probability.

prob

Probability for accepting an instance.

random_generator

The random number generator used for making selections. If not set before filtering, a new generator is constructed and stored here for later use. If the attribute is set to an integer, Orange constructs a random generator and uses the integer as a seed.

randomfilter = Orange.data.filter.Random(prob = 0.7, random_generator = 24)
for i in range(10):
    print randomfilter(instance),

The output is:

1 0 0 0 1 1 0 1 0 1

Although the probability of selecting an instance is set to 0.7, the filter accepted five out of ten instances, since the decision is made for each instance separately. To select exactly 70% of the instances (up to rounding), use SubsetIndices2.

Setting the random generator ensures that the filter will always select the same instances. Setting random_generator=24 is a shortcut for random_generator = Orange.misc.Random(initseed=24).
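The difference between per-instance acceptance and selecting an exact proportion, and the effect of a fixed seed, can be sketched with Python's standard random module. The module only stands in for Orange's generator here, so the decisions it produces will not match the output shown above; bernoulli_select and exact_select are hypothetical helper names.

```python
# Plain-Python illustration using the standard library; does not use Orange.
import random


def bernoulli_select(n, prob, seed):
    """Accept each of n instances independently with probability prob."""
    rng = random.Random(seed)
    return [rng.random() < prob for _ in range(n)]


def exact_select(n, prob, seed):
    """Accept exactly round(n * prob) of n instances, chosen at random."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(n), int(round(n * prob))))
    return [i in chosen for i in range(n)]


first = bernoulli_select(10, 0.7, seed=24)
second = bernoulli_select(10, 0.7, seed=24)  # same seed, same decisions
exact = exact_select(10, 0.7, seed=24)       # always seven acceptances
```

The independent version may accept more or fewer than seven instances on any given run, whereas the exact version always accepts seven; in both cases reusing the seed reproduces the selection.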

Filtering by single features

class Orange.data.filter.SameValue

Fast filter for selecting instances with a particular value of a feature.

position

Index of feature in the Domain as returned by Orange.data.Domain.index.

value

The feature’s value.

The following example selects instances with age="young" from the data set lenses:

filteryoung = Orange.data.filter.SameValue()
age = data.domain["age"]
filteryoung.value = Orange.data.Value(age, "young")
filteryoung.position = data.domain.features.index(age)
print "\nYoung instances"
for ex in filteryoung(data):
    print ex

data.domain.features behaves as a list and provides the method index, which is used to retrieve the position of the feature age. The feature age is also used to construct a Value.

Filtering by multiple features

The Values filter selects instances by the values of multiple features. Each condition is given as a subfilter derived from Orange.data.filter.ValueFilter.

class Orange.data.filter.Values

conditions

A list of conditions described by instances of classes derived from Orange.data.filter.ValueFilter.

conjunction

Indicates whether the filter computes a conjunction or a disjunction of the conditions. If True, an instance is accepted if no condition rejects it. If False, an instance is accepted if at least one condition accepts it.

The attribute conditions contains subfilter instances of the following classes.

class Orange.data.filter.ValueFilter

The abstract base class for subfilters.

position

The position of the feature in the domain (as returned by, for instance, Orange.data.Domain.index).

accept_special

Determines whether undefined values are accepted (1), rejected (0) or ignored (-1, default).

class Orange.data.filter.ValueFilterDiscrete

Subfilter for values of discrete features.

values

A list of accepted values with elements of type Value.

class Orange.data.filter.ValueFilterContinuous

Subfilter for values of continuous features.

min / ref

Lower bound of the interval (min and ref are aliases for the same attribute).

max

Upper bound of the interval.

oper

Comparison operator; should be one of the following: ValueFilter.Equal, ValueFilter.Less, ValueFilter.LessEqual, ValueFilter.Greater, ValueFilter.GreaterEqual, ValueFilter.Between, ValueFilter.Outside.

Attributes min and max define the interval for operators ValueFilter.Between and ValueFilter.Outside and ref (which is the same as min) for the others.
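How ref, max and oper interact can be modeled in a few lines of plain Python. This is a sketch rather than Orange's code: the operator names are plain strings chosen to mirror the constants listed above, accepts is a hypothetical helper, and treating both interval bounds as inclusive is an assumption.

```python
# Plain-Python model of the comparison operators; does not use Orange.
import operator

# Keys mirror the constant names listed above; the strings themselves are
# placeholders invented for this sketch.
SIMPLE_OPERATORS = {
    "Equal": operator.eq,
    "Less": operator.lt,
    "LessEqual": operator.le,
    "Greater": operator.gt,
    "GreaterEqual": operator.ge,
}


def accepts(value, oper, ref, upper=None):
    """Compare value against ref, or against the interval [ref, upper]."""
    if oper == "Between":
        # Assumption: both interval bounds are inclusive.
        return ref <= value <= upper
    if oper == "Outside":
        return not (ref <= value <= upper)
    # All remaining operators use only the reference value.
    return SIMPLE_OPERATORS[oper](value, ref)


in_range = accepts(5.0, "Between", 1.0, 10.0)
out_of_range = accepts(11.0, "Outside", 1.0, 10.0)
below = accepts(3.0, "Less", 5.0)
```

Only Between and Outside read the upper bound; every other operator compares the value with ref alone.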

class Orange.data.filter.ValueFilterString

Subfilter for values of string features.

min / ref

Lower bound of the interval (min and ref are aliases for the same attribute).

max

Upper bound of the interval.

oper

Comparison operator; should be one of the following: ValueFilter.Equal, ValueFilter.Less, ValueFilter.LessEqual, ValueFilter.Greater, ValueFilter.GreaterEqual, ValueFilter.Between, ValueFilter.Outside, Contains, NotContains, BeginsWith, EndsWith.

case_sensitive

Tells whether the comparisons are case sensitive. Default is True.

Attributes min and max define the interval for operators ValueFilter.Between and ValueFilter.Outside and ref (which is the same as min) for the others.
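The string-specific operators and the case_sensitive flag can be illustrated in plain Python. The sketch below does not use Orange; string_accepts is a hypothetical helper, the operator names are placeholder strings, and modeling case-insensitive comparison by lowercasing both sides is an assumption.

```python
# Plain-Python model of string comparisons; does not use Orange.
def string_accepts(value, oper, ref, case_sensitive=True):
    """Apply one string operator; oper is a placeholder name."""
    if not case_sensitive:
        # Case-insensitive comparison is modeled by lowercasing both sides.
        value, ref = value.lower(), ref.lower()
    if oper == "Equal":
        return value == ref
    if oper == "Contains":
        return ref in value
    if oper == "NotContains":
        return ref not in value
    if oper == "BeginsWith":
        return value.startswith(ref)
    if oper == "EndsWith":
        return value.endswith(ref)
    raise ValueError("unsupported operator: %s" % oper)


strict = string_accepts("Inquisition", "BeginsWith", "inq")
relaxed = string_accepts("Inquisition", "BeginsWith", "inq",
                         case_sensitive=False)
```

The same prefix test is rejected with the default case-sensitive comparison and accepted once case is ignored.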

class Orange.data.filter.ValueFilterStringList

Accepts string values from the list.

values

A list of accepted strings.

case_sensitive

Tells whether the comparisons are case sensitive. Default is True.

The following script selects instances whose age is “young” or “presbyopic” and which are “astigmatic”. Unknown values are ignored: if the value of one of the two features is missing, only the other is checked, and if both are missing, the instance is accepted.

fya = Orange.data.filter.Values()
age, astigm = data.domain["age"], data.domain["astigmatic"]
fya.domain = data.domain
fya.conditions.append(
    Orange.data.filter.ValueFilterDiscrete(
        position=data.domain.features.index(age),
        values=[Orange.data.Value(age, "young"),
                Orange.data.Value(age, "presbyopic")])
)
fya.conditions.append(
    Orange.data.filter.ValueFilterDiscrete(
        position = data.domain.features.index(astigm),
        values=[Orange.data.Value(astigm, "yes")]))
for ex in fya(data):
    print ex

The filter is first constructed and assigned a domain. Then both conditions are appended to the filter’s conditions field. Both are of type ValueFilterDiscrete, since the two attributes are discrete. The position of each attribute is obtained in the same way as for SameValue, described above.

The list of conditions can also be given to the filter constructor. The following filter accepts instances whose age is “young” or “presbyopic” or which are astigmatic (conjunction = 0). In contrast to the filter above, an unknown age is not acceptable (but instances with unknown age can still be accepted if they are astigmatic). Instances with unknown astigmatism are always accepted.

fya = Orange.data.filter.Values(domain=data.domain, conditions=
    [
    Orange.data.filter.ValueFilterDiscrete(
        position=data.domain.features.index(age),
        values=[Orange.data.Value(age, "young"),
                Orange.data.Value(age, "presbyopic")
                ], accept_special=0),
    Orange.data.filter.ValueFilterDiscrete(
        position=data.domain.features.index(astigm),
        values=[Orange.data.Value(astigm, "yes")], accept_special=1)
    ],
    conjunction = 0
)
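The interplay of conjunction and accept_special described in this section can be modeled in plain Python. The sketch follows the wording of the attribute descriptions above and is not taken from Orange's source: condition_verdict and values_filter are hypothetical helpers, rows are dicts, and None marks an unknown value.

```python
# Plain-Python model of Values with discrete subfilters; does not use Orange.
def condition_verdict(value, accepted, accept_special=-1):
    """Return True (accept), False (reject) or None (ignore) for one value."""
    if value is None:  # None marks an unknown value in this sketch
        if accept_special == -1:
            return None
        return accept_special == 1
    return value in accepted


def values_filter(row, conditions, conjunction=True):
    """conditions: list of (key, accepted values, accept_special) tuples."""
    verdicts = [condition_verdict(row[key], accepted, special)
                for key, accepted, special in conditions]
    if conjunction:
        # Accepted if no condition rejects; ignored conditions do not count.
        return not any(v is False for v in verdicts)
    # Accepted if at least one condition accepts.
    return any(v is True for v in verdicts)


# Unknown values ignored for both features (conjunction).
both_ignored = [("age", {"young", "presbyopic"}, -1),
                ("astigmatic", {"yes"}, -1)]
# Unknown age rejected, unknown astigmatism accepted (used as a disjunction).
strict_age = [("age", {"young", "presbyopic"}, 0),
              ("astigmatic", {"yes"}, 1)]

unknown_both = {"age": None, "astigmatic": None}
unknown_age = {"age": None, "astigmatic": "no"}
unknown_astigm = {"age": "pre-presbyopic", "astigmatic": None}
```

Under these assumptions, values_filter(unknown_both, both_ignored) accepts the row, values_filter(unknown_age, strict_age, conjunction=False) rejects it, and values_filter(unknown_astigm, strict_age, conjunction=False) accepts it.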

Composition of filters

Filters can be combined into conjunctions or disjunctions using the following descendants of Filter. It is possible to build hierarchies of filters (for example, a disjunction of conjunctions).

class Orange.data.filter.FilterConjunction

Rejects the instance if any of the combined filters rejects it. The conjunction can be negated using the inherited negate flag.

filters

A list of filters (instances of Filter).

class Orange.data.filter.FilterDisjunction

Accepts the instance if any of the combined filters accepts it. The disjunction can be negated using the inherited negate flag.

filters

A list of filters (instances of Filter).
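A hierarchy such as a disjunction of conjunctions can be sketched with ordinary Python predicates. The code below does not use Orange: conjunction and disjunction are hypothetical helpers that mimic the behavior described above, and rows are plain dicts.

```python
# Plain-Python model of filter composition; does not use Orange.
def conjunction(filters, negate=False):
    """Accept an instance only if every filter accepts it."""
    return lambda instance: all(f(instance) for f in filters) != negate


def disjunction(filters, negate=False):
    """Accept an instance if at least one filter accepts it."""
    return lambda instance: any(f(instance) for f in filters) != negate


def is_young(row):
    return row["age"] == "young"


def is_astigmatic(row):
    return row["astigmatic"] == "yes"


def is_myope(row):
    return row["prescription"] == "myope"


# Disjunction of a conjunction and a plain predicate:
# (young and astigmatic) or myope.
combined = disjunction([conjunction([is_young, is_astigmatic]), is_myope])
# The same hierarchy with the outer result inverted.
inverted = disjunction([conjunction([is_young, is_astigmatic]), is_myope],
                       negate=True)
```

Because each helper returns a callable with the same shape as its inputs, the combinations can be nested to any depth.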