Filtering (filter)¶
Filters select subsets of instances. They are most typically used to select data instances from a table, for example to drop all instances that have no class value:
filtered = Orange.data.filter.HasClassValue(data)
Despite this typical use, filters operate on individual instances, not the entire data table: they can be called with an instance and return True or False to accept or reject the instance. Most examples below use them like this for the sake of demonstration.
An alternative way to apply a filter is to call Orange.data.Table.filter on the data table.
All filters are derived from the base class Filter.
- class Orange.data.filter.Filter¶
Abstract base class for filters.
- domain¶
Domain to which data instances are converted before checking.
- __call__(instance)¶
Check whether the instance matches the filter’s criterion and return either True or False.
- __call__(data)
Return a new data table containing the instances that match the criterion.
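The two call forms can be pictured with a small stand-in written in plain Python. This is only a sketch of the calling convention described above, not Orange's implementation: the class SketchFilter, the dict-based instances and the use of None for unknown values are all assumptions made for illustration.

```python
class SketchFilter(object):
    """Hypothetical stand-in for a filter; not part of Orange."""

    def __init__(self, predicate, negate=False):
        self.predicate = predicate  # function: instance -> bool
        self.negate = negate        # invert the decision when True

    def accepts(self, instance):
        # Comparing with negate flips the result when negate is set.
        return bool(self.predicate(instance)) != self.negate

    def __call__(self, obj):
        # A list stands in for a data table; anything else is one instance.
        if isinstance(obj, list):
            return [inst for inst in obj if self.accepts(inst)]
        return self.accepts(obj)


# Instances are modelled as dicts; None marks an unknown value.
table = [{"age": "young", "y": "soft"}, {"age": "young", "y": None}]
has_class = SketchFilter(lambda inst: inst["y"] is not None)

single = has_class(table[0])  # per-instance call gives a bool
subset = has_class(table)     # table call gives the matching instances
```

Calling the filter with one instance gives a boolean, while calling it with the whole table gives the accepted instances.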
Filtering missing data¶
- class Orange.data.filter.IsDefined¶
Selects instances for which all feature values are defined.
- check¶
A list of bools specifying which features to check. Each element corresponds to a feature in the domain. By default, check is None, meaning that all features are checked. The list is initialized to a list of True when the filter’s domain is set, unless the list already exists. The list can be indexed by ordinary integers (for example, check[0]); if domain is set, feature names or descriptors can also be used as indices.
import Orange

data = Orange.data.Table("lenses")
data2 = data[:5]
data2[0]["age"] = "?"
data2[1].setclass("?")
print "First five instances"
for ex in data2:
    print ex
print "\nInstances without unknown values"
f = Orange.data.filter.IsDefined(domain=data.domain)
for ex in f(data2):
    print ex
print "\nInstances without unknown values, ignoring 'age'"
f.check["age"] = False
for ex in f(data2):
    print ex
print "\nInstances with unknown values (ignoring age)"
for ex in f(data2, negate=True):
    print ex
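The effect of the check list can be sketched in plain Python without Orange. The function is_defined below is a hypothetical helper, instances are modelled as plain lists, and None stands for an unknown value; it only mirrors the behaviour described for check, under those assumptions.

```python
def is_defined(instance, check=None):
    """Return True if every checked value is known (None marks unknown)."""
    if check is None:
        check = [True] * len(instance)  # default: check every feature
    return all(value is not None
               for value, checked in zip(instance, check) if checked)


rows = [
    [None, "myope", "no"],      # unknown first feature
    ["young", "myope", "yes"],  # fully defined
]

all_checked = [is_defined(r) for r in rows]
# Switching off the first position ignores the unknown value there.
skip_first = [is_defined(r, check=[False, True, True]) for r in rows]
```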
- class Orange.data.filter.HasClassValue¶
Selects instances with defined class value. Setting negate inverts the selection and chooses examples with unknown class.
data = Orange.data.Table("lenses")
# data2 is the five-instance table prepared in the previous example
print "\nInstances with defined class (HasClassValue)"
for ex in Orange.data.filter.HasClassValue(data2):
    print ex
print "\nInstances with undefined class (HasClassValue)"
for ex in Orange.data.filter.HasClassValue(data2, negate=1):
    print ex
- class Orange.data.filter.HasMeta¶
Filters out instances that do not have a meta attribute with the given id.
- id¶
The id of the meta attribute to look for.
This filter is especially useful with instances from basket files, which have optional meta attributes. If they come, for example, from a text mining domain, we can use it to get the documents that contain a specific word:
data = Orange.data.Table("inquisition")
surprised = Orange.data.filter.HasMeta(data, id=data.domain.index("surprise"))
Random filter¶
- class Orange.data.filter.Random¶
Accepts an instance with a given probability.
- prob¶
Probability for accepting an instance.
- random_generator¶
The random number generator used for making selections. If not set before filtering, a new generator is constructed and stored here for later use. If the attribute is set to an integer, Orange constructs a random generator and uses the integer as a seed.
# instance stands for any single data instance, for example data[0]
randomfilter = Orange.data.filter.Random(prob=0.7, random_generator=24)
for i in range(10):
    print randomfilter(instance),
The output is:
1 0 0 0 1 1 0 1 0 1
Although the probability of selecting an instance is set to 0.7, the filter accepted only five out of ten instances, since the decision is made for each instance separately. To select exactly 70% of the instances (up to a rounding error), use SubsetIndices2.
Setting the random generator ensures that the filter will always select the same instances. Setting random_generator=24 is a shortcut for random_generator = Orange.misc.Random(initseed=24).
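The difference between deciding for each instance separately and selecting a fixed share can be sketched with Python's standard random module. Both functions below are hypothetical helpers, not Orange code, and Python's generator will not reproduce the output shown above; the sketch only illustrates that a fixed seed makes the selection repeatable and that per-instance decisions do not guarantee an exact count.

```python
import random


def bernoulli_select(n, prob, seed):
    """Decide independently for each of n instances."""
    rng = random.Random(seed)
    return [rng.random() < prob for _ in range(n)]


def exact_select(n, prob, seed):
    """Pick exactly round(n * prob) of n instances."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(n), int(round(n * prob))))
    return [i in chosen for i in range(n)]


a = bernoulli_select(10, 0.7, seed=24)
b = bernoulli_select(10, 0.7, seed=24)  # same seed, same decisions as a
c = exact_select(10, 0.7, seed=24)      # always seven accepted out of ten
```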
Filtering by single features¶
- class Orange.data.filter.SameValue¶
Fast filter for selecting instances with a particular value of a feature.
- value¶
Feature’s value.
The following example selects instances with age="young" from the lenses data set:
filteryoung = Orange.data.filter.SameValue()
age = data.domain["age"]
filteryoung.value = Orange.data.Value(age, "young")
filteryoung.position = data.domain.features.index(age)
print "\nYoung instances"
for ex in filteryoung(data):
    print ex
data.domain.features behaves as a list and provides the method index, which is used to retrieve the position of the feature age. The feature age is also used to construct a Value.
Filtering by multiple features¶
The Values filter selects instances by the values of multiple features. The conditions are given as subfilters derived from Orange.data.filter.ValueFilter.
- class Orange.data.filter.Values¶
- conditions¶
A list of conditions described by instances of classes derived from Orange.data.filter.ValueFilter.
- conjunction¶
Indicates whether the filter computes the conjunction or the disjunction of conditions. If True, an instance is accepted if none of its values is rejected. If False, an instance is accepted if at least one value is accepted.
The attribute conditions contains subfilter instances of the following classes.
- class Orange.data.filter.ValueFilter¶
The abstract base class for subfilters.
- position¶
The position of the feature in the domain (as returned by, for instance, Orange.data.Domain.index).
- accept_special¶
Determines whether undefined values are accepted (1), rejected (0) or ignored (-1, default).
- class Orange.data.filter.ValueFilterDiscrete¶
Subfilter for values of discrete features.
- class Orange.data.filter.ValueFilterContinuous¶
Subfilter for values of continuous features.
- min / ref
Lower bound of the interval (min and ref are aliases for the same attribute).
- max¶
Upper bound of the interval.
- oper¶
Comparison operator; should be one of the following: ValueFilter.Equal, ValueFilter.Less, ValueFilter.LessEqual, ValueFilter.Greater, ValueFilter.GreaterEqual, ValueFilter.Between, ValueFilter.Outside.
Attributes min and max define the interval for operators ValueFilter.Between and ValueFilter.Outside and ref (which is the same as min) for the others.
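How ref and max are used by the different operators can be sketched in plain Python. The function matches and the string operator names are hypothetical stand-ins for the ValueFilter constants, and the sketch assumes that Between includes both bounds; Orange's own handling of the bounds may differ.

```python
import operator

# Operators that compare the value against the single reference ref.
SIMPLE = {
    "Equal": operator.eq,
    "Less": operator.lt,
    "LessEqual": operator.le,
    "Greater": operator.gt,
    "GreaterEqual": operator.ge,
}


def matches(value, oper, ref, upper=None):
    """Apply oper to value; ref doubles as the lower bound of an interval."""
    if oper in SIMPLE:
        return SIMPLE[oper](value, ref)
    if oper == "Between":   # assumed inclusive on both ends
        return ref <= value <= upper
    if oper == "Outside":
        return value < ref or value > upper
    raise ValueError("unknown operator: %r" % oper)


checks = [
    matches(5.0, "Greater", 3.0),
    matches(5.0, "Between", 3.0, 6.0),
    matches(5.0, "Outside", 3.0, 6.0),
]
```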
- class Orange.data.filter.ValueFilterString¶
Subfilter for values of string features.
- min / ref
Lower bound of the interval (min and ref are aliases for the same attribute).
- max¶
Upper bound of the interval.
- oper¶
Comparison operator; should be one of the following: ValueFilter.Equal, ValueFilter.Less, ValueFilter.LessEqual, ValueFilter.Greater, ValueFilter.GreaterEqual, ValueFilter.Between, ValueFilter.Outside, Contains, NotContains, BeginsWith, EndsWith.
- case_sensitive¶
Tells whether the comparisons are case sensitive. Default is True.
Attributes min and max define the interval for operators ValueFilter.Between and ValueFilter.Outside and ref (which is the same as min) for the others.
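The string-specific operators and the case_sensitive flag can be pictured with the following plain Python sketch. The function string_matches and the string operator names are hypothetical; lowercasing both sides is one simple way to make a comparison case insensitive and is an assumption here, not a statement about how Orange implements it.

```python
def string_matches(value, oper, ref, case_sensitive=True):
    """Apply a string operator to value using the reference string ref."""
    if not case_sensitive:
        # Assumed approach: compare lowercased copies of both strings.
        value, ref = value.lower(), ref.lower()
    if oper == "Contains":
        return ref in value
    if oper == "NotContains":
        return ref not in value
    if oper == "BeginsWith":
        return value.startswith(ref)
    if oper == "EndsWith":
        return value.endswith(ref)
    raise ValueError("unknown operator: %r" % oper)


strict = string_matches("Inquisition", "BeginsWith", "inq")
relaxed = string_matches("Inquisition", "BeginsWith", "inq",
                         case_sensitive=False)
```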
- class Orange.data.filter.ValueFilterStringList¶
Accepts string values from the list.
- values¶
A list of accepted strings.
- case_sensitive¶
Tells whether the comparisons are case sensitive. Default is True.
The following script selects instances whose age is “young” or “presbyopic” and which are “astigmatic”. Unknown values are ignored: if the value of one of the two features is missing, only the other is checked, and if both are missing, the instance is accepted.
fya = Orange.data.filter.Values()
age, astigm = data.domain["age"], data.domain["astigmatic"]
fya.domain = data.domain
fya.conditions.append(
    Orange.data.filter.ValueFilterDiscrete(
        position=data.domain.features.index(age),
        values=[Orange.data.Value(age, "young"),
                Orange.data.Value(age, "presbyopic")])
)
fya.conditions.append(
    Orange.data.filter.ValueFilterDiscrete(
        position=data.domain.features.index(astigm),
        values=[Orange.data.Value(astigm, "yes")]))
for ex in fya(data):
    print ex
The filter is first constructed and assigned a domain. Then both conditions are appended to the filter’s conditions field. Both are of type ValueFilterDiscrete, since the two attributes are discrete. Position of the attribute is obtained the same way as for SameValue described above.
The list of conditions can also be given to the filter constructor. The following filter accepts examples whose age is “young” or “presbyopic” or who are astigmatic (conjunction = 0). In contrast to the filter above, an unknown age is not acceptable (but examples with unknown age can still be accepted if they are astigmatic). Meanwhile, examples with unknown astigmatism are always accepted.
fya = Orange.data.filter.Values(
    domain=data.domain,
    conditions=[
        # an unknown age is rejected by this condition
        Orange.data.filter.ValueFilterDiscrete(
            position=data.domain.features.index(age),
            values=[Orange.data.Value(age, "young"),
                    Orange.data.Value(age, "presbyopic")],
            accept_special=0),
        # an unknown astigmatism is accepted
        Orange.data.filter.ValueFilterDiscrete(
            position=data.domain.features.index(astigm),
            values=[Orange.data.Value(astigm, "yes")],
            accept_special=1)
    ],
    conjunction=0
)
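The interplay of conjunction and accept_special described for the two filters can be modelled in plain Python. The sketch follows the wording of the attribute descriptions literally: a condition yields accept, reject or ignore, a conjunction accepts when nothing is rejected, and a disjunction accepts when at least one condition accepts. The helper names, the dict-based instances and None for unknown values are assumptions; Orange's actual behaviour in corner cases may differ.

```python
def check_condition(value, accepted, accept_special=-1):
    """Return True (accept), False (reject) or None (ignore)."""
    if value is None:  # unknown value
        return {1: True, 0: False, -1: None}[accept_special]
    return value in accepted


def values_filter(instance, conditions, conjunction=True):
    """conditions is a list of (feature, accepted values, accept_special)."""
    results = [check_condition(instance[name], accepted, special)
               for name, accepted, special in conditions]
    if conjunction:
        return not any(r is False for r in results)  # nothing rejected
    return any(r is True for r in results)           # something accepted


ages = set(["young", "presbyopic"])
both = [("age", ages, -1), ("astigmatic", set(["yes"]), -1)]
either = [("age", ages, 0), ("astigmatic", set(["yes"]), 1)]

unknown_both = {"age": None, "astigmatic": None}
unknown_age = {"age": None, "astigmatic": "no"}
unknown_astigm = {"age": "pre-presbyopic", "astigmatic": None}

r1 = values_filter(unknown_both, both)                        # accepted
r2 = values_filter(unknown_age, either, conjunction=False)    # rejected
r3 = values_filter(unknown_astigm, either, conjunction=False) # accepted
```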
Composition of filters¶
Filters can be combined into conjunctions or disjunctions using the following descendants of Filter. It is possible to build hierarchies of filters (e.g. a disjunction of conjunctions).
- class Orange.data.filter.FilterConjunction¶
Reject the instance if any of the combined filters rejects it. The conjunction can be negated using the inherited Filter.negate flag.
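A hierarchy such as a disjunction of conjunctions can be sketched with ordinary Python callables. The functions conjunction and disjunction below are hypothetical helpers that take per-instance predicates; they illustrate the combination logic and the effect of a negate flag, and are not the Orange classes themselves.

```python
def conjunction(filters, negate=False):
    """Accept only if every filter accepts; negate inverts the result."""
    def combined(instance):
        return all(f(instance) for f in filters) != negate
    return combined


def disjunction(filters, negate=False):
    """Accept if at least one filter accepts; negate inverts the result."""
    def combined(instance):
        return any(f(instance) for f in filters) != negate
    return combined


# Instances are modelled as dicts for this sketch.
is_young = lambda inst: inst["age"] == "young"
is_astigmatic = lambda inst: inst["astigmatic"] == "yes"
is_presbyopic = lambda inst: inst["age"] == "presbyopic"

# (young AND astigmatic) OR presbyopic
nested = disjunction([conjunction([is_young, is_astigmatic]), is_presbyopic])

first = nested({"age": "young", "astigmatic": "yes"})
second = nested({"age": "young", "astigmatic": "no"})
third = nested({"age": "presbyopic", "astigmatic": "no"})
```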