Selection (selection)

The feature selection module contains several utility functions for selecting features based on their scores, as typically obtained in classification or regression problems. A typical example is the function top_rated, which returns the names of the n highest-scored features:

import Orange
voting = Orange.data.Table("voting")

n = 3
# score all features and report the names of the n best-rated ones
ma = Orange.feature.scoring.score_all(voting)
best = Orange.feature.selection.top_rated(ma, n)
print 'Best %d features:' % n
for s in best:
    print s

The script outputs:

Best 3 features:
physician-fee-freeze
el-salvador-aid
synfuels-corporation-cutback

The module also includes a learner that incorporates feature subset selection.

New in version 2.7.1: select, select_above_threshold and select_relief now preserve the domain’s meta attributes and class_vars.

Functions for feature subset selection

static selection.top_rated(scores, n, highest_best=True)

Return n top-rated features from the list of scores.

Parameters:
  • scores (list) – A list such as the one returned by score_all().
  • n (int) – Number of features to select.
  • highest_best (bool) – If True, features with higher scores are preferred.
Return type: list

static selection.above_threshold(scores, threshold=0.0)

Return features (without scores) whose scores are above or equal to the specified threshold.

Parameters:
  • scores (list) – A list such as the one returned by score_all().
  • threshold (float) – Threshold for selection.
Return type: list
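
For instance, reusing the voting table and the scores ma from the introductory script (the threshold value here is arbitrary):

# keep the names of features scoring at least 0.01
good = Orange.feature.selection.above_threshold(ma, threshold=0.01)
print 'Features with score >= 0.01:'
for name in good:
    print name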

static selection.select(data, scores, n)

Construct and return a new data table that includes the class and only the n best features from the list of scores.

Parameters:
  • data (Orange.data.Table) – A data table.
  • scores (list) – A list such as the one returned by score_all().
  • n (int) – Number of features to select.
Return type: Orange.data.Table
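
A minimal sketch of select, again reusing voting and ma from the introductory script; it builds a table with the three best features and the class:

# construct a reduced table with the three best-scored features
new_data = Orange.feature.selection.select(voting, ma, 3)
print 'Reduced from %d to %d features' % (
    len(voting.domain.attributes), len(new_data.domain.attributes))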

static selection.select_above_threshold(data, scores, threshold=0.0)

Construct and return a new data table that includes the class and the features from the list returned by score_all whose scores are above or equal to the given threshold.

Parameters:
  • data (Orange.data.Table) – A data table.
  • scores (list) – A list such as the one returned by score_all().
  • threshold (float) – Threshold for selection.
Return type: Orange.data.Table
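
Analogously, a sketch with an arbitrary threshold:

# keep the class and all features scoring at least 0.01
new_data = Orange.feature.selection.select_above_threshold(voting, ma, threshold=0.01)
print 'Kept %d features' % len(new_data.domain.attributes)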

static selection.select_relief(data, measure=Orange.feature.scoring.Relief(k=20, m=10), margin=0)

Iteratively remove the worst scored feature until no feature has a score below the margin. The filter procedure was originally designed for measures such as Relief, which are context dependent, i.e., removal of features may change the scores of other remaining features. The score is thus recomputed in each iteration.

Parameters:
  • data (Orange.data.Table) – A data table.
  • measure – A feature scoring measure; defaults to Orange.feature.scoring.Relief(k=20, m=10).
  • margin (float) – Features scored below this margin are removed.
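
A brief sketch on the voting data with the default measure and margin; margin=0 removes all features with non-positive Relief scores:

# iteratively drop the worst-scored feature, rescoring after each removal
new_data = Orange.feature.selection.select_relief(voting)
print 'Retained %d features' % len(new_data.domain.attributes)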

Learning with feature subset selection

class Orange.feature.selection.FilteredLearner(base_learner, filter=FilterAboveThreshold(), name='filtered')

A feature selection wrapper around a base learner. When provided with data, this learner applies the given feature selection method and then calls the base learner.

Here is an example of how to build a wrapper around the naive Bayesian learner and use it on a data set:

import Orange

data = Orange.data.Table("voting")
nb = Orange.classification.bayes.NaiveLearner()
learner = Orange.feature.selection.FilteredLearner(nb,
    filter=Orange.feature.selection.FilterBestN(n=5), name='filtered')
classifier = learner(data)
class Orange.feature.selection.FilteredClassifier(**kwds)

A classifier returned by FilteredLearner.
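
Continuing the example above, the returned classifier is used like any other Orange classifier (a brief sketch):

# predict the class of the first instance in the table
print classifier(data[0])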

Class wrappers for selection functions

class Orange.feature.selection.FilterAboveThreshold(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), threshold=0.0)

A wrapper around select_above_threshold; the constructor stores the parameters of the feature selection procedure, which are then applied when the selection is called with the actual data.

Parameters:
  • data (Orange.data.Table) – A data table; if given, the selection is applied immediately and a new data table is returned (see the examples below).
  • measure – A feature scoring measure; defaults to Orange.feature.scoring.Relief(k=20, m=50).
  • threshold (float) – Threshold for selection.

__call__(data)

Return a data table containing only the features whose scores are above the given threshold.

Parameters: data (Orange.data.Table) – a data table

Below are a few examples of the use of this class:

>>> filter = Orange.feature.selection.FilterAboveThreshold(threshold=.15)
>>> new_data = filter(data)
>>> new_data = Orange.feature.selection.FilterAboveThreshold(data)
>>> new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1)
>>> new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1, \
    measure=Orange.feature.scoring.Gini())
class Orange.feature.selection.FilterBestN(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), n=5)

A wrapper around select; the constructor stores the filter parameters, which are applied when the instance is called with data.

Parameters:
  • data (Orange.data.Table) – A data table; if given, selection is applied immediately.
  • measure – A feature scoring measure; defaults to Orange.feature.scoring.Relief(k=20, m=50).
  • n (int) – Number of features to select.
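
A short usage sketch, mirroring the FilterAboveThreshold examples above (data is the voting table loaded earlier):

>>> filter = Orange.feature.selection.FilterBestN(n=3)
>>> new_data = filter(data)
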
class Orange.feature.selection.FilterRelief(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), margin=0)

A class wrapper around select_relief; the constructor stores the filter parameters, which are applied when the instance is called with data.

Parameters:
  • data (Orange.data.Table) – A data table; if given, selection is applied immediately.
  • measure – A feature scoring measure; defaults to Orange.feature.scoring.Relief(k=20, m=50).
  • margin (float) – Margin for removal (see select_relief).
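
A short usage sketch (the margin value here is arbitrary):

>>> filter = Orange.feature.selection.FilterRelief(margin=0.01)
>>> new_data = filter(data)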

Examples

The following script defines a new naive Bayes learner that selects the five best features from the data set before learning. The new learner is wrapped in a special class (see the Learners in Python lesson in the Orange Tutorial). The script compares this filtered learner with one that uses the complete set of features.

selection-bayes.py

import Orange


class BayesFSS(object):
    def __new__(cls, examples=None, **kwds):
        learner = object.__new__(cls)
        if examples:
            return learner(examples)
        else:
            return learner
    
    def __init__(self, name='Naive Bayes with FSS', N=5):
        self.name = name
        self.N = N

    def __call__(self, table, weight=None):
        # score all features and keep only the N best before learning
        ma = Orange.feature.scoring.score_all(table)
        filtered = Orange.feature.selection.select(table, ma, self.N)
        model = Orange.classification.bayes.NaiveLearner(filtered)
        return BayesFSS_Classifier(classifier=model, N=self.N, name=self.name)

class BayesFSS_Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)
    
    def __call__(self, example, resultType=Orange.classification.Classifier.GetValue):
        return self.classifier(example, resultType)


# test the above wrapper on a data set
voting = Orange.data.Table("voting")
learners = (Orange.classification.bayes.NaiveLearner(name='Naive Bayes'),
            BayesFSS(name="with FSS"))
results = Orange.evaluation.testing.cross_validation(learners, voting)

# output the results
print "Learner      CA"
for i in range(len(learners)):
    print "%-12s %5.3f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i])

Interestingly, though somewhat expectedly, feature subset selection helps. This is the output we get:

Learner      CA
Naive Bayes  0.903
with FSS     0.940

We can do all of the above by wrapping the learner using FilteredLearner, thus creating an object assembled from a data filter and a base learner. When given a data table, this learner uses the attribute filter to construct a new data set and the base learner to construct a corresponding classifier. Attribute filters should be classes like FilterAboveThreshold or FilterBestN that can be initialized with arguments and later called with data, returning a new, reduced data set.

The following code fragment replaces the bulk of the code from the previous example, and compares the naive Bayesian classifier to the same classifier when only the single most important attribute is used.

selection-filtered-learner.py

nb = Orange.classification.bayes.NaiveLearner()
fl = Orange.feature.selection.FilteredLearner(nb,
     filter=Orange.feature.selection.FilterBestN(n=1), name='filtered')
learners = (Orange.classification.bayes.NaiveLearner(name='bayes'), fl)

Now, let's retain three features and observe how many times each attribute is used. Remember, 10-fold cross-validation constructs ten instances of each classifier, and each time FilteredLearner is run a different set of features may be selected. Orange.evaluation.testing.cross_validation stores the classifiers in the results variable, and FilteredLearner returns a classifier that can tell which features it used, so the code to do all this is quite short.

print "\nNumber of times attributes were used in cross-validation:"
attsUsed = {}
for i in range(10):
    for a in results.classifiers[i][1].atts():
        if a.name in attsUsed.keys():
            attsUsed[a.name] += 1
        else:
            attsUsed[a.name] = 1
for k in attsUsed.keys():
    print "%2d x %s" % (attsUsed[k], k)

Running selection-filtered-learner.py, with three features selected on each run of the learner, gives the following result:

Learner      CA
bayes        0.903
filtered     0.956

Number of times features were used in cross-validation:
 3 x el-salvador-aid
 6 x synfuels-corporation-cutback
 7 x adoption-of-the-budget-resolution
10 x physician-fee-freeze
 4 x crime

References

  • K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers.
  • I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine Learning (ECML-94), pages 171-182. Springer-Verlag, 1994.
  • R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), pages 273-324, 1997.