k-nearest neighbors (knn)

The nearest neighbors algorithm is one of the most basic lazy machine learning algorithms. The learner only stores the training data, and the classifier makes predictions based on the training instances most similar to the instance being classified:

import Orange
iris = Orange.data.Table("iris")

knnLearner = Orange.classification.knn.kNNLearner()
knnLearner.k = 10
knnClassifier = knnLearner(iris)
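
The returned classifier is then used like any other Orange classifier: call it with a data instance to get a prediction. A minimal continuation of the snippet above:

# compare the stored class of the first instance with the prediction
print iris[0].getclass(), knnClassifier(iris[0])
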
class Orange.classification.knn.kNNLearner(k, distance_constructor, weight_id)

A lazy learner; it only stores the training instances and leaves the actual work to the classifier it returns. Constructor parameters set the corresponding attributes.

k

Number of nearest neighbors used in classification. If 0 (default), the square root of the number of instances is used.

distance_constructor

Component that constructs the object for measuring distances between instances. Defaults to Euclidean.

weight_id

Id of meta attribute with instance weights.

rank_weight

If True (default), neighbours are weighted according to their order and not their (normalized) distances to the instance that is being classified.

__call__(data)

Return a kNNClassifier. Learning consists of constructing a distance measure and passing it to the classifier along with instances and attributes (k, rank_weight and weight_id).

Parameters: data (Orange.data.Table) – training instances
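
Since constructor parameters set the corresponding attributes, the learner can also be configured in a single call. A short sketch (the Euclidean distance constructor used here also appears later on this page):

import Orange

iris = Orange.data.Table("iris")

knn = Orange.classification.knn.kNNLearner(
    k=5, distance_constructor=Orange.distance.Euclidean())
knnClassifier = knn(iris)
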
class Orange.classification.knn.kNNClassifier(domain, weight_id, k, find_nearest, rank_weight, n_examples)

The classifier returned by kNNLearner.

__call__(instance, return_type)

Parameters:
  • instance (Orange.data.Instance) – the instance to be classified
  • return_type (GetBoth, GetValue, GetProbabilities) – return both the value and the probabilities, only the value, or only the probabilities
Return type:

Value, Distribution or a tuple with both
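
For illustration, a short sketch of calling the classifier with an explicit return_type; it assumes the constants are exposed on Orange.classification.Classifier (GetValue, GetProbabilities, GetBoth), as is usual for Orange classifiers:

import Orange

iris = Orange.data.Table("iris")
knn = Orange.classification.knn.kNNLearner(iris, k=10)

# ask for both the predicted value and the class distribution
value, probabilities = knn(iris[0], Orange.classification.Classifier.GetBoth)
print value, probabilities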

find_nearest

A callable component that finds the nearest k neighbors of the given instance.

Parameters: instance (Orange.data.Instance) – reference instance
Return type: Orange.data.Table
k

Number of neighbors. If set to 0 (which is also the default value), the square root of the number of examples is used.

weight_id

Id of meta attribute with instance weights.

rank_weight

If True (default), neighbours are weighted according to their order and not their (normalized) distances to the instance that is being classified.

n_examples

The number of learning instances, used to compute the number of neighbors if the value of kNNClassifier.k is zero.

When called to classify instance inst, the classifier first calls kNNClassifier.find_nearest(inst) to retrieve a list with kNNClassifier.k nearest neighbors. The component kNNClassifier.find_nearest() has a stored table of training instances together with their weights. If instances are weighted (non-zero weight_id), weights are considered when counting the neighbors.

If kNNClassifier.find_nearest() returns only one neighbor (this is the case if k=1), kNNClassifier returns the neighbor’s class.

Otherwise, the retrieved neighbors vote for the predicted class or the class probabilities. Each vote is a product of two weights: the weight of the training instance, if instance weights are given, and a weight that reflects the distance from inst. Nearer neighbors have a greater impact on the prediction: the distance weight is computed as exp(-t^2 / s^2), where the meaning of t depends on the setting of rank_weight.

  • if rank_weight is False, t is the distance from the instance being classified
  • if rank_weight is True, neighbors are ordered and t is the position of the neighbor on the list (a rank)

In both cases, s is chosen so that the weight of the farthest instance is 0.001.
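
To make the weighting concrete, here is a small stand-alone sketch (plain Python, not Orange internals) that computes the rank-based weights for k = 10 neighbours, with s chosen so that the weight of the farthest neighbour is 0.001:

import math

def rank_weights(k):
    # choose s so that exp(-k^2 / s^2) == 0.001 at rank t = k (the farthest neighbour)
    s = k / math.sqrt(math.log(1000))
    # rank t runs from 1 (nearest) to k (farthest)
    return [math.exp(-t ** 2 / s ** 2) for t in range(1, k + 1)]

print rank_weights(10)   # approximately [0.93, 0.76, 0.54, ..., 0.001]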

Weighting gives the classifier a certain insensitivity to the number of neighbors used, making it possible to use large k's.

The classifier can use continuous and discrete features, and can even distinguish between ordinal and nominal features. See information on distance measuring for details.

Examples

The learner will be tested on the 'iris' data set. The data will be split into training (80%) and testing (20%) instances. The former will be used to "train" the classifier, which is then tested on five instances randomly selected from the testing part (knnlearner.py):

import Orange
iris = Orange.data.Table("iris")

# split the data 80:20 into training and testing instances
rndind = Orange.data.sample.SubsetIndices2(iris, p0=0.8)
train = iris.select(rndind, 0)
test = iris.select(rndind, 1)

# build a 10-nearest-neighbours classifier and try it on five random test instances
knn = Orange.classification.knn.kNNLearner(train, k=10)
for i in range(5):
    instance = test.random_example()
    print instance.getclass(), knn(instance)

The output of this code is:

Iris-setosa Iris-setosa
Iris-versicolor Iris-versicolor
Iris-versicolor Iris-versicolor
Iris-setosa Iris-setosa
Iris-setosa Iris-setosa

The choice of metric usually does not have a great impact on the performance of kNN classifiers, so the default should work well. To change it, distance_constructor must be set to an instance of one of the classes for distance measuring.

knn = Orange.classification.knn.kNNLearner()
knn.k = 10
knn.distance_constructor = Orange.distance.Hamming()
knn = knn(iris)

Finding nearest neighbors

Orange provides classes for finding the nearest neighbors of a given reference instance.

As usual in Orange, there are two classes: one that does the work (FindNearest) and another that constructs the former from data (FindNearestConstructor).

class Orange.classification.knn.FindNearest

Brute force search for nearest neighbors in the stored data table.

distance

An instance of Orange.distance.Distance used for computing distances between data instances.

instances

Stored data table

weight_ID

ID of meta attribute with weight. If present (non-null), the class does not return k instances but a set of instances whose total weight is k.

distance_ID

The id of the meta attribute, added to the returned neighbours, that stores their distances from the reference instance. If zero, the distances are not stored.

__call__(instance, k)

Return a data table with k nearest neighbours of instance. Any ties for the last place(s) are resolved by randomly picking the appropriate number of instances. A local random generator is constructed and seeded by a constant computed from instance, so the same random neighbors are always returned for the same instance.

Parameters:
  • instance (Orange.data.Instance) – reference instance
  • k – number of neighbours to return
Return type:

Orange.data.Table

class Orange.classification.knn.FindNearestConstructor

A class that constructs FindNearest and initializes it with a distance metric, constructed by distance_constructor.

distance_constructor

An instance of Orange.distance.DistanceConstructor that “learns” to measure distances between instances. Learning can mean, for example, storing the ranges of continuous features or the number of values of a discrete feature. The result of learning is an instance of Orange.distance.Distance that is used for measuring distances between instances.

include_same

Indicates whether to include instances that are the same as the reference; default is True.

__call__(data, weight_ID, distance_ID)

Constructs an instance of FindNearest for the given data. Arguments weight_ID and distance_ID are copied to the new object.

Parameters:
  • data (Orange.data.Table) – table of instances
  • weight_ID (int) – id of meta attribute with weights of instances
  • distance_ID (int) – id of meta attribute that will store distances
Return type:

FindNearest
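
To illustrate the split between the constructor and the distance measure it builds, a minimal sketch (assuming, as the descriptions above suggest, that the constructor is called on the data and the resulting Distance on a pair of instances):

import Orange

iris = Orange.data.Table("iris")

constructor = Orange.distance.Euclidean()   # a DistanceConstructor
euclidean = constructor(iris)               # a Distance "learned" from the data
print euclidean(iris[0], iris[1])           # distance between two instances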

Examples

The following script (knnInstanceDistance.py) shows how to find the five nearest neighbors of the first instance in the lenses dataset.

import Orange

lenses = Orange.data.Table("lenses")

nnc = Orange.classification.knn.FindNearestConstructor()
nnc.distance_constructor = Orange.distance.Euclidean()

# a new meta attribute id that will hold each neighbour's distance to the reference
did = Orange.feature.Descriptor.new_meta_id()
nn = nnc(lenses, 0, did)

print "*** Reference instance: ", lenses[0]
for inst in nn(lenses[0], 5):
    print inst
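
Since the constructor was given the meta id did, each returned neighbour should also carry its distance to the reference instance as a meta value. A possible way to inspect it (assuming the meta value can be read by indexing an instance with the meta id, as is usual in Orange):

for inst in nn(lenses[0], 5):
    print inst.getclass(), inst[did]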