Classification trees (tree)

Orange includes multiple implementations of classification tree learners: a very flexible TreeLearner, a fast SimpleTreeLearner, and a C45Learner, which uses the C4.5 tree induction algorithm.

The following code builds a TreeClassifier on the Iris data set (with the depth limited to three levels):

import Orange

iris = Orange.data.Table("iris")
tree = Orange.classification.tree.TreeLearner(iris, max_depth=3)
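
Because data was passed to the constructor, the result is a TreeClassifier rather than a learner (see below), so it can be printed and used for prediction right away. A minimal continuation of the snippet above:

print tree                                  # textual rendering of the induced tree
print tree(iris[0]), iris[0].getclass()     # predicted vs. actual class of the first instance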

See Decision tree learning on Wikipedia for an introduction to classification trees.

Component-based Tree Inducer

class Orange.classification.tree.TreeLearner(**kw)

A classification or regression tree learner. If a set of instances is given on initialization, a TreeClassifier is built and returned instead.

The learning algorithm has a large number of parameters. The class provides reasonable defaults; they can be modified either as attributes or as arguments given to the constructor.

The algorithm is very flexible, but slower than the other two implementations, which are more suitable for large-scale experiments.

The tree induction process

  1. The learning instances are copied, unless store_instances is False and the instances are already stored in a Table.
  2. Apriori class probabilities are computed. A list of candidate features for the split is compiled; in the beginning, all features are candidates.
  3. The recursive part. The contingency matrix is computed by contingency_computer. Contingencies are used by split, stop and splitter.
  4. If the induction should stop, a node_classifier is built by calling node_learner with the given instances, weight ID and the contingency matrix. As the learner uses contingencies whenever possible, the contingency_computer will affect the node_classifier. The node is returned.
  5. If the induction continues, a split is called. If split fails to return a branch selector, induction stops and the Node is returned.
  6. The feature spent (if any) is removed from the candidate list.
  7. Instances are divided into child nodes with splitter. The process recursively continues with step 3 for each of the non-empty subsets. If the splitter returned weights, they are used for each branch.

Attributes

node_learner

Induces a classifier from instances in a node. It is used both for internal nodes and leaves. The default is Orange.classification.majority.MajorityLearner.
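
For example, the default can be replaced so that nodes are labelled by a naive Bayesian classifier instead of the majority class. A sketch, assuming Orange.classification.bayes.NaiveLearner is available:

import Orange

iris = Orange.data.Table("iris")
learner = Orange.classification.tree.TreeLearner()
learner.node_learner = Orange.classification.bayes.NaiveLearner()  # assumed learner; default is MajorityLearner
tree = learner(iris)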

descender

The descender that the induced TreeClassifier will use. The default is Descender_UnknownMergeAsSelector. It votes with the branch_selector's distribution for vote weights.

contingency_computer

Defines the computation of contingency matrices (used by split, stop, splitter). It can be used, for example, to change the treatment of unknown values. By default ordinary contingency matrices are computed.

Split construction

split

A SplitConstructor or a function with the same signature as SplitConstructor.__call__. It is useful for prototyping new tree induction algorithms. If split is defined, other arguments that affect split construction are ignored. These include binarization, measure, worst_acceptable and min_subset. Default: SplitConstructor_Combined with separate constructors for discrete and continuous features. Discrete features are used as they are, while continuous are binarized. Features are scored with gain ratio. At least two instances in a leaf are required for discrete and five for continuous features.

binarization

If 1, SplitConstructor_ExhaustiveBinary is used; if 2, SplitConstructor_OneAgainstOthers; if 0, no binarization is used (SplitConstructor_Feature). Default: 0.
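
For instance, to induce a tree with exhaustive binarization of discrete features (a sketch, using the lenses data from the examples below):

import Orange

lenses = Orange.data.Table("lenses")
tree = Orange.classification.tree.TreeLearner(lenses, binarization=1)
print tree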

measure

A score to evaluate features for splitting instances in a node. A subclass of Orange.feature.scoring.Score (such as InfoGain, GainRatio, Gini, Relief, or MSE). Default: Orange.feature.scoring.GainRatio.

relief_m, relief_k

Set m and k for Relief, if chosen.

splitter

Splitter or a function with the same signature as Splitter.__call__. The default is Splitter_UnknownsAsSelector that splits the learning instances according to distributions given by the selector.

Pruning

worst_acceptable

The lowest required feature score. If the score of the best feature is below this margin, the tree is not grown further (default: 0).

min_subset

The lowest required number of instances in non-null leaves (default: 0).

min_instances

Data subsets with fewer than min_instances instances are not split any further. Therefore, all leaves in the tree will contain at least that many instances (default: 0).

max_depth

Maximal tree depth. If 0, only root is generated. The default is 100.

max_majority

Induction stops when the proportion of majority class in the node exceeds the value set by this parameter (default: 1.0).

stop

StopCriteria or a function with the same signature as StopCriteria.__call__. Useful for prototyping new tree induction algorithms. When used, parameters max_majority and min_instances will not be considered. The default stopping criterion stops induction when all instances in a node belong to the same class.

m_pruning

If non-zero, invokes an error-based bottom-up post-pruning, where m-estimate is used to estimate class probabilities (default: 0).

same_majority_pruning

If true, invokes a bottom-up post-pruning by removing the subtrees of which all leaves classify to the same class (default: False).
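
The pre-pruning and post-pruning parameters can be combined in a single constructor call; a sketch:

import Orange

iris = Orange.data.Table("iris")
tree = Orange.classification.tree.TreeLearner(iris,
    min_instances=5, max_depth=4, m_pruning=2, same_majority_pruning=True)
print tree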

Record keeping

store_distributions
store_contingencies
store_instances
store_node_classifier

Determines whether to store class distributions, contingencies and instances in Node, and whether the Node.node_classifier should be built for internal nodes as well (it is needed by the Descender or for post-pruning). Not storing distributions but storing contingencies does not save any memory, since the distribution actually points to the same distribution that is stored in contingency.classes. By default everything except store_instances is enabled.

__call__(instances, weight=0)

Return a classifier from the given instances.

instance()

DEPRECATED. Return a base learner - an object of _TreeLearner. This method is left for backwards compatibility.

class Orange.classification.tree.TreeClassifier(base_classifier=None)

Classifies instances according to the tree stored in tree.

The classification process

TreeClassifier uses the descender to descend the instance from the root. If the descender returns only a Node and no distribution, the descent stops, as the node was unambiguously selected. The node's node_classifier decides the class.

If the descender returns a Node and a distribution, as it happens, for example, if the instance’s value for the Node‘s feature is unknown, the same process repeats for all subtrees and their predictions are combined.

Attributes

tree

The root of the tree, as a Node.

descender

A Descender used to descend an instance from the root as deeply as possible according to the instance’s feature values.

__call__(instance, result_type=0, *args, **kwdargs)

Classify a new instance.

Parameters:
  • instance (Orange.data.Instance) – instance to be classified.
  • result_type – Orange.classification.Classifier.GetValue, Orange.classification.Classifier.GetProbabilities or Orange.classification.Classifier.GetBoth
Return type:

Orange.data.Value, Orange.statistics.Distribution or a tuple with both

count_leaves()

Return the number of leaves.

count_nodes()

Return the number of nodes.
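
For example, the size of a tree induced from the iris data can be inspected as follows (a sketch; the exact counts depend on the induced tree):

>>> iris = Orange.data.Table("iris")
>>> tree = Orange.classification.tree.TreeLearner(iris)
>>> print tree.count_nodes(), tree.count_leaves()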

dot(file_name, leaf_str="", node_str="", leaf_shape="plaintext", node_shape="plaintext", user_formats=[], min_instances=0, max_depth=10000000000, simple_first=True)

Print the tree to a file in a format used by GraphViz. Uses the same parameters as to_string() plus two which define the shape of internal nodes and leaves of the tree:

Parameters:
  • leaf_shape (string) – Shape of the outline around leaves of the tree. If “plaintext”, no outline is used (default: “plaintext”).
  • node_shape (string) – Shape of the outline around internal nodes of the tree. If “plaintext”, no outline is used (default: “plaintext”)

Check Polygon-based Nodes for various outlines supported by GraphViz.
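
A typical export, continuing the iris example from the introduction, could look like this (a sketch; "box" and "ellipse" are among the GraphViz polygon shapes):

iris = Orange.data.Table("iris")
tree = Orange.classification.tree.TreeLearner(iris, max_depth=3)
tree.dot("tree5.dot", leaf_str="%V (%^.2m%)", leaf_shape="box", node_shape="ellipse")

The resulting file can then be rendered as described in Plotting with Dot below.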

to_string(leaf_str="", node_str="", user_formats=[], min_instances=0, max_depth=10000000000, simple_first=True)

Return a string representation of a tree.

Parameters:
  • leaf_str (string) – The format string for the tree leaves. If left empty, "%V (%^.2m%)" will be used for classification trees and "%V" for regression trees.
  • node_str (string) – The format string for the internal nodes. If left empty (as it is by default), no data is printed out for internal nodes. If set to ".", the same string is used as for leaves.
  • max_depth (integer) – If set, it limits the depth to which the tree is printed out.
  • min_instances (integer) – If set, the subtrees with fewer than the given number of examples are not printed.
  • simple_first (boolean) – If True (default), the branches with a single node are printed before the branches with larger subtrees. If False, the branches are printed in order of appearance.
  • user_formats – A list of regular expressions and callback functions through which the user can print out other specific information in the nodes.

class Orange.classification.tree.Node

Classification trees are a hierarchy of Node classes.

Node stores the instances belonging to the node, a branch selector, a list of branches (if the node is not a leaf) with their descriptions and strengths, and a classifier.

distribution

A distribution of learning instances.

contingency

Complete contingency matrices for the learning instances.

instances, weightID

Learning instances and the ID of the weight meta attribute. The root of the tree actually stores all instances, while other nodes store only references to instances in the root node.

node_classifier

A classifier for instances coming to the node. If the node is a leaf, it chooses the class (or class distribution) of an instance.

Internal nodes have additional attributes. The lists branches, branch_descriptions and branch_sizes are of the same length.

branches

A list of subtrees. Each element is a Node or None. If None, the node is empty.

branch_descriptions

A list with strings describing branches. They are constructed by SplitConstructor. A string can contain anything, for example ‘red’ or ‘>12.3’.

branch_sizes

A (weighted) number of training instances for each branch. It can be used, for instance, for modeling probabilities when classifying instances with unknown values.

branch_selector

A Classifier that returns a branch for each instance (as Orange.data.Value in [0, len(branches)-1]). When an instance cannot be classified unambiguously, the selector can return a discrete distribution, which proposes how to divide the instance between the branches. Whether the proposition will be used depends upon the Splitter (for learning) or Descender (for classification).

tree_size()

Return the number of nodes in the subtrees (including the node, excluding null-nodes).

Examples

Tree Structure

This example works with the lenses data set:

>>> import Orange
>>> lenses = Orange.data.Table("lenses")
>>> tree_classifier = Orange.classification.tree.TreeLearner(lenses)

The following function counts the number of nodes in a tree:

>>> def tree_size(node):
...    if not node:
...        return 0
...    size = 1
...    if node.branch_selector:
...        for branch in node.branches:
...            size += tree_size(branch)
...    return size

If node is None, the function above returns 0. Otherwise, the size is 1 (this node) plus the sizes of all subtrees. The algorithm needs to check whether a node is internal (it has a branch_selector), as leaves don’t have the branches attribute.

>>> tree_size(tree_classifier.tree)
15

Note that a Node already has a built-in method tree_size().

Trees can be printed with a simple recursive function:

>>> def print_tree0(node, level):
...     if not node:
...         print " "*level + "<null node>"
...         return
...     if node.branch_selector:
...         node_desc = node.branch_selector.class_var.name
...         node_cont = node.distribution
...         print "\n" + "   "*level + "%s (%s)" % (node_desc, node_cont),
...         for i in range(len(node.branches)):
...             print "\n" + "   "*level + ": %s" % node.branch_descriptions[i],
...             print_tree0(node.branches[i], level+1)
...     else:
...         node_cont = node.distribution
...         major_class = node.node_classifier.default_value
...         print "--> %s (%s) " % (major_class, node_cont),

The crux of the example is not in the formatting (\n’s etc.); what matters is everything but the print statements. The code separately handles three node types:

  • For null nodes (a node to which no learning instances were classified), it just prints “<null node>”.
  • For internal nodes, it prints a node description: the feature’s name and the distribution of classes. The node’s branch_selector is a Classifier, and its class_var is the feature whose name is printed. Class distributions are printed as well (they are assumed to be stored). print_tree0 is recursively called for each branch with the level increased by 1 to increase the indent.
  • If the node is a leaf, it prints the distribution of learning instances in the node and the class to which the instances in the node would be classified. We assume that the node_classifier is a DefaultClassifier. A better print function should be aware of possible alternatives.

The wrapper function that accepts either a TreeClassifier or a Node can be written as follows:

>>> def print_tree(x):
...     if isinstance(x, Orange.classification.tree.TreeClassifier):
...         print_tree0(x.tree, 0)
...     elif isinstance(x, Orange.classification.tree.Node):
...         print_tree0(x, 0)
...     else:
...         raise TypeError, "invalid parameter"

It’s straightforward: if x is a TreeClassifier, it prints x.tree; if it’s a Node, it prints x. If it’s of some other type, an exception is raised. The output:

>>> print_tree(tree_classifier)

tear_rate (<15.000, 4.000, 5.000>)
: normal
   astigmatic (<3.000, 4.000, 5.000>)
   : no
      age (<1.000, 0.000, 5.000>)
      : pre-presbyopic --> soft (<0.000, 0.000, 2.000>)
      : presbyopic
         prescription (<1.000, 0.000, 1.000>)
         : hypermetrope --> soft (<0.000, 0.000, 1.000>)
         : myope --> none (<1.000, 0.000, 0.000>)
      : young --> soft (<0.000, 0.000, 2.000>)
   : yes
      prescription (<2.000, 4.000, 0.000>)
      : hypermetrope
         age (<2.000, 1.000, 0.000>)
         : pre-presbyopic --> none (<1.000, 0.000, 0.000>)
         : presbyopic --> none (<1.000, 0.000, 0.000>)
         : young --> hard (<0.000, 1.000, 0.000>)
      : myope --> hard (<0.000, 3.000, 0.000>)
: reduced --> none (<12.000, 0.000, 0.000>)

The tree structure examples conclude with a simple pruning function, written entirely in Python and unrelated to any Pruner. It limits the tree depth (the number of internal nodes on any path down the tree). For example, to get a two-level tree, call cut_tree(root, 2). The function is recursive, with the second argument (level) decreasing at each call; when zero, the current node will be made a leaf:

>>> def cut_tree(node, level):
...     if node and node.branch_selector:
...         if level:
...             for branch in node.branches:
...                 cut_tree(branch, level-1)
...         else:
...             node.branch_selector = None
...             node.branches = None
...             node.branch_descriptions = None

The function acts only when node and node.branch_selector are defined. If the level is not zero, it recursively calls itself for each branch. Otherwise, it clears the selector, branches and branch descriptions.

>>> cut_tree(tree_classifier.tree, 2)
>>> print_tree(tree_classifier)

tear_rate (<15.000, 4.000, 5.000>)
: normal
   astigmatic (<3.000, 4.000, 5.000>)
   : no --> soft (<1.000, 0.000, 5.000>)
   : yes --> hard (<2.000, 4.000, 0.000>)
: reduced --> none (<12.000, 0.000, 0.000>)

Setting learning parameters

Let us construct a TreeLearner to play with:

>>> import Orange
>>> lenses = Orange.data.Table("lenses")
>>> learner = Orange.classification.tree.TreeLearner()

There are three crucial components in learning: the split and stop criteria, and the example splitter. The default stop is set with:

>>> learner.stop = Orange.classification.tree.StopCriteria_common()

The default stopping parameters are:

>>> print learner.stop.max_majority, learner.stop.min_examples
1.0 0.0

The defaults only stop splitting when no instances are left or all of them are in the same class.

If the minimal subset that is allowed to be split further is set to five instances, the resulting tree is smaller.

>>> learner.stop.min_examples = 5.0
>>> tree = learner(lenses)
>>> print tree
tear_rate=reduced: none (100.00%)
tear_rate=normal
|    astigmatic=no
|    |    age=pre-presbyopic: soft (100.00%)
|    |    age=presbyopic: none (50.00%)
|    |    age=young: soft (100.00%)
|    astigmatic=yes
|    |    prescription=hypermetrope: none (66.67%)
|    |    prescription=myope: hard (100.00%)

We can also limit the maximal proportion of majority class.

>>> learner.stop.max_majority = 0.5
>>> tree = learner(lenses)
>>> print tree
none (62.50%)

Redefining tree induction components

This example shows how to use a custom stop function. First, def_stop holds the default stopping criterion. The first tree adds some randomness: the induction also stops in 20% of the cases in which def_stop returns False. The stopping criterion for the second tree is completely random: it stops induction in 20% of cases. Note that in the second case the lambda function still has three parameters, although it does not need any, because that is the signature required for stop.

import Orange
from random import randint, seed
seed(0)

iris = Orange.data.Table("iris.tab")

print "SOME RANDOMNESS IN STOPING:"
def_stop = Orange.classification.tree.StopCriteria()
f = lambda examples, weightID, contingency: def_stop(examples, weightID, contingency) or randint(1, 5) == 1
l = Orange.classification.tree.TreeLearner(iris, stop=f)
print l

print "\nRANDOM STOPING:"
f = lambda x,y,z: randint(1, 5)==1
l = Orange.classification.tree.TreeLearner(iris, stop=f)
print l

Learner and Classifier Components

Split constructors

class Orange.classification.tree.SplitConstructor

Decide how to divide learning instances, i.e. define the branching criteria.

The SplitConstructor should use the domain contingency when possible, both for speed and adaptability. Sometimes domain contingency does not suffice, for example if ReliefF score is used.

A SplitConstructor can veto further tree induction by returning no classifier. This is generally related to the number of learning instances that would go in each branch. If there are no splits with more than SplitConstructor.min_subset instances in the branches (null nodes are allowed), the induction is stopped.

Split constructors that cannot handle a particular feature type (discrete, continuous) quietly skip them. When in doubt, use SplitConstructor_Combined, which delegates features to specialized split constructors.

The same split constructors can be used both for classification and regression, if the chosen score (for SplitConstructor_Score and derived classes) supports both.

min_subset

The minimal (weighted) number of instances in non-null leaves.

__call__(data[, weightID, contingency, apriori_distribution, candidates, clsfr])
Parameters:
  • data – in any acceptable form.
  • weightID – Optional; the default of 0 means that all instances have a weight of 1.0.
  • contingency – a domain contingency
  • apriori_distribution (Orange.statistics.distribution.Distribution) – apriori class probabilities.
  • candidates – only consider these features (one boolean for each feature).
  • clsfr – a node classifier (if it was constructed, that is, if store_node_classifier is True)

Construct a split. Return a tuple (branch_selector, branch_descriptions (a list), subset_sizes (the number of instances for each branch, may also be empty), quality (higher numbers mean better splits), spent_feature). If no split is constructed, the selector, branch_descriptions and subset_sizes are None, while quality is 0.0 and spent_feature is -1.

If the chosen feature will be useless in the future and should not be considered for splitting in any of the subtrees (typically, when discrete features are used as-they-are, without any binarization or subsetting), then it should return the index of this feature as spent_feature. If no features are spent, spent_feature is -1.

class Orange.classification.tree.SplitConstructor_Score

Bases: SplitConstructor

An abstract base class that compares splits with an Orange.feature.scoring.Score. All split constructors except SplitConstructor_Combined are derived from this class.

measure

An Orange.feature.scoring.Score for split evaluation. It has to handle the class type - for example, you cannot use GainRatio for regression or MSE for classification.

worst_acceptable

The lowest allowed split quality. The value strongly depends on the chosen measure. Default is 0.0.

class Orange.classification.tree.SplitConstructor_Feature

Bases: SplitConstructor_Score

Each value of a discrete feature corresponds to a branch. The feature with the highest score (measure) is selected. When tied, a random feature is selected.

The constructed branch_selector is an instance of orange.ClassifierFromVarFD, that returns a value of the selected feature. branch_description contains the feature’s values. The feature is marked as spent (it cannot reappear in the node’s subtrees).

class Orange.classification.tree.SplitConstructor_ExhaustiveBinary

Bases: SplitConstructor_Score

Finds the binarization with the highest score among all features. In case of ties, a random feature is selected.

The constructed branch_selector is an instance of orange.ClassifierFromVarFD that returns a value of the selected feature. Its transformer contains a MapIntValue that maps values of the feature into a binary feature. Branches with a single feature value are described with that value and branches with more than one are described with [<val1>, <val2>, ..., <valn>]. Only binary features are marked as spent.

class Orange.classification.tree.SplitConstructor_Threshold

Bases: SplitConstructor_Score

The only split constructor for continuous features. It divides the range of feature values with a threshold that maximizes the split’s quality. In case of ties, a random feature is selected. The feature that yields the best binary split is returned.

The constructed branch_selector is an instance of orange.ClassifierFromVarFD with an attached transformer, of type Orange.feature.discretization.ThresholdDiscretizer. The branch descriptions are “<threshold” and “>=threshold”. The feature is not spent.

class Orange.classification.tree.SplitConstructor_OneAgainstOthers

Bases: SplitConstructor_Score

Undocumented.

class Orange.classification.tree.SplitConstructor_Combined

Bases: SplitConstructor

Uses different split constructors for discrete and continuous features. Each split constructor is called with appropriate features. Both construct a candidate for a split; the better of them is used.

The choice of the split is not probabilistically fair, when multiple candidates have the same score. For example, if there are nine discrete features with the highest score the split constructor for discrete features will select one of them. Now, if there is also a single continuous feature with the same score, SplitConstructor_Combined would randomly select between the proposed discrete feature and continuous feature, unaware that the discrete feature has already competed with eight others. So, the probability for selecting (each) discrete feature would be 1/18 instead of 1/10. Although incorrect, this should not affect the performance.
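
A sketch of assembling such a constructor by hand; the attribute names discrete_split_constructor and continuous_split_constructor for the two component constructors are assumed here:

import Orange

split = Orange.classification.tree.SplitConstructor_Combined()
# assumed attribute names for the two component constructors
split.discrete_split_constructor = Orange.classification.tree.SplitConstructor_Feature()
split.continuous_split_constructor = Orange.classification.tree.SplitConstructor_Threshold()
split.discrete_split_constructor.measure = Orange.feature.scoring.InfoGain()

learner = Orange.classification.tree.TreeLearner(split=split)
tree = learner(Orange.data.Table("iris"))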

StopCriteria and StopCriteria_common

StopCriteria determines when to stop the induction of subtrees.

class Orange.classification.tree.StopCriteria

Provides the basic stopping criteria: the tree induction stops when there is at most one instance left (the actual, not weighted, number). The induction also stops when all instances are in the same class (for discrete problems) or have the same outcome value (for regression problems).

__call__(instances[, weightID, domain contingencies])

Return True (stop) or False (continue the induction). Contingencies are not used for counting. Derived classes should use the contingencies whenever possible.

class Orange.classification.tree.StopCriteria_common

Pre-pruning with additional criteria.

max_majority

Maximum proportion of majority class. When exceeded, induction stops.

min_instances

Minimum number of instances for splitting. Subsets with fewer than min_instances instances are not split further. The instance count is weighted.

Splitters

Splitters sort learning instances into branches (the branches are selected with a SplitConstructor, while a Descender decides the branch for an instance during classification).

Most splitters call Node.branch_selector and assign instances correspondingly. When the value is unknown they choose a particular branch or skip the instance.

Some splitters can also split instances: a weighted instance is used in more than one subset. Each branch has a weight ID (usually its own) and all instances in that branch should have this meta attribute.

An instance that hasn’t been split has only one additional attribute (the weight ID corresponding to the subset to which it went). An instance that is split between, say, three subsets has three new meta attributes, one for each subset. The weights are used only when needed; when there is no splitting, no weight IDs are returned.

class Orange.classification.tree.Splitter

An abstract base class that splits instances into subsets.

__call__(node, instances[, weightID])
Parameters:
  • node (Node) – a node.
  • instances – a set of instances
  • weightID – weight ID.

Use the information in Node (particularly the branch_selector) to split the given set of instances into subsets. Return a tuple with a list of instance subsets and a list of weights. The list of weights is either a list of integers or None when no weights are added.

class Orange.classification.tree.Splitter_IgnoreUnknowns

Bases: Splitter

Ignores the instances for which no single branch can be determined.

class Orange.classification.tree.Splitter_UnknownsToCommon

Bases: Splitter

Places all ambiguous instances into the branch with the highest number of instances. If there is more than one such branch, one is selected at random and then used for all instances.

class Orange.classification.tree.Splitter_UnknownsToAll

Bases: Splitter

Splits instances with an unknown value of the feature into all branches.

class Orange.classification.tree.Splitter_UnknownsToRandom

Bases: Splitter

Selects a random branch for ambiguous instances.

class Orange.classification.tree.Splitter_UnknownsToBranch

Bases: Splitter

Constructs an additional branch for ambiguous instances. The branch’s description is “unknown”.

class Orange.classification.tree.Splitter_UnknownsAsBranchSizes

Bases: Splitter

Splits instances with an unknown value of the feature according to the proportions of instances in each branch.

class Orange.classification.tree.Splitter_UnknownsAsSelector

Bases: Splitter

Splits instances with an unknown value of the feature according to the distribution proposed by the selector (usually the same as the proportions of instances in branches).

Descenders

Descenders decide where instances that cannot be unambiguously put into a single branch should go during classification (the branches are selected with a SplitConstructor, while a Splitter sorts instances during learning).

class Orange.classification.tree.Descender

An abstract base tree descender. It descends an instance as deep as possible, according to the values of the instance’s features. The Descender calls the node’s branch_selector to get the branch index. If it is a simple index, the corresponding branch is followed. If not, the descender decides what to do. A descender can choose a single branch (for instance, the one most recommended by the branch_selector) or let the branches vote.

There are three possible outcomes of a descent:

  1. The descender reaches a leaf. This happens when there were no unknown or out-of-range values, or when the descender selected a single branch and continued the descent despite them. The descender returns the Node it has reached.
  2. Node’s branch_selector returned a distribution and the Descender decided to stop the descent at this (internal) node. It returns the current Node.
  3. Node’s branch_selector returned a distribution and the descender decided to split the instance (i.e., to decide the class by voting). It returns a Node and the vote-weights for the branches. The weights can correspond, for example, to the distribution returned by the node’s branch_selector, or to the number of learning instances that were assigned to each branch.

__call__(node, instance)

Descends until it reaches a leaf or a node in which a vote of subtrees is required. A tuple of two elements is returned. If it reached a leaf, the tuple contains the leaf node and None. If not, it contains a node and a list of floats (weights of votes).

class Orange.classification.tree.Descender_UnknownToNode

Bases: Descender

When an instance cannot be classified into a single branch, the current node is returned. Thus, the node’s node_classifier will be used to make a decision. Internal nodes therefore need to have Node.node_classifier defined.
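
A sketch of switching an induced classifier to this descender (node classifiers are stored for internal nodes by default):

import Orange

iris = Orange.data.Table("iris")
tree = Orange.classification.tree.TreeLearner(iris, store_node_classifier=True)
tree.descender = Orange.classification.tree.Descender_UnknownToNode()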

class Orange.classification.tree.Descender_UnknownToBranch

Bases: Descender

Classifies instances with unknown value to a special branch. This makes sense only if the tree itself was constructed with Splitter_UnknownsToBranch.

class Orange.classification.tree.Descender_UnknownToCommonBranch

Bases: Descender

Classifies instances with unknown values to the branch with the highest number of instances. If there is more than one such branch, a random branch is chosen for each instance.

class Orange.classification.tree.Descender_UnknownToCommonSelector

Bases: Descender

Classifies instances with unknown values to the branch which received the highest recommendation by the selector.

class Orange.classification.tree.Descender_UnknownMergeAsBranchSizes

Bases: Descender

The subtrees vote for the instance’s class; the vote is weighted according to the sizes of the branches.

class Orange.classification.tree.Descender_UnknownMergeAsSelector

Bases: Descender

The subtrees vote for the instance’s class; the vote is weighted according to the selector’s proposal.

Pruning

The pruners construct a shallow copy of a tree. The pruned tree’s Nodes contain references to the same contingency matrices, node classifiers, branch selectors, ... as the original tree.

Pruners cannot construct a new node_classifier. Thus, for pruning, internal nodes must have node_classifier defined (the default).

class Orange.classification.tree.Pruner

An abstract base tree pruner.

__call__(tree)
Parameters: tree – either a Node or the C++ version of the classifier (saved in TreeClassifier.base_classifier).

The resulting pruned tree is of the same type as the argument. The original tree remains intact.

class Orange.classification.tree.Pruner_SameMajority

Bases: Pruner

A tree can have subtrees in which all leaves have the same majority class. This is allowed because leaves can still have different class distributions and thus predict different probabilities. Pruner_SameMajority prunes the tree so that there is no subtree in which all the nodes would have the same majority class.

This pruner will only prune the nodes in which the node classifier is a ConstantClassifier (or a derived class).

The pruning works from the leaves to the root. If siblings have (at least one) common majority class, they can be pruned.

class Orange.classification.tree.Pruner_m

Bases: Pruner

Prunes a tree by comparing m-estimates of static and dynamic error as defined in (Bratko, 2002).

m

Parameter m for m-estimation.
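
A sketch of post-pruning an induced tree with m-estimate pruning; the pruner returns a pruned copy and leaves the original intact:

import Orange

lenses = Orange.data.Table("lenses")
tree = Orange.classification.tree.TreeLearner(lenses)

pruner = Orange.classification.tree.Pruner_m()
pruner.m = 2.0
pruned_root = pruner(tree.tree)   # a Node is returned; the original tree is unchanged
print pruned_root.tree_size()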

Printing the tree

The tree printing functions are very flexible. They can print, for example, numbers of instances, proportions of majority class in nodes and similar, or more complex statistics like the proportion of instances in a particular class divided by the proportion of instances of this class in a parent node. Users may also pass their own functions to print certain elements.

The easiest way to print the tree is to print the TreeClassifier:

>>> print tree
petal width<0.800: Iris-setosa (100.00%)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: Iris-versicolor (94.23%)
|    |    petal length>=5.350: Iris-virginica (100.00%)
|    petal width>=1.750
|    |    petal length<4.850: Iris-virginica (66.67%)
|    |    petal length>=4.850: Iris-virginica (100.00%)

Format string

Format strings are printed at every leaf or internal node, with certain format specifiers replaced by data from the tree node. Specifiers are generally of the form %[^]<precision><quantity><divisor>.

^ at the start indicates that the number should be multiplied by 100, which is useful for proportions like percentages.

<precision> is in the same format as in Python (or C) string formatting. For instance, %N denotes the number of instances in the node, hence %6.2N would mean output to two decimal digits and six places altogether. If left out, a default format 5.3 is used, unless the numbers are multiplied by 100, in which case the default is .0 (no decimals, the number is rounded to the nearest integer).

<divisor> tells what to divide the quantity in that node with. bP means division by the same quantity in the parent node; for instance, %NbP gives the number of instances in the node divided by the number of instances in parent node. Precision formatting can be added, e.g. %6.2NbP. bA denotes division by the same quantity over the entire data set, so %NbA will tell you the proportion of instances (out of the entire training data set) that fell into that node. If division is impossible since the parent node does not exist or some data is missing, a dot is printed out instead.

<quantity> defines what to print and is the only required element. It can be:

V
The predicted value at that node. Precision or divisor can not be defined here.
N
The number of instances in the node.
M
The number of instances in the majority class (that is, the class predicted by the node).
m
The proportion of instances in the majority class.
A
The average class for instances in the node; this is available only for regression trees.
E
Standard error for class of instances in the node; available only for regression trees.
I
Print out the confidence interval. The modifier is used as %I(95) or (more complicated) %5.3I(95)bP.
C

The number of instances in the given class. For a classification example, %5.3C="Iris-virginica"bP denotes the number of instances of Iris-virginica divided by the number of instances of this class in the parent node (instances that are not Iris-virginica could be described with %5.3CbP!="Iris-virginica").

For regression trees, use operators =, !=, <, <=, >, and >=, as in %C<22, with optional precision and divisor. Intervals are also possible: %C[20, 22] gives the number of instances between 20 and 22 (inclusive) and %C(20, 22) gives the number of such instances excluding the boundaries. Mixing of parentheses is allowed, e.g. %C(20, 22]. Add ! for instances outside the interval, like %C!(20, 22].

c
Same as C, except that it computes the proportion of the class instead of the number of instances.
D
The number of instances in each class. Precision and the divisor are applied to each number in the distribution. This quantity can not be computed for regression trees.
d
Same as D, except that it shows proportions of instances.
<user defined formats>
Instructions and examples of added formats are at the end of this section.

Examples on classification trees

A tree on the iris data set with the depth limited to three levels is built as follows:

import Orange

iris = Orange.data.Table("iris")
tree = Orange.classification.tree.TreeLearner(iris, max_depth=3)

Printing the predicted class at each node, the number of instances in the majority class with the total number of instances in the node requires a custom format string:

>>> print tree.to_string(leaf_str="%V (%M out of %N)")
petal width<0.800: Iris-setosa (50.000 out of 50.000)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
|    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
|    petal width>=1.750
|    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
|    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)

The number of instances as compared to the entire data set and to the parent node:

>>> print tree.to_string(leaf_str="%V (%^MbA%, %^MbP%)")
petal width<0.800: Iris-setosa (100%, 100%)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: Iris-versicolor (98%, 100%)
|    |    petal length>=5.350: Iris-virginica (4%, 40%)
|    petal width>=1.750
|    |    petal length<4.850: Iris-virginica (4%, 4%)
|    |    petal length>=4.850: Iris-virginica (86%, 96%)

%M is the number of instances in the majority class. Dividing by the number of all instances from this class in the entire data set is described with %MbA. Add ^ in front for multiplication by 100. The percent sign after that is printed out literally, just as the comma and parentheses. For the proportion of this class in the parent, the bA is replaced with bP.

To print the number of versicolors in each node, together with the proportion of versicolors among the instances in this particular node and among all versicolors, use the following:

'%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'

It gives:

petal width<0.800: 0.000 (0% of node, 0% of versicolors)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
|    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
|    petal width>=1.750
|    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
|    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)

Finally, to print the distributions using a format string %D:
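
The corresponding call, analogous to the examples above, would be along these lines:

>>> print tree.to_string(leaf_str="%D")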

petal width<0.800: [50.000, 0.000, 0.000]
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: [0.000, 49.000, 3.000]
|    |    petal length>=5.350: [0.000, 0.000, 2.000]
|    petal width>=1.750
|    |    petal length<4.850: [0.000, 1.000, 2.000]
|    |    petal length>=4.850: [0.000, 0.000, 43.000]

As the order of classes is the same as in data.domain.class_var.values (setosa, versicolor, virginica), there are 49 versicolors and 3 virginicae in the node at petal length<5.350. To print the proportions within nodes rounded to two decimals use %.2d:

petal width<0.800: [1.00, 0.00, 0.00]
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: [0.00, 0.94, 0.06]
|    |    petal length>=5.350: [0.00, 0.00, 1.00]
|    petal width>=1.750
|    |    petal length<4.850: [0.00, 0.33, 0.67]
|    |    petal length>=4.850: [0.00, 0.00, 1.00]

The simplest format string for internal nodes prints node predictions. The "." in the following example specifies that node_str should be the same as leaf_str.

tree.to_string(leaf_str="%V", node_str=".")

The output:

root: Iris-setosa
|    petal width<0.800: Iris-setosa
|    petal width>=0.800: Iris-versicolor
|    |    petal width<1.750: Iris-versicolor
|    |    |    petal length<5.350: Iris-versicolor
|    |    |    petal length>=5.350: Iris-virginica
|    |    petal width>=1.750: Iris-virginica
|    |    |    petal length<4.850: Iris-virginica
|    |    |    petal length>=4.850: Iris-virginica

A root node has appeared and the tree looks one level deeper. This is needed to also print the data for the tree root.

To observe how the number of virginicas decreases down the tree try:

print tree.to_string(leaf_str='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', node_str='.')

Interpretation: CbA="Iris-virginica" is the number of instances of virginica, divided by the total number of instances in this class. Add ^.1 and the result will be multiplied by 100 and printed with one decimal. The trailing % is printed out literally. In parentheses, the same quantity is divided by the instances in the parent node. Single quotes were used for the format string, so that double quotes inside the string can specify the class.

root: 100.0% (.%)
|    petal width<0.800: 0.0% (0.0%)
|    petal width>=0.800: 100.0% (100.0%)
|    |    petal width<1.750: 10.0% (10.0%)
|    |    |    petal length<5.350: 6.0% (60.0%)
|    |    |    petal length>=5.350: 4.0% (40.0%)
|    |    petal width>=1.750: 90.0% (90.0%)
|    |    |    petal length<4.850: 4.0% (4.4%)
|    |    |    petal length>=4.850: 86.0% (95.6%)

If to_string() cannot compute something, in this case because the root has no parent, it prints out a dot.

The final example with classification trees prints the distributions in nodes, the distribution compared to the parent, the proportions compared to the parent and the predicted class in the leaves:

>>> print tree.to_string(leaf_str='"%V   %D %.2DbP %.2dbP"', node_str='"%D %.2DbP %.2dbP"')
root: [50.000, 50.000, 50.000] . .
|    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
|        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
|    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
|    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
|    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
|    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
|    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
|    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
|    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
|    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
|    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
|    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
|    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]

Examples on regression trees

The regression tree examples use a tree induced from the housing data set. Without other arguments, TreeClassifier.to_string() prints the following:

RM<6.941
|    LSTAT<14.400
|    |    DIS<1.385: 45.6
|    |    DIS>=1.385: 22.9
|    LSTAT>=14.400
|    |    CRIM<6.992: 17.1
|    |    CRIM>=6.992: 12.0
RM>=6.941
|    RM<7.437
|    |    CRIM<7.393: 33.3
|    |    CRIM>=7.393: 14.4
|    RM>=7.437
|    |    TAX<534.500: 45.9
|    |    TAX>=534.500: 21.9

To add the standard error in both internal nodes and leaves, and the 90% confidence intervals in the leaves, use:

>>> print tree.to_string(leaf_str="[SE: %E]\t %V %I(90)", node_str="[SE: %E]")
root: [SE: 0.409]
|    RM<6.941: [SE: 0.306]
|    |    LSTAT<14.400: [SE: 0.320]
|    |    |    DIS<1.385: [SE: 4.420]:
|    |    |        [SE: 4.420]   45.6 [38.331-52.829]
|    |    |    DIS>=1.385: [SE: 0.244]:
|    |    |        [SE: 0.244]   22.9 [22.504-23.306]
|    |    LSTAT>=14.400: [SE: 0.333]
|    |    |    CRIM<6.992: [SE: 0.338]:
|    |    |        [SE: 0.338]   17.1 [16.584-17.691]
|    |    |    CRIM>=6.992: [SE: 0.448]:
|    |    |        [SE: 0.448]   12.0 [11.243-12.714]
|    RM>=6.941: [SE: 1.031]
|    |    RM<7.437: [SE: 0.958]
|    |    |    CRIM<7.393: [SE: 0.692]:
|    |    |        [SE: 0.692]   33.3 [32.214-34.484]
|    |    |    CRIM>=7.393: [SE: 2.157]:
|    |    |        [SE: 2.157]   14.4 [10.862-17.938]
|    |    RM>=7.437: [SE: 1.124]
|    |    |    TAX<534.500: [SE: 0.817]:
|    |    |        [SE: 0.817]   45.9 [44.556-47.237]
|    |    |    TAX>=534.500: [SE: 0.000]:
|    |    |        [SE: 0.000]   21.9 [21.900-21.900]

The predicted value (%V) and the average (%A) may differ because a regression tree does not always predict the leaf average, but whatever the node_classifier in a leaf returns. As %V uses the printing function of Orange.feature.Continuous, the number has the same number of decimals as in the data file.

Regression trees cannot print the distributions in the same way as classification trees. They instead offer a set of operators for observing the number of instances within a certain range. For instance, to print the number of instances with values below 22 and compare it with values in the parent nodes use:

>>> print tree.to_string(leaf_str="%C<22 (%cbP<22)", node_str=".")
root: 277.000 (.)
|    RM<6.941: 273.000 (1.160)
|    |    LSTAT<14.400: 107.000 (0.661)
|    |    |    DIS<1.385: 0.000 (0.000)
|    |    |    DIS>=1.385: 107.000 (1.020)
|    |    LSTAT>=14.400: 166.000 (1.494)
|    |    |    CRIM<6.992: 93.000 (0.971)
|    |    |    CRIM>=6.992: 73.000 (1.040)
|    RM>=6.941: 4.000 (0.096)
|    |    RM<7.437: 3.000 (1.239)
|    |    |    CRIM<7.393: 0.000 (0.000)
|    |    |    CRIM>=7.393: 3.000 (15.333)
|    |    RM>=7.437: 1.000 (0.633)
|    |    |    TAX<534.500: 0.000 (0.000)
|    |    |    TAX>=534.500: 1.000 (30.000)

The last line, for instance, says that among the instances with tax above 534, the number of instances with the class value below 22 is 30 times higher than the number of such instances in the parent node.

To count the same for all instances outside the interval [20, 22] and print out the proportions as percentages, use:

>>> print tree.to_string(leaf_str="%C![20,22] (%^cbP![20,22]%)", node_str=".")

The format string %c![20, 22] denotes the proportion of instances (within the node) whose values are below 20 or above 22. %cbP![20, 22] divides this by the same statistic computed on the parent. A ^ is added to get percentages.

root: 439.000 (.%)
|    RM<6.941: 364.000 (98%)
|    |    LSTAT<14.400: 200.000 (93%)
|    |    |    DIS<1.385: 5.000 (127%)
|    |    |    DIS>=1.385: 195.000 (99%)
|    |    LSTAT>=14.400: 164.000 (111%)
|    |    |    CRIM<6.992: 91.000 (96%)
|    |    |    CRIM>=6.992: 73.000 (105%)
|    RM>=6.941: 75.000 (114%)
|    |    RM<7.437: 46.000 (101%)
|    |    |    CRIM<7.393: 43.000 (100%)
|    |    |    CRIM>=7.393: 3.000 (100%)
|    |    RM>=7.437: 29.000 (98%)
|    |    |    TAX<534.500: 29.000 (103%)
|    |    |    TAX>=534.500: 0.000 (0%)

Defining custom printouts

TreeClassifier.to_string()'s argument user_formats can be used to print other information. user_formats should contain a list of tuples with a regular expression and a function to be called when that expression is found in the format string. Expressions from user_formats are checked before the built-in expressions discussed above.

The regular expression should describe a string like those used above, for instance %.2DbP. When a leaf or internal node is printed, the format string (leaf_str or node_str) is checked for these regular expressions and when a match is found, the corresponding callback function is called.

The passed function will get five arguments: the format string (leaf_str or node_str), the match object, the node which is being printed, its parent (can be None) and the tree classifier. The function should return the format string in which the part described by the match object (that is, the part that is matched by the regular expression) is replaced by whatever information your callback function is supposed to give.

The function can use several utility functions provided in the module.

Orange.classification.tree.insert_str(s, mo, sub)

Replace the part of s which is covered by mo with the string sub.

Orange.classification.tree.insert_dot(s, mo)

Replace the part of s which is covered by mo with a dot. You should use this when the function cannot compute the desired quantity; it is called, for instance, when it needs to divide by something in the parent, but the parent doesn’t exist.

Orange.classification.tree.insert_num(s, mo, n)

Replace the part of s matched by mo with the number n, formatted as specified by the user: it multiplies the number by 100, if needed, and prints it with the right number of places and decimals. It does so by checking mo for a group named m100 (representing the ^ in the format string) and a group named fs representing the part giving the number of decimals (e.g. 5.3).

Orange.classification.tree.by_whom(by, parent, tree)

If by equals bP, return parent, else return tree.tree. This is used to find what to divide the quantity with, when division is required.

The module also includes reusable regular expressions:

Orange.classification.tree.fs = '(?P<m100>\\^?)(?P<fs>(\\d*\\.?\\d*)?)'

Defines the multiplier by 100 (^) and the format for the number of decimals (e.g. 5.3). The corresponding groups are named m100 and fs.

Orange.classification.tree.by = '(?P<by>(b(P|A)))?'

Defines bP or bA or nothing; the result is in groups by.

For a trivial example, %V is implemented with the following tuple:

(re.compile("%V"), replaceV)

And replaceV is defined by:

def replaceV(strg, mo, node, parent, tree):
    return insert_str(strg, mo, str(node.node_classifier.default_value))

replaceV takes the value predicted at the node (node.node_classifier.default_value ), converts it to a string and passes it to insert_str().

A more complex regular expression is the one for the proportion of majority class, defined as "%"+fs+"M"+by. It uses the two partial expressions defined above (fs and by).

The following code prints the classification margin for each node, that is, the difference between the proportion of the largest and the second largest class in the node:

def get_margin(dist):
    if dist.abs < 1e-30:
        return 0
    l = list(dist)
    l.sort()
    return (l[-1] - l[-2]) / dist.abs

def replaceB(strg, mo, node, parent, tree):
    margin = get_margin(node.distribution)

    by = mo.group("by")
    if margin and by:
        whom = Orange.classification.tree.by_whom(by, parent, tree)
        if whom and whom.distribution:
            div_margin = get_margin(whom.distribution)
            if div_margin > 1e-30:
                margin /= div_margin
            else:
                return Orange.classification.tree.insert_dot(strg, mo)
        else:
            return Orange.classification.tree.insert_dot(strg, mo)
    return Orange.classification.tree.insert_num(strg, mo, margin)

my_format = [(re.compile("%" + Orange.classification.tree.fs
    + "B" + Orange.classification.tree.by), replaceB)]

get_margin computes the margin from the distribution. The replacing function, replaceB, computes the margin for the node. If the by group is present, we call by_whom() to get the node by whose margin this node’s margin is to be divided. If this node (usually the parent) does not exist or if its margin is zero, insert_dot() inserts a dot; otherwise insert_num() is called, which inserts the number in the user-specified format. my_format contains the regular expression and the callback function.

Printing the tree with

print tree.to_string(leaf_str="%V %^B% (%^3.2BbP%)", user_formats=my_format)

yields:

petal width<0.800: Iris-setosa 100% (100.00%)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: Iris-versicolor 88% (108.57%)
|    |    petal length>=5.350: Iris-virginica 100% (122.73%)
|    petal width>=1.750
|    |    petal length<4.850: Iris-virginica 33% (34.85%)
|    |    petal length>=4.850: Iris-virginica 100% (104.55%)

Plotting with Dot

To produce images of trees, first create a .dot file with TreeClassifier.dot(). If it was saved to “tree5.dot”, plot a gif with the following command:

dot -Tgif tree5.dot -otree5.gif

Check GraphViz’s dot documentation for more options and output formats.

C4.5 Tree Inducer

C4.5 is, as a standard benchmark in machine learning, incorporated in Orange. The implementation uses the original C4.5 code, so the resulting tree is exactly like the one that would be built by the standalone C4.5; Orange only makes the induced tree accessible from Python.

C45Learner and C45Classifier behave like any other Orange learner and classifier. Unlike most Orange learning algorithms, C4.5 does not accept weighted instances.

Building the C4.5 plug-in

Due to copyright restrictions, C4.5 is not distributed with Orange, but it can be added as a plug-in. A C compiler is needed for the procedure: on Windows MS Visual C (CL.EXE and LINK.EXE must be on the PATH), on Linux and OS X gcc (OS X users can download it from Apple).

Orange must be installed prior to building C4.5.

  1. Download the C4.5 (Release 8) sources from RuleQuest’s site and extract them. The files will be modified in the process.
  2. Download ensemble.c and buildC45.py into the directory R8/Src of the C4.5 sources (this directory contains, for instance, the file average.c).
  3. Run buildC45.py, which will build the plug-in and put it next to orange.pyd (or orange.so on Linux/Mac).
  4. Run python, type import Orange and create Orange.classification.tree.C45Learner(). This should succeed.
  5. Finally, you can remove C4.5 sources.

The buildC45.py creates .h files that wrap Quinlan’s .i files and ensure that they are not included twice. It modifies C4.5 sources to include .h’s instead of .i’s (this step can hardly fail). Then it compiles ensemble.c into c45.dll or c45.so and puts it next to Orange.

class Orange.classification.tree.C45Learner(**kwargs)

C45Learner's attributes have double names - those that you know from the C4.5 command line and the corresponding names of C4.5's internal variables. All defaults are set as in C4.5; if you change nothing, you are running C4.5.

Constructs a C45Classifier when given data.

gain_ratio(g)

Determines whether to use information gain (false, default) or gain ratio for selection of attributes (true).

batch(b)

Turn on batch mode (no windows, no iterations); this option is not documented in C4.5 manuals. It conflicts with “window”, “increment” and “trials”.

subset(s)

Enables subsetting (default: false, no subsetting).

prob_thresh(p)

Probabilistic threshold for continuous attributes (default: false).

min_objs(m)

Minimal number of objects (instances) in leaves (default: 2).

window(w)

Initial window size (default: maximum of 20% and twice the square root of the number of data objects).

increment(i)

The maximum number of objects that can be added to the window at each iteration (default: 20% of the initial window size).

cf(c)

Pruning confidence level (default: 25%).

trials(t)

Set the number of trials in iterative (i.e. non-batch) mode (default: 10).

prune

Return a pruned tree (not an original C4.5 option; default: true).

commandline(ln)

Set the arguments with a C4.5 command line.

class Orange.classification.tree.C45Classifier(base_classifier)

A faithful reimplementation of Quinlan’s C4.5, but uses a tree composed of C45Node instead of C4.5’s original tree structure.

tree

C4.5 tree stored as C45Node.

__call__(instance, result_type=0, *args, **kwdargs)

Classify a new instance.

Parameters:
  • instance (Orange.data.Instance) – instance to be classified.
  • result_type – Orange.classification.Classifier.GetValue, Orange.classification.Classifier.GetProbabilities or Orange.classification.Classifier.GetBoth
Return type:

Orange.data.Value, Orange.statistics.Distribution or a tuple with both

to_string()

Print the tree in the same form as Ross Quinlan’s C4.5 program.

import Orange

data = Orange.data.Table("voting")
c45 = Orange.classification.tree.C45Learner(data)
print c45

prints

physician-fee-freeze = n: democrat (253.4)
physician-fee-freeze = y:
|   synfuels-corporation-cutback = n: republican (145.7)
|   synfuels-corporation-cutback = y:
|   |   mx-missile = y: democrat (6.0)
|   |   mx-missile = n:
|   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
|   |   |   adoption-of-the-budget-resolution = y:
|   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
|   |   |   |   anti-satellite-test-ban = y: republican (2.2)

The standalone C4.5 would print:

physician-fee-freeze = n: democrat (253.4/5.9)
physician-fee-freeze = y:
|   synfuels-corporation-cutback = n: republican (145.7/6.2)
|   synfuels-corporation-cutback = y:
|   |   mx-missile = y: democrat (6.0/2.4)
|   |   mx-missile = n:
|   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
|   |   |   adoption-of-the-budget-resolution = y:
|   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
|   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)

C4.5 also prints out the number of errors on learning data in each node.

class Orange.classification.tree.C45Node

This class is a reimplementation of the corresponding struct from Quinlan’s C4.5 code.

node_type

Type of the node: C45Node.Leaf (0), C45Node.Branch (1), C45Node.Cut (2), C45Node.Subset (3). “Leaves” are leaves, “branches” split instances based on values of a discrete attribute, “cuts” cut them according to a threshold value of a continuous attribute, and “subsets” use discrete attributes but with subsetting, so that several values can go into the same branch.

leaf

Value returned by that leaf. The field is defined for internal nodes as well.

items

Number of (learning) instances in the node.

class_dist

Class distribution for the node (of type Orange.statistics.distribution.Discrete).

tested

The attribute used in the node’s test. If the node is a leaf, tested is None; if the node is of type Branch or Subset, tested is a discrete attribute; and if the node is of type Cut, tested is a continuous attribute.

cut

A threshold for continuous attributes, if node is of type Cut. Undefined otherwise.

mapping

Mapping for nodes of type Subset. Element mapping[i] gives the index of the branch for an instance whose value of tested is i. Here, i denotes an index of the value, not an Orange.data.Value.

branch

A list of branches stemming from this node.

Examples

This script constructs the same learner as you would get by calling the usual C4.5:

import Orange

iris = Orange.data.Table("iris")
tree = Orange.classification.tree.C45Learner(iris)

print "\n\nC4.5 with default arguments"
for i in iris[:5]:
    print tree(i), i.getclass()

Both C4.5 command-line symbols and variable names can be used. The following lines produce the same result:

tree = Orange.classification.tree.C45Learner(data, m=100)
tree = Orange.classification.tree.C45Learner(data, min_objs=100)

A veteran C4.5 might prefer C45Learner.commandline():

lrn = Orange.classification.tree.C45Learner()
lrn.commandline("-m 1 -s")
tree = lrn(data)

The following script prints out the tree in the same format as C4.5 does.

def printTree0(node, classvar, lev):
    var = node.tested

    if node.node_type == 0:
        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),

    elif node.node_type == 1:
        for i, val in enumerate(var.values):
            print ("\n"+"|    "*lev + "%s = %s:") % (var.name, val),
            printTree0(node.branch[i], classvar, lev+1)

    elif node.node_type == 2:
        print ("\n"+"|    "*lev + "%s <= %.1f:") % (var.name, node.cut),
        printTree0(node.branch[0], classvar, lev+1)
        print ("\n"+"|    "*lev + "%s > %.1f:") % (var.name, node.cut),
        printTree0(node.branch[1], classvar, lev+1)

    elif node.node_type == 3:
        for i, branch in enumerate(node.branch):
            inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
            inset = [var.values[j[0]] for j in inset]
            if len(inset)==1:
                print ("\n"+"|    "*lev + "%s = %s:") % (var.name, inset[0]),
            else:
                print ("\n"+"|    "*lev + "%s in {%s}:") % (var.name, 
                    reduce(lambda x,y:x+", "+y, inset)),
            printTree0(branch, classvar, lev+1)

def printTree(tree):
    printTree0(tree.tree, tree.class_var, 0)
    print

For the leaves just the value in node.leaf is printed. Since C45Node does not know to which attribute the value refers, we need to convert it to a string through classvar, which is passed as an extra argument to the recursive part of printTree.

For discrete splits without subsetting, we print out all attribute values and recursively call the function for all branches. Continuous splits are equally easy to handle.

For discrete splits with subsetting, we iterate through the branches, retrieve into inset the values that go into each branch, turn the values into strings and print them out, treating separately the case when only a single value goes into the branch.

Simple Tree Inducer

SimpleTreeLearner is an implementation of regression and classification trees. It is faster than TreeLearner at the expense of flexibility. It uses gain ratio for classification and MSE for regression.

SimpleTreeLearner was developed for speeding up the construction of random forests, but can also be used as a standalone tree learner.

class Orange.classification.tree.SimpleTreeLearner
max_majority

Maximal proportion of majority class. When this is exceeded, induction stops (only used for classification).

min_instances

Minimal number of instances in leaves. The instance count is weighted.

max_depth

Maximal depth of tree.

skip_prob

At every split an attribute will be skipped with probability skip_prob. Useful for building random forests.

random_generator

Provide your own Orange.misc.Random.
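
A minimal standalone use, analogous to the TreeLearner examples above (a sketch; the random-forest example below shows the more typical use):

import Orange

iris = Orange.data.Table("iris")
simple = Orange.classification.tree.SimpleTreeLearner(max_depth=3, min_instances=5)
model = simple(iris)
print model(iris[0]), iris[0].getclass()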

Examples

SimpleTreeLearner is used in much the same way as TreeLearner. A typical example of using SimpleTreeLearner would be to build a random forest:

import Orange
import time

class SimpleTreeLearnerSetProb():
    """
    Orange.classification.tree.SimpleTreeLearner which sets the skip_prob
    so that on average a square root of the attributes will be 
    randomly chosen for each split.
    """
    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __call__(self, examples, weight=0):
        self.wrapped.skip_prob = 1-len(examples.domain.attributes)**0.5/len(examples.domain.attributes)
        return self.wrapped(examples)

#ordinary random forests
tree = Orange.classification.tree.TreeLearner(min_instances=5, measure="gainRatio")
rf_def = Orange.ensemble.forest.RandomForestLearner(trees=50, base_learner=tree, name="for_gain")

#random forests with simple trees - simple trees do random attribute selection by themselves
st = Orange.classification.tree.SimpleTreeLearner(min_instances=5)
stp = SimpleTreeLearnerSetProb(st)
rf_simple = Orange.ensemble.forest.RandomForestLearner(learner=stp, trees=50, name="for_simp")

learners = [ rf_def, rf_simple ]

iris = Orange.data.Table("iris")
results = Orange.evaluation.testing.cross_validation(learners, iris, folds=3)
print "Learner  CA     Brier  AUC"
for i in range(len(learners)):
    print "%-8s %5.3f  %5.3f  %5.3f" % (learners[i].name, \
    Orange.evaluation.scoring.CA(results)[i], 
    Orange.evaluation.scoring.Brier_score(results)[i],
    Orange.evaluation.scoring.AUC(results)[i])

print 

print "Runtimes:"
for l in learners:
    t = time.time()
    l(iris)
    print l.name, time.time() - t

References

Bratko, I. (2002). Prolog Programming for Artificial Intelligence, Addison Wesley, 2002.

E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories, Murray Hill NJ, U.S.A., October 1993.

Graphviz - open source graph drawing software. The home page of AT&T’s dot and similar software packages.
