Regression trees (tree)

Regression trees share their implementation with Orange.classification.tree.TreeLearner, but use a different set of functions to evaluate node splitting and stopping criteria. Using regression trees is straightforward, as demonstrated in the following example (regression-tree-run.py):

import Orange

# induce a regression tree with default settings and print it out
servo = Orange.data.Table("servo.tab")
tree = Orange.regression.tree.TreeLearner(servo)
print tree
class Orange.regression.tree.TreeLearner(**kw)

A classification or regression tree learner. If a set of instances is given on initialization, a TreeClassifier is built and returned instead.

The learning algorithm has a large number of parameters. The class provides reasonable defaults; they can be modified either as attributes or as arguments given to the constructor.

The algorithm is very flexible, yet slower than the other two implementations, which are better suited to large-scale experiments.

The tree induction process

  1. The learning instances are copied, unless store_instances is False and the instances are already stored in a Table.
  2. A priori class probabilities are computed. A list of candidate features for the split is compiled; in the beginning, all features are candidates.
  3. The recursive part. The contingency matrix is computed by contingency_computer. Contingencies are used by split, stop and splitter.
  4. If the induction should stop, a node_classifier is built by calling node_learner with the given instances, weight ID and the contingency matrix. As the learner uses contingencies whenever possible, the contingency_computer will affect the node_classifier. The node is returned.
  5. If the induction continues, split is called. If split fails to return a branch selector, induction stops and the Node is returned.
  6. The feature spent (if any) is removed from the candidate list.
  7. Instances are divided into child nodes with splitter. The process recursively continues with step 3 for each of the non-empty subsets. If the splitter returned weights, they are used for each branch.

Attributes

node_learner

Induces a classifier from instances in a node. It is used both for internal nodes and leaves. The default is Orange.classification.majority.MajorityLearner.
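
Since any learner can be plugged in here, a different model can be fitted in the nodes. A minimal sketch, using Orange's k-nearest-neighbours learner as a hypothetical node model (the choice of learner and of k is illustrative only):

import Orange

servo = Orange.data.Table("servo.tab")
# hypothetical node model: a k-NN learner instead of the default
# MajorityLearner
learner = Orange.regression.tree.TreeLearner(
    node_learner=Orange.classification.knn.kNNLearner(k=5))
tree = learner(servo)
print tree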

descender

The descender that the induced TreeClassifier will use. The default is Descender_UnknownMergeAsSelector. It votes with the branch_selector’s distribution for vote weights.

contingency_computer

Defines the computation of contingency matrices (used by split, stop, splitter). It can be used, for example, to change the treatment of unknown values. By default ordinary contingency matrices are computed.

Split construction

split

A SplitConstructor or a function with the same signature as SplitConstructor.__call__. It is useful for prototyping new tree induction algorithms. If split is defined, other arguments that affect split construction are ignored: binarization, measure, worst_acceptable and min_subset. Default: SplitConstructor_Combined with separate constructors for discrete and continuous features. Discrete features are used as they are, while continuous features are binarized. Features are scored with gain ratio. At least two instances in a leaf are required for discrete features and five for continuous features.

binarization

If 1, SplitConstructor_ExhaustiveBinary is used; if 2, SplitConstructor_OneAgainstOthers; if 0, no binarization (SplitConstructor_Feature). Default: 0.

measure

A score to evaluate features for splitting instances in a node. A subclass of Orange.feature.scoring.Score (perhaps InfoGain, GainRatio, Gini, Relief, or MSE). Default: Orange.feature.scoring.GainRatio.

relief_m, relief_k

Set the m and k parameters for Relief, if it is chosen as the measure.
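
Since binarization, measure and the related options all act on split construction, they can be combined on the learner. A minimal sketch, scoring splits with the MSE score listed under measure above (assuming it is instantiable as shown) and turning on exhaustive binarization:

import Orange

servo = Orange.data.Table("servo.tab")
# assumption: Orange.feature.scoring.MSE() constructs the MSE score
learner = Orange.regression.tree.TreeLearner(
    measure=Orange.feature.scoring.MSE(),
    binarization=1)
tree = learner(servo)
print tree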

splitter

Splitter or a function with the same signature as Splitter.__call__. The default is Splitter_UnknownsAsSelector that splits the learning instances according to distributions given by the selector.

Pruning

worst_acceptable

The lowest required feature score. If the score of the best feature is below this margin, the tree is not grown further (default: 0).

min_subset

The lowest required number of instances in non-null leaves (default: 0).

min_instances

Data subsets with less than min_instances instances are not split any further. Therefore, all leaves in the tree will contain at least that many instances (default: 0).

max_depth

Maximal tree depth. If 0, only the root node is generated. The default is 100.

max_majority

Induction stops when the proportion of majority class in the node exceeds the value set by this parameter (default: 1.0).

stop

StopCriteria or a function with the same signature as StopCriteria.__call__. Useful for prototyping new tree induction algorithms. When used, parameters max_majority and min_instances will not be considered. The default stopping criterion stops induction when all instances in a node belong to the same class.
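
A custom criterion can therefore be prototyped as a plain function. A minimal sketch, assuming the (instances, weight_id, contingency) call signature of StopCriteria.__call__ and a hypothetical size threshold:

import Orange

def stop(instances, weight, contingency):
    # assumed signature; stop expanding a node once it holds fewer
    # than 20 learning instances
    return len(instances) < 20

servo = Orange.data.Table("servo.tab")
tree = Orange.regression.tree.TreeLearner(servo, stop=stop)
print tree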

m_pruning

If non-zero, invokes an error-based bottom-up post-pruning, where m-estimate is used to estimate class probabilities (default: 0).

same_majority_pruning

If true, invokes a bottom-up post-pruning by removing the subtrees of which all leaves classify to the same class (default: False).
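
Pre- and post-pruning options can be freely combined. A minimal sketch with hypothetical settings:

import Orange

servo = Orange.data.Table("servo.tab")
# hypothetical settings: at least ten instances per node, depth limited
# to five, and m-estimate post-pruning with m=2
learner = Orange.regression.tree.TreeLearner(
    min_instances=10, max_depth=5, m_pruning=2)
tree = learner(servo)
print tree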

Record keeping

store_distributions
store_contingencies
store_instances
store_node_classifier

Determines whether to store class distributions, contingencies and instances in Node, and whether the Node.node_classifier should be built for internal nodes as well (it is needed by the Descender or for post-pruning). Not storing distributions while storing contingencies does not save any memory, since the distribution actually points to the same object that is stored in contingency.classes. By default, everything except store_instances is enabled.
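
Keeping the learning instances in the nodes allows inspecting them after induction. A minimal sketch, assuming the stored instances are exposed through the root Node's instances attribute:

import Orange

servo = Orange.data.Table("servo.tab")
learner = Orange.regression.tree.TreeLearner(store_instances=True)
tree = learner(servo)
print len(tree.tree.instances)  # assumed attribute with the stored instances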

__call__(instances, weight=0)

Return a classifier from the given instances.

instance()

DEPRECATED. Returns the base learner, an object of _TreeLearner. This method is kept for backwards compatibility.

class Orange.regression.tree.TreeClassifier(base_classifier=None)

Classifies instances according to the tree stored in tree.

The classification process

TreeClassifier uses the descender to descend the instance from the root. If the descender returns only a Node and no distribution, the descent stops, as the node was unambiguously selected. The node’s node_classifier decides the class.

If the descender returns a Node and a distribution, as happens, for example, when the instance’s value for the Node’s feature is unknown, the same process is repeated for all subtrees and their predictions are combined.
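
A minimal sketch of this behaviour, assuming that assigning "?" to a feature marks its value as unknown:

import Orange

servo = Orange.data.Table("servo.tab")
tree = Orange.regression.tree.TreeLearner(servo)
instance = Orange.data.Instance(servo[0])  # copy the first instance
instance[0] = "?"  # assumed way of marking the first feature as unknown
print tree(instance)  # the predictions of all subtrees are combined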

Attributes

tree

The root of the tree, as a Node.

descender

A Descender used to descend an instance from the root as deeply as possible according to the instance’s feature values.

__call__(instance, result_type=0, *args, **kwdargs)

Classify a new instance.

Parameters:
  • instance (Orange.data.Instance) – instance to be classified.
  • result_type – Orange.classification.Classifier.GetValue, Orange.classification.Classifier.GetProbabilities, or Orange.classification.Classifier.GetBoth
Return type:

Orange.data.Value, Orange.statistics.Distribution or a tuple with both
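
For example, continuing the opening example:

import Orange

servo = Orange.data.Table("servo.tab")
tree = Orange.regression.tree.TreeLearner(servo)
instance = servo[0]
print tree(instance)  # an Orange.data.Value
print tree(instance, Orange.classification.Classifier.GetBoth)  # (value, distribution)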

count_leaves()

Return the number of leaves.

count_nodes()

Return the number of nodes.

dot(file_name, leaf_str="", node_str="", leaf_shape="plaintext", node_shape="plaintext", user_formats=[], min_instances=0, max_depth=10000000000, simple_first=True)

Print the tree to a file in a format used by GraphViz. Uses the same parameters as to_string() plus two which define the shape of internal nodes and leaves of the tree:

Parameters:
  • leaf_shape (string) – Shape of the outline around leaves of the tree. If “plaintext”, no outline is used (default: “plaintext”).
  • node_shape (string) – Shape of the outline around internal nodes of the tree. If “plaintext”, no outline is used (default: “plaintext”)

See Polygon-based Nodes in the GraphViz documentation for the various outlines it supports.

to_string(leaf_str="", node_str="", user_formats=[], min_instances=0, max_depth=10000000000, simple_first=True)

Return a string representation of a tree.

Parameters:
  • leaf_str (string) – The format string for the tree leaves. If left empty, "%V (%^.2m%)" will be used for classification trees and "%V" for regression trees.
  • node_str (string) – The format string for the internal nodes. If left empty (as it is by default), no data is printed out for internal nodes. If set to ".", the same string is used as for leaves.
  • max_depth (integer) – If set, it limits the depth to which the tree is printed out.
  • min_instances (integer) – If set, subtrees with fewer than the given number of instances are not printed.
  • simple_first (boolean) – If True (default), the branches with a single node are printed before the branches with larger subtrees. If False, the branches are printed in order of appearance.
  • user_formats – A list of regular expressions and callback functions through which the user can print out other specific information in the nodes.
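
For example, using the documented format strings:

import Orange

servo = Orange.data.Table("servo.tab")
tree = Orange.regression.tree.TreeLearner(servo)
# "%V" prints the predicted value; "." reuses the leaf format for
# internal nodes
print tree.to_string(leaf_str="%V", node_str=".")
tree.dot("tree.dot", leaf_shape="box")  # "box" is a standard GraphViz shape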

SimpleTreeLearner

SimpleTreeLearner is an implementation of regression and classification trees. It is faster than TreeLearner at the expense of flexibility. It uses gain ratio for classification and MSE for regression.

SimpleTreeLearner was developed for speeding up the construction of random forests, but can also be used as a standalone tree learner.

class Orange.regression.tree.SimpleTreeLearner
max_majority

Maximal proportion of majority class. When this is exceeded, induction stops (only used for classification).

min_instances

Minimal number of instances in leaves. The instance count is weighted.

max_depth

Maximal depth of tree.

skip_prob

At every split an attribute will be skipped with probability skip_prob. Useful for building random forests.

random_generator

Provide your own Orange.misc.Random.

A basic example of using a SimpleTreeLearner is shown below:

import Orange

housing = Orange.data.Table("housing.tab")
learner = Orange.regression.tree.SimpleTreeLearner
# estimate performance by cross-validation and report the mean squared error
res = Orange.evaluation.testing.cross_validation([learner], housing)
print Orange.evaluation.scoring.MSE(res)[0]
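
When building trees for a random forest, skip_prob randomizes the choice of split attributes. A minimal sketch with hypothetical settings:

import Orange

housing = Orange.data.Table("housing.tab")
# hypothetical settings: skip each candidate attribute with probability
# 0.5 at every split and keep at least five instances per leaf
learner = Orange.regression.tree.SimpleTreeLearner(min_instances=5, skip_prob=0.5)
tree = learner(housing)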