Logistic regression (logreg)¶
Logistic regression is a statistical classification method that fits data to a logistic function. Orange provides various enhancements of the method, such as stepwise selection of variables and handling of constant variables and singularities.
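Concretely, a fitted logistic regression model is a coefficient vector plugged into the logistic function; the prediction step can be sketched in a few lines of plain Python (an illustrative sketch with our own function names, not Orange's API):

```python
import math

def logistic(z):
    # the logistic (sigmoid) function maps any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(beta, x):
    # a fitted model is just a coefficient vector; the predicted class
    # probability is P(y=1 | x) = logistic(beta0 + beta1*x1 + ... + betak*xk)
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return logistic(z)
```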
- class Orange.classification.logreg.LogRegLearner(remove_singular=0, fitter=None, **kwds)¶
Logistic regression learner.
Returns either a learning algorithm (instance of LogRegLearner) or, if data is provided, a fitted model (instance of LogRegClassifier).
Parameters: - data (Orange.data.Table) – data table; it may contain discrete and continuous features
- weight_id (int) – the ID of the weight meta attribute
- remove_singular (bool) – automated removal of constant features and singularities (default: False)
- fitter – the fitting algorithm (default: LogRegFitter_Cholesky)
- stepwise_lr (bool) – enables stepwise feature selection (default: False)
- add_crit (float) – threshold for adding a feature in stepwise selection (default: 0.2)
- delete_crit (float) – threshold for removing a feature in stepwise selection (default: 0.3)
- num_features (int) – maximum number of features in stepwise selection (default: -1, no limit)
Return type: LogRegLearner or LogRegClassifier
- __call__(data, weight=0)¶
Fit a model to the given data.
Parameters: - data (Orange.data.Table) – training data
- weight (int) – the ID of the weight meta attribute
Return type: LogRegClassifier
- class Orange.classification.logreg.LogRegClassifier¶
A logistic regression classification model. Stores estimated values of regression coefficients and their significances, and uses them to predict classes and class probabilities.
- beta¶
Estimated regression coefficients.
- beta_se¶
Estimated standard errors for regression coefficients.
- wald_Z¶
Wald Z statistics for beta coefficients. Wald Z is computed as beta/beta_se.
- P¶
List of P-values for the beta coefficients, i.e. the significance of the hypothesis that a coefficient differs from 0. P-values are computed from the squared Wald Z statistic, which follows a chi-squared distribution with one degree of freedom.
- likelihood¶
The likelihood of the sample (i.e. the learning data) given the fitted model.
- fit_status¶
Tells how the model fitting ended: regularly (LogRegFitter.OK), interrupted because one of the beta coefficients escaped towards infinity (LogRegFitter.Infinity), or because the values did not converge (LogRegFitter.Divergence).
Although the model is functional in all cases, it is recommended to inspect the coefficients if the fitting did not end normally.
- __call__(instance, result_type)¶
Classify a new instance.
Parameters: - instance (Instance) – instance to be classified.
- result_type – GetValue or GetProbabilities or GetBoth
Return type: Value, Distribution or a tuple with both
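The relation between beta, beta_se, wald_Z and P described above can be reproduced with standard-library Python. This is only an illustration of the statistics, not Orange code; the survival function of the chi-squared distribution with one degree of freedom is expressed through math.erfc:

```python
import math

def wald_z(beta, beta_se):
    # Wald Z statistic for a single coefficient
    return beta / beta_se

def p_value(z):
    # the squared Wald Z follows a chi-squared distribution with one
    # degree of freedom; its survival function is erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(z * z / 2.0))

# e.g. the sex=female coefficient from the Titanic example below:
z = wald_z(2.42, 0.14)   # roughly 17.3
p = p_value(z)           # vanishingly small
```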
- class Orange.classification.logreg.LogRegFitter¶
LogRegFitter is the abstract base class for logistic fitters. Fitters can be called with a data table and return a vector of coefficients and the corresponding statistics, or a status signifying an error. The possible statuses are
- OK¶
Optimization converged
- Infinity¶
Optimization failed due to one or more beta coefficients escaping towards infinity.
- Divergence¶
Beta coefficients failed to converge, but without any of beta coefficients escaping toward infinity.
- Constant¶
The data is singular due to a constant variable.
- Singularity¶
The data is singular.
- __call__(data, weight_id)¶
Fit the model and return a tuple with the fitted values and the corresponding statistics or an error indicator. The two cases differ by the tuple length and the status (the first tuple element).
- (status, beta, beta_se, likelihood)
- Fitting succeeded. The first element, status, is either OK, Infinity or Divergence. In the latter two cases, the returned values may still be useful for making predictions, but it is recommended to inspect the coefficients and their errors before deciding whether to use the model.
- (status, variable)
- The fitter failed due to the indicated variable. status is either Constant or Singularity.
The proper way of calling the fitter is to handle both scenarios:
res = fitter(examples)
if res[0] in [fitter.OK, fitter.Infinity, fitter.Divergence]:
    status, beta, beta_se, likelihood = res
    # <proceed by doing something with what you got>
else:
    status, attr = res
    # <remove the attribute or complain to the user or ...>
- class Orange.classification.logreg.LogRegFitter_Cholesky¶
The sole fitter available at the moment. This is a C++ translation of Alan Miller’s logistic regression code, which uses the Newton-Raphson algorithm (iteratively reweighted least squares) to fit the model to the training data.
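For illustration, the Newton-Raphson iteration that such a fitter performs can be sketched in pure Python for a model with an intercept and a single feature. This is an educational sketch, not the actual C++ implementation; on a singular Hessian it simply stops, cf. the Singularity status above:

```python
import math

def sigmoid(z):
    # numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logreg(xs, ys, iterations=25):
    # Newton-Raphson fitting of P(y=1 | x) = sigmoid(b0 + b1*x)
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # gradient of the log-likelihood and Fisher information entries
        g0 = g1 = 0.0
        h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0:
            break  # singular Hessian, cf. the Singularity status
        # Newton step: beta <- beta + H^-1 * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

On separable data the coefficients grow without bound from one iteration to the next, which is exactly the situation the Infinity status reports.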
- class Orange.classification.logreg.StepWiseFSS(add_crit=0.2, delete_crit=0.3, num_features=-1, **kwds)¶
Bases: object
A learning algorithm for logistic regression that implements a stepwise feature subset selection as described in Applied Logistic Regression (Hosmer and Lemeshow, 2000).
Each step of the algorithm is composed of two parts. The first is backward elimination, in which the least significant variable in the model is removed if its p-value is above the prescribed threshold delete_crit. The second is forward selection, in which all variables are tested for addition to the model, and the one with the most significant contribution is added if the corresponding p-value is smaller than the prescribed add_crit. The algorithm stops when no more variables can be added or removed.
The model can be additionally constrained by setting num_features to a non-negative value. The algorithm will then stop when the number of variables exceeds the given limit.
Significances are assessed by the likelihood ratio chi-square test. The usual F-test is not appropriate since the errors are assumed to follow a binomial distribution.
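The likelihood ratio test used for these significance checks can be illustrated with a short helper. G = -2 ln(L_reduced / L_full) follows a chi-squared distribution with one degree of freedom when the two nested models differ by a single variable (a sketch; lr_test_pvalue is our name, not part of Orange):

```python
import math

def lr_test_pvalue(loglik_reduced, loglik_full):
    # likelihood ratio statistic: G = -2 * ln(L_reduced / L_full)
    g = -2.0 * (loglik_reduced - loglik_full)
    # under the null hypothesis G is chi-squared with one degree of
    # freedom; its survival function is erfc(sqrt(g / 2))
    return math.erfc(math.sqrt(g / 2.0))
```

A variable would be added by forward selection when this p-value falls below add_crit, and dropped by backward elimination when it rises above delete_crit.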
The class constructor returns an instance of the learning algorithm or, if given training data, a list of selected variables.
Parameters: - table (Orange.data.Table) – training data.
- add_crit (float) – threshold for adding a variable (default: 0.2)
- delete_crit (float) – threshold for removing a variable (default: 0.3); should be higher than add_crit.
- num_features (int) – maximum number of selected features; use -1 for no limit.
Return type: StepWiseFSS or list of features
- Orange.classification.logreg.dump(classifier)¶
Return a formatted string describing the logistic regression model
Parameters: classifier – logistic regression classifier.
- class Orange.classification.logreg.LibLinearLogRegLearner(solver_type=L2R_LR, C=1, eps=0.01, normalization=True, bias=-1, multinomial_treatment=NValues, **kwargs)¶
A logistic regression learner from LIBLINEAR.
Supports L2 regularized learning.
Note
Unlike LogRegLearner this one supports multi-class classification using one vs. rest strategy.
- __init__(solver_type=L2R_LR, C=1, eps=0.01, normalization=True, bias=-1, multinomial_treatment=NValues, **kwargs)¶
Parameters: - solver_type – One of the following class constants: L2_LR, L2_LR_DUAL, L1R_LR.
- C (float) – Regularization parameter (default 1.0). Higher values of C mean less regularization (C is a coefficient for the loss function).
- eps (float) – Stopping criteria (default 0.01)
- normalization (bool) – Normalize the input data prior to learning (default True)
- bias (float) – If positive, use it as a bias (default -1).
- multinomial_treatment (int) – Defines how to handle multinomial features for learning. It can be one of the DomainContinuizer multinomial_treatment constants (default: DomainContinuizer.NValues).
New in version 2.6.1: Added multinomial_treatment
- __call__(data, weight_id=None)¶
Return a classifier trained on the data (weight_id is ignored).
Parameters: - data (Orange.data.Table) – Training data set.
- weight_id (int) – Ignored.
Return type: Orange.core.LinearClassifier
Note
The Orange.core.LinearClassifier is the same class as Orange.classification.svm.LinearClassifier.
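The one vs. rest strategy mentioned in the note above is easy to sketch independently of LIBLINEAR: one binary model is trained per class, and prediction picks the class whose model scores highest. The helpers and the toy binary learner below are illustrative, not Orange API:

```python
def fit_one_vs_rest(xs, ys, classes, fit_binary):
    # one binary model per class: instances of that class vs. all others
    return {c: fit_binary(xs, [1 if y == c else 0 for y in ys])
            for c in classes}

def predict_one_vs_rest(models, score, x):
    # the predicted class is the one whose binary model scores highest
    return max(models, key=lambda c: score(models[c], x))

# toy binary "learner": the model is the mean of the positive instances,
# and the score is minus the distance to that mean
def toy_fit(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    return sum(pos) / float(len(pos))

def toy_score(model, x):
    return -abs(x - model)
```

In the real learner each binary model is an L2-regularized logistic regression fitted by LIBLINEAR, but the wrapping logic is the same.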
Examples¶
The first example shows a straightforward use of logistic regression (logreg-run.py).
import Orange

titanic = Orange.data.Table("titanic")
lr = Orange.classification.logreg.LogRegLearner(titanic)

# compute classification accuracy
correct = 0.0
for ex in titanic:
    if lr(ex) == ex.getclass():
        correct += 1
print "Classification accuracy:", correct / len(titanic)
print Orange.classification.logreg.dump(lr)
Result:
Classification accuracy: 0.778282598819
class attribute = survived
class values = <no, yes>
Attribute beta st. error wald Z P OR=exp(beta)
Intercept -1.23 0.08 -15.15 -0.00
status=first 0.86 0.16 5.39 0.00 2.36
status=second -0.16 0.18 -0.91 0.36 0.85
status=third -0.92 0.15 -6.12 0.00 0.40
age=child 1.06 0.25 4.30 0.00 2.89
sex=female 2.42 0.14 17.04 0.00 11.25
The next example shows how to handle singularities in data sets (logreg-singularities.py).
import Orange

adult = Orange.data.Table("adult_sample")
lr = Orange.classification.logreg.LogRegLearner(adult, remove_singular=1)

for ex in adult[:5]:
    print ex.getclass(), lr(ex)
print Orange.classification.logreg.dump(lr)
The first few lines of the output of this script are:
<=50K <=50K
<=50K <=50K
<=50K <=50K
>50K >50K
<=50K >50K
class attribute = y
class values = <>50K, <=50K>
Attribute beta st. error wald Z P OR=exp(beta)
Intercept 6.62 -0.00 -inf 0.00
age -0.04 0.00 -inf 0.00 0.96
fnlwgt -0.00 0.00 -inf 0.00 1.00
education-num -0.28 0.00 -inf 0.00 0.76
marital-status=Divorced 4.29 0.00 inf 0.00 72.62
marital-status=Never-married 3.79 0.00 inf 0.00 44.45
marital-status=Separated 3.46 0.00 inf 0.00 31.95
marital-status=Widowed 3.85 0.00 inf 0.00 46.96
marital-status=Married-spouse-absent 3.98 0.00 inf 0.00 53.63
marital-status=Married-AF-spouse 4.01 0.00 inf 0.00 55.19
occupation=Tech-support -0.32 0.00 -inf 0.00 0.72
If remove_singular is set to 0, inducing a logistic regression classifier raises an error:
Traceback (most recent call last):
File "logreg-singularities.py", line 4, in <module>
lr = classification.logreg.LogRegLearner(table, removeSingular=0)
File "/home/jure/devel/orange/Orange/classification/logreg.py", line 255, in LogRegLearner
return lr(examples, weightID)
File "/home/jure/devel/orange/Orange/classification/logreg.py", line 291, in __call__
lr = learner(examples, weight)
orange.KernelException: 'orange.LogRegLearner': singularity in workclass=Never-worked
The variable that causes the singularity is workclass.
The example below shows how stepwise logistic regression can improve classification performance (logreg-stepwise.py):
import Orange

ionosphere = Orange.data.Table("ionosphere.tab")

lr = Orange.classification.logreg.LogRegLearner(remove_singular=1)
learners = (
    Orange.classification.logreg.LogRegLearner(name='logistic',
        remove_singular=1),
    Orange.feature.selection.FilteredLearner(lr,
        filter=Orange.classification.logreg.StepWiseFSSFilter(add_crit=0.05,
        delete_crit=0.9), name='filtered')
)
results = Orange.evaluation.testing.cross_validation(learners, ionosphere, store_classifiers=1)

# output the results
print "Learner CA"
for i in range(len(learners)):
    print "%-12s %5.3f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i])

# find out which features were retained by filtering
print "\nNumber of times features were used in cross-validation:"
features_used = {}
for i in range(10):
    for a in results.classifiers[i][1].atts():
        if a.name in features_used.keys():
            features_used[a.name] += 1
        else:
            features_used[a.name] = 1
for k in features_used:
    print "%2d x %s" % (features_used[k], k)
The output of this script is:
Learner CA
logistic 0.841
filtered 0.846
Number of times features were used in cross-validation:
1 x a21
10 x a22
8 x a23
7 x a24
1 x a25
10 x a26
10 x a27
3 x a28
7 x a29
9 x a31
2 x a16
7 x a12
1 x a32
8 x a15
10 x a14
4 x a17
7 x a30
10 x a11
1 x a10
1 x a13
10 x a34
2 x a19
1 x a18
10 x a3
10 x a5
4 x a4
4 x a7
8 x a6
10 x a9
10 x a8