Logistic regression (logreg)

Logistic regression is a statistical classification method that fits data to a logistic function. Orange provides various enhancements of the method, such as stepwise selection of variables and handling of constant variables and singularities.

class Orange.classification.logreg.LogRegLearner(remove_singular=0, fitter=None, **kwds)

Logistic regression learner.

Returns either a learning algorithm (instance of LogRegLearner) or, if data is provided, a fitted model (instance of LogRegClassifier).

Parameters:
  • data (Orange.data.Table) – data table; it may contain discrete and continuous features
  • weight_id (int) – the ID of the weight meta attribute
  • remove_singular (bool) – automated removal of constant features and singularities (default: False)
  • fitter – the fitting algorithm (default: LogRegFitter_Cholesky)
  • stepwise_lr (bool) – enables stepwise feature selection (default: False)
  • add_crit (float) – threshold for adding a feature in stepwise selection (default: 0.2)
  • delete_crit (float) – threshold for removing a feature in stepwise selection (default: 0.3)
  • num_features (int) – number of features in stepwise selection (default: -1, no limit)
Return type:

LogRegLearner or LogRegClassifier

__call__(data, weight=0)

Fit a model to the given data.

Parameters:
  • data (Table) – Data instances.
  • weight (int) – Id of meta attribute with instance weights
Return type:

LogRegClassifier
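
A minimal usage sketch (the titanic data set used in the examples below is assumed to be available): calling the constructor without data returns a learner, while passing data returns a fitted classifier directly.

import Orange

titanic = Orange.data.Table("titanic")

# without data: construct a learner and fit it later
learner = Orange.classification.logreg.LogRegLearner()
classifier = learner(titanic)

# with data: fit a classifier in one step
classifier = Orange.classification.logreg.LogRegLearner(titanic)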

class Orange.classification.logreg.LogRegClassifier

A logistic regression classification model. Stores estimated values of regression coefficients and their significances, and uses them to predict classes and class probabilities.

beta

Estimated regression coefficients.

beta_se

Estimated standard errors for regression coefficients.

wald_Z

Wald Z statistics for beta coefficients. Wald Z is computed as beta/beta_se.

P

List of p-values for the beta coefficients, i.e. the probability of observing a coefficient at least as extreme if its true value were 0. Each p-value is computed from the squared Wald Z statistic, which follows a chi-squared distribution with one degree of freedom.
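
For illustration, a p-value can be recomputed from a coefficient and its standard error as follows (a sketch; SciPy is assumed and is not part of Orange):

from scipy.stats import chi2

def wald_p_value(beta, beta_se):
    z = beta / beta_se         # Wald Z statistic
    return chi2.sf(z ** 2, 1)  # upper tail of chi-squared with 1 d.f.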

likelihood

The likelihood of the sample (i.e. the learning data) given the fitted model.

fit_status

Tells how the model fitting ended: regularly (LogRegFitter.OK), interrupted because one of the beta coefficients escaped towards infinity (LogRegFitter.Infinity), or because the values did not converge (LogRegFitter.Divergence).

Although the model is functional in all cases, it is recommended to inspect the coefficients of the model if the fitting did not end normally.
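
For example, the status can be checked after fitting (a sketch, assuming a data table named data):

lr = Orange.classification.logreg.LogRegLearner(data)
if lr.fit_status != Orange.classification.logreg.LogRegFitter.OK:
    print "fitting did not end normally; inspect lr.beta and lr.beta_se"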

__call__(instance, result_type)

Classify a new instance.

Parameters:
  • instance (Instance) – instance to be classified.
  • result_type – GetValue or GetProbabilities or GetBoth
Return type:

Value, Distribution or a tuple with both
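
A sketch of obtaining both the predicted value and the class probabilities, assuming a fitted classifier and an instance (the result-type constants are those of Orange.classification.Classifier):

value, probabilities = classifier(instance,
    Orange.classification.Classifier.GetBoth)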

class Orange.classification.logreg.LogRegFitter

LogRegFitter is the abstract base class for logistic regression fitters. Fitters can be called with a data table and return a vector of coefficients and the corresponding statistics, or a status signifying an error. The possible statuses are:

OK

Optimization converged

Infinity

Optimization failed due to one or more beta coefficients escaping towards infinity.

Divergence

Beta coefficients failed to converge, but without any of beta coefficients escaping toward infinity.

Constant

The data is singular due to a constant variable.

Singularity

The data is singular.

__call__(data, weight_id)

Fit the model and return a tuple with the fitted values and the corresponding statistics or an error indicator. The two cases differ by the tuple length and the status (the first tuple element).

(status, beta, beta_se, likelihood)
Fitting succeeded. The first element, status, is either OK, Infinity or Divergence. In the latter two cases, the returned values may still be useful for making predictions, but it is recommended to inspect the coefficients and their errors and decide whether to use the model or not.
(status, variable)
The fitter failed due to the indicated variable. status is either Constant or Singularity.

The proper way of calling the fitter is to handle both scenarios:

res = fitter(data)
if res[0] in [fitter.OK, fitter.Infinity, fitter.Divergence]:
    status, beta, beta_se, likelihood = res
    # fitting succeeded: examine the coefficients and their errors
    print beta
else:
    status, variable = res
    # fitting failed: remove the variable or report the error
    print "fitting failed due to", variable.name

class Orange.classification.logreg.LogRegFitter_Cholesky

The sole fitter available at the moment. This is a C++ translation of Alan Miller’s logistic regression code, which uses the Newton-Raphson algorithm to iteratively maximize the likelihood of the training data.

class Orange.classification.logreg.StepWiseFSS(add_crit=0.2, delete_crit=0.3, num_features=-1, **kwds)

Bases: object

A learning algorithm for logistic regression that implements a stepwise feature subset selection as described in Applied Logistic Regression (Hosmer and Lemeshow, 2000).

Each step of the algorithm is composed of two parts. The first is backward elimination, in which the least significant variable in the model is removed if its p-value is above the prescribed threshold delete_crit. The second is forward selection, in which all variables are tested for addition to the model, and the one with the most significant contribution is added if the corresponding p-value is smaller than the prescribed add_crit. The algorithm stops when no more variables can be added or removed.

The model can be additionally constrained by setting num_features to a non-negative value. The algorithm will then stop once the number of selected variables reaches the given limit.

Significances are assessed by the likelihood ratio chi-square test. The usual F test is not appropriate since the errors are assumed to follow a binomial distribution.

The class constructor returns an instance of learning algorithm or, if given training data, a list of selected variables.

Parameters:
  • table (Orange.data.Table) – training data.
  • add_crit (float) – threshold for adding a variable (default: 0.2)
  • delete_crit (float) – threshold for removing a variable (default: 0.3); should be higher than add_crit.
  • num_features (int) – maximum number of selected features, use -1 for infinity.
Return type:

StepWiseFSS or list of features
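
A minimal sketch of direct use, on the ionosphere data set from the examples below: passing data to the constructor returns the list of selected variables.

import Orange

ionosphere = Orange.data.Table("ionosphere.tab")
selected = Orange.classification.logreg.StepWiseFSS(ionosphere,
    add_crit=0.05, delete_crit=0.9)
print [var.name for var in selected]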

Orange.classification.logreg.dump(classifier)

Return a formatted string describing the logistic regression model.

Parameters:
  • classifier – logistic regression classifier.
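
Since dump returns the string rather than printing it, a typical use is:

print Orange.classification.logreg.dump(lr)
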
class Orange.classification.logreg.LibLinearLogRegLearner(solver_type=L2R_LR, C=1, eps=0.01, normalization=True, bias=-1, multinomial_treatment=NValues, **kwargs)

A logistic regression learner from LIBLINEAR.

Supports L2 regularized learning.

Note

Unlike LogRegLearner this one supports multi-class classification using one vs. rest strategy.

__init__(solver_type=L2R_LR, C=1, eps=0.01, normalization=True, bias=-1, multinomial_treatment=NValues, **kwargs)
Parameters:
  • solver_type – One of the following class constants: L2R_LR, L2R_LR_DUAL, L1R_LR.
  • C (float) – Regularization parameter (default 1.0). Higher values of C mean less regularization (C is a coefficient for the loss function).
  • eps (float) – Stopping criteria (default 0.01)
  • normalization (bool) – Normalize the input data prior to learning (default True)
  • bias (float) – If positive, use it as a bias (default -1).
  • multinomial_treatment (int) – Defines how to handle multinomial features for learning. It can be one of the DomainContinuizer multinomial_treatment constants (default: DomainContinuizer.NValues).

New in version 2.6.1: Added multinomial_treatment
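
A minimal training sketch (the iris data set is assumed to be available; it is multi-class, which LogRegLearner itself does not support):

import Orange

iris = Orange.data.Table("iris")
learner = Orange.classification.logreg.LibLinearLogRegLearner(C=1.0)
classifier = learner(iris)
print classifier(iris[0])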

__call__(data, weight_id=None)

Return a classifier trained on the data (weight_id is ignored).

Return type:

Orange.core.LinearClassifier

Note

Orange.core.LinearClassifier is the same class as Orange.classification.svm.LinearClassifier.

Examples

The first example shows a straightforward use of logistic regression (logreg-run.py).

import Orange

titanic = Orange.data.Table("titanic")
lr = Orange.classification.logreg.LogRegLearner(titanic)

# compute classification accuracy
correct = 0.0
for ex in titanic:
    if lr(ex) == ex.getclass():
        correct += 1
print "Classification accuracy:", correct / len(titanic)
print Orange.classification.logreg.dump(lr)

Result:

Classification accuracy: 0.778282598819

class attribute = survived
class values = <no, yes>

    Attribute       beta  st. error     wald Z          P OR=exp(beta)

    Intercept      -1.23       0.08     -15.15      -0.00
 status=first       0.86       0.16       5.39       0.00       2.36
status=second      -0.16       0.18      -0.91       0.36       0.85
 status=third      -0.92       0.15      -6.12       0.00       0.40
    age=child       1.06       0.25       4.30       0.00       2.89
   sex=female       2.42       0.14      17.04       0.00      11.25

The next example shows how to handle singularities in data sets (logreg-singularities.py).

import Orange

adult = Orange.data.Table("adult_sample")
lr = Orange.classification.logreg.LogRegLearner(adult, remove_singular=1)

for ex in adult[:5]:
    print ex.getclass(), lr(ex)
print Orange.classification.logreg.dump(lr)

The first few lines of the output of this script are:

<=50K <=50K
<=50K <=50K
<=50K <=50K
>50K >50K
<=50K >50K

class attribute = y
class values = <>50K, <=50K>

                           Attribute       beta  st. error     wald Z          P OR=exp(beta)

                           Intercept       6.62      -0.00       -inf       0.00
                                 age      -0.04       0.00       -inf       0.00       0.96
                              fnlwgt      -0.00       0.00       -inf       0.00       1.00
                       education-num      -0.28       0.00       -inf       0.00       0.76
             marital-status=Divorced       4.29       0.00        inf       0.00      72.62
        marital-status=Never-married       3.79       0.00        inf       0.00      44.45
            marital-status=Separated       3.46       0.00        inf       0.00      31.95
              marital-status=Widowed       3.85       0.00        inf       0.00      46.96
marital-status=Married-spouse-absent       3.98       0.00        inf       0.00      53.63
    marital-status=Married-AF-spouse       4.01       0.00        inf       0.00      55.19
             occupation=Tech-support      -0.32       0.00       -inf       0.00       0.72

If remove_singular is set to 0, inducing a logistic regression classifier raises an error:

Traceback (most recent call last):
  File "logreg-singularities.py", line 4, in <module>
    lr = classification.logreg.LogRegLearner(table, removeSingular=0)
  File "/home/jure/devel/orange/Orange/classification/logreg.py", line 255, in LogRegLearner
    return lr(examples, weightID)
  File "/home/jure/devel/orange/Orange/classification/logreg.py", line 291, in __call__
    lr = learner(examples, weight)
orange.KernelException: 'orange.LogRegLearner': singularity in workclass=Never-worked

The variable that causes the singularity is workclass.

The example below shows how the use of stepwise logistic regression can help to gain in classification performance (logreg-stepwise.py):

import Orange

ionosphere = Orange.data.Table("ionosphere.tab")

lr = Orange.classification.logreg.LogRegLearner(remove_singular=1)
learners = (
  Orange.classification.logreg.LogRegLearner(name='logistic',
      remove_singular=1),
  Orange.feature.selection.FilteredLearner(lr,
     filter=Orange.classification.logreg.StepWiseFSSFilter(add_crit=0.05,
         delete_crit=0.9), name='filtered')
)
results = Orange.evaluation.testing.cross_validation(learners, ionosphere, store_classifiers=1)

# output the results
print "Learner      CA"
for i in range(len(learners)):
    print "%-12s %5.3f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i])

# find out which features were retained by filtering

print "\nNumber of times features were used in cross-validation:"
features_used = {}
for i in range(10):
    for a in results.classifiers[i][1].atts():
        if a.name in features_used:
            features_used[a.name] += 1
        else:
            features_used[a.name] = 1
for k in features_used:
    print "%2d x %s" % (features_used[k], k)

The output of this script is:

Learner      CA
logistic     0.841
filtered     0.846

Number of times features were used in cross-validation:
 1 x a21
10 x a22
 8 x a23
 7 x a24
 1 x a25
10 x a26
10 x a27
 3 x a28
 7 x a29
 9 x a31
 2 x a16
 7 x a12
 1 x a32
 8 x a15
10 x a14
 4 x a17
 7 x a30
10 x a11
 1 x a10
 1 x a13
10 x a34
 2 x a19
 1 x a18
10 x a3
10 x a5
 4 x a4
 4 x a7
 8 x a6
10 x a9
10 x a8