Continuization (continuization)

Continuization refers to the transformation of discrete (binary or multinomial) variables into continuous ones. The class described below operates on the entire domain; the documentation in Orange.core.transformvalue.rst explains how to treat each variable separately.

class Orange.data.continuization.DomainContinuizer

Given a domain or a data table, returns a new domain containing only continuous attributes. Some options are available only if data is provided.

The attributes are treated according to their type:

  • continuous variables can be normalized or left unchanged;
  • discrete attributes with fewer than two possible values are removed;
  • binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables;
  • multinomial variables are treated according to the flag multinomial_treatment.

The typical use of the class is as follows:

import Orange

# `data` is assumed to be an Orange.data.Table loaded beforehand
continuizer = Orange.data.continuization.DomainContinuizer()
continuizer.multinomial_treatment = continuizer.LowestIsBase
domain0 = continuizer(data)
data0 = data.translate(domain0)

zero_based

Determines the value used as the “low” value of the variable. When a binary variable is transformed into a continuous one, or when a multivalued variable is transformed into multiple variables, the transformed variable can have either the values 0.0 and 1.0 (default, zero_based is True) or -1.0 and 1.0 (zero_based is False). The following text assumes the default case.
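
The effect of zero_based on a binary variable can be sketched in plain Python. The function below is illustrative and not part of Orange; it assumes that the second of the two values maps to the high value 1.0, which the text above does not state.

```python
def binary_indicator(value, values, zero_based=True):
    """Map a value of a two-valued variable to a continuous indicator.

    Assumption (not confirmed by the documentation): the second value
    in `values` is the one encoded as 1.0.
    """
    if len(values) != 2:
        raise ValueError("expected a binary variable")
    if value not in values:
        raise ValueError("unknown value: %r" % (value,))
    low = 0.0 if zero_based else -1.0
    return 1.0 if value == values[1] else low


# Hypothetical binary variable with the values "no" and "yes"
answers = ["no", "yes"]
default_low = binary_indicator("no", answers)                     # zero_based is True
symmetric_low = binary_indicator("no", answers, zero_based=False)
high = binary_indicator("yes", answers, zero_based=False)
```

Only the low value changes between the two settings; the high value is 1.0 in both.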

multinomial_treatment

Decides the treatment of multinomial variables. Let N be the number of the variable’s values.

DomainContinuizer.NValues

The variable is replaced by N indicator variables, each corresponding to one value of the original variable. In other words, for each value of the original attribute, only the corresponding new attribute will have a value of 1 and others will be zero.

Note that these variables are not independent, so they cannot be used (directly) in, for instance, linear or logistic regression.

For example, data set “bridges” has feature “RIVER” with values “M”, “A”, “O” and “Y”, in that order. Its value for the 15th row is “M”. Continuization replaces the variable with variables “RIVER=M”, “RIVER=A”, “RIVER=O” and “RIVER=Y”. For the 15th row, the first has value 1 and others are 0.
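
The “RIVER” example can be reproduced with a small stand-alone sketch. The helper n_value_indicators is a hypothetical name used for illustration only; it mimics the described behaviour and does not call Orange.

```python
def n_value_indicators(name, values, value):
    """Return N indicators, one per possible value of the variable.

    Only the indicator that corresponds to `value` is 1.0; the rest are 0.0
    (the default, zero_based case).
    """
    if value not in values:
        raise ValueError("unknown value: %r" % (value,))
    return {"%s=%s" % (name, v): (1.0 if v == value else 0.0) for v in values}


# Values of "RIVER" in the order given in the text
river_values = ["M", "A", "O", "Y"]

# The 15th row has the value "M"
row15 = n_value_indicators("RIVER", river_values, "M")
```

The indicators of one row always sum to 1, which is the dependence mentioned above.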

DomainContinuizer.LowestIsBase

Similar to the above, except that it creates only N-1 variables. The omitted indicator is the one for the lowest value: when the original variable has the lowest value, all indicators are 0.

If the variable descriptor has the base_value defined, the specified value is used as base instead of the lowest one.

Continuizing the variable “RIVER” gives results similar to the above, except that “RIVER=M” is omitted; all three remaining variables would be zero for the 15th data instance.
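
A minimal sketch of this treatment, again in plain Python rather than through Orange. The function name and the base_index parameter are hypothetical; base_index stands in for either the lowest value (index 0) or a value chosen through base_value.

```python
def indicators_with_base(name, values, value, base_index=0):
    """Return N-1 indicators, omitting the one for the base value.

    When `value` is the base value, every remaining indicator is 0.0.
    """
    if value not in values:
        raise ValueError("unknown value: %r" % (value,))
    return {
        "%s=%s" % (name, v): (1.0 if v == value else 0.0)
        for i, v in enumerate(values)
        if i != base_index
    }


river_values = ["M", "A", "O", "Y"]

# The 15th row has the base value "M", so all three indicators are zero
row15 = indicators_with_base("RIVER", river_values, "M")

# A row with the value "O" sets only the "RIVER=O" indicator
row_o = indicators_with_base("RIVER", river_values, "O")
```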

DomainContinuizer.FrequentIsBase

Like above, except that the most frequent value is used as the base (this can again be overridden by setting the descriptor’s base_value). If several values are tied for the most frequent, the one with the lowest index is used. The frequency of values is extracted from the data, so this option cannot be used if the continuizer is given only a domain.

Variable “RIVER” would be continuized similarly to above except that it omits “RIVER=A”, which is the most frequent value.
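
How the base value might be chosen, including the tie-breaking rule, can be sketched as follows. The function is hypothetical and the sample columns are made up for illustration; they are not the actual contents of the “bridges” data set.

```python
from collections import Counter


def most_frequent_index(values, column):
    """Return the index of the most frequent value in `column`.

    max() returns the first maximal item it encounters, so iterating over
    the indices in ascending order resolves ties in favour of the lowest index.
    """
    counts = Counter(column)
    return max(range(len(values)), key=lambda i: counts[values[i]])


river_values = ["M", "A", "O", "Y"]

# Made-up column in which "A" occurs most often
sample_column = ["M", "A", "A", "O", "A", "M", "Y"]
base = most_frequent_index(river_values, sample_column)

# Made-up column in which "M" and "A" are tied
tied_column = ["M", "A", "M", "A"]
tied_base = most_frequent_index(river_values, tied_column)
```

The returned index identifies the value whose indicator is then left out, as in the LowestIsBase case.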

DomainContinuizer.Ignore

Multivalued variables are omitted.

DomainContinuizer.ReportError

Raises an error if there are any multinomial variables in the data.

DomainContinuizer.AsOrdinal

Multivalued variables are treated as ordinal and replaced by a continuous variable holding the value’s index, e.g. 0, 1, 2, 3...

DomainContinuizer.AsNormalizedOrdinal

As above, except that the resulting continuous values are scaled to the range from 0 to 1, e.g. 0, 0.25, 0.5, 0.75, 1 for a five-valued variable.
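
The two ordinal treatments differ only in scaling, as the following sketch suggests. The function as_ordinal and the five-valued variable are hypothetical; the normalized form assumes that the index is divided by N-1, which matches the five-valued example above.

```python
def as_ordinal(values, value, normalized=False):
    """Replace a value by its index, optionally scaled to the range 0..1."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    index = values.index(value)  # raises ValueError for an unknown value
    if normalized:
        # Assumed scaling: divide by the highest index, N-1
        return index / (len(values) - 1)
    return float(index)


# Hypothetical five-valued variable
sizes = ["XS", "S", "M", "L", "XL"]

plain = [as_ordinal(sizes, v) for v in sizes]
scaled = [as_ordinal(sizes, v, normalized=True) for v in sizes]
```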

normalize_continuous

If False (default), continuous variables are left unchanged. If True, they are replaced with normalized values by subtracting the average value and dividing by the standard deviation. The statistics are computed from the data, so the continuizer must be given data, not just a domain.
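
The normalization can be illustrated with Python’s standard statistics module. This is a sketch under an assumption: it uses the population standard deviation (pstdev), while the text does not say whether Orange uses the population or the sample estimate. The function name and the sample column are made up.

```python
from statistics import mean, pstdev


def normalize(column):
    """Subtract the average and divide by the standard deviation.

    Assumption: population standard deviation; a sample estimate
    (statistics.stdev) would give slightly different values.
    """
    avg = mean(column)
    dev = pstdev(column)
    if dev == 0:
        raise ValueError("cannot normalize a constant column")
    return [(x - avg) / dev for x in column]


# Made-up values with average 5.0 and population standard deviation 2.0
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = normalize(column)
```

After normalization the column has an average of 0 and a deviation of 1, which is why the statistics must come from actual data.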