Numerical features

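How a feature is handled depends on its semantic, such as numerical, categorical, boolean, or text. If no semantic is specified, it is inferred automatically. For example, float and integer features are detected as numerical, while strings are detected as categorical.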
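To make the inference rule concrete, here is a simplified sketch of how values could be mapped to a semantic. It is an illustration only, not YDF's actual detection logic: the function name `infer_semantic` and the returned strings are hypothetical, and the boolean case is taken from the BOOLEAN detection shown later on this page.

```python
import math


def infer_semantic(values):
    """Guess a semantic name from Python values (hypothetical helper)."""
    # Ignore missing numerical values, which are represented as NaN.
    present = [v for v in values
               if not (isinstance(v, float) and math.isnan(v))]
    if not present:
        return "UNKNOWN"
    # bool is a subclass of int in Python, so test for it first.
    if all(isinstance(v, bool) for v in present):
        return "BOOLEAN"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in present):
        return "NUMERICAL"
    if all(isinstance(v, str) for v in present):
        return "CATEGORICAL"
    return "UNKNOWN"


print(infer_semantic([1, 2, 2, 1]))
print(infer_semantic([0.1, 0.8, math.nan]))
print(infer_semantic(["red", "blue"]))
print(infer_semantic([True, False]))
```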

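Numerical features can represent quantities or counts, for example the age of a person or the number of items in a bag. Missing numerical values are represented as math.nan.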
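A practical detail when preparing such data: NaN never compares equal to itself, so missing values have to be detected with `math.isnan` rather than `==`. The short sketch below uses a made-up `ages` list purely for illustration.

```python
import math

# Hypothetical raw values; two of them are missing.
ages = [34.0, math.nan, 27.0, math.nan]

# "v == math.nan" is always False, so use math.isnan to find missing values.
num_missing = sum(1 for v in ages if math.isnan(v))
present = [v for v in ages if not math.isnan(v)]

print(num_missing)
print(present)
```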

Here is a training example with a floating-point feature.

In [2]

import ydf
import pandas as pd

In [3]

dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [0.1, 0.8, 0.9, 0.1],
})

model = ydf.RandomForestLearner(label="label").train(dataset)

Train model on 4 examples
Model trained in 0:00:00.009728

In the Dataspec tab of the model description, we can see that both features are detected as NUMERICAL.

In [3]

model.describe()

Out[3]

Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB

Number of records: 4
Number of columns: 3

Number of columns by type:
	NUMERICAL: 2 (66.6667%)
	CATEGORICAL: 1 (33.3333%)

Columns:

NUMERICAL: 2 (66.6667%)
	1: "feature_1" NUMERICAL mean:1.5 min:1 max:2 sd:0.5
	2: "feature_2" NUMERICAL mean:0.475 min:0.1 max:0.9 sd:0.376663

CATEGORICAL: 1 (33.3333%)
	0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.

The following evaluation is computed on the validation or out-of-bag dataset.

Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: : 1.59508
ErrorRate: : 1

Default Accuracy: : 0.5
Default LogLoss: : 0.693147
Default ErrorRate: : 0.5

Confusion Table:
truth\prediction
       false  true
false      0     2
 true      2     0
Total: 4
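The headline metrics in this report can be recomputed from the confusion table. The sketch below is a minimal illustration, not YDF's implementation. It assumes that the "Default" metrics correspond to always predicting the label priors, an assumption that is consistent with the values shown here (0.5 and ln 2 ≈ 0.693147).

```python
import math

# Confusion table from the report, keyed by (truth, prediction).
confusion = {
    ("false", "false"): 0,
    ("false", "true"): 2,
    ("true", "false"): 2,
    ("true", "true"): 0,
}

total = sum(confusion.values())
correct = sum(n for (truth, pred), n in confusion.items() if truth == pred)
accuracy = correct / total
error_rate = 1 - accuracy

# Label priors, used for the assumed definition of the default metrics.
label_counts = {}
for (truth, _), n in confusion.items():
    label_counts[truth] = label_counts.get(truth, 0) + n
priors = [n / total for n in label_counts.values()]

default_accuracy = max(priors)  # always predict the most frequent label
default_logloss = -sum(p * math.log(p) for p in priors if p > 0)

print(accuracy, error_rate, default_accuracy, round(default_logloss, 6))
```

With only four examples, the out-of-bag accuracy of 0 mostly reflects how small the dataset is.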

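Variable importances measure the importance of an input feature for a model.

These variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.

Number of trees : 300

Only printing the first tree.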

Tree #0:
    val:"false" prob:[0.5, 0.5]
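The NUMERICAL statistics listed in the dataspec above can be checked by hand. The sketch below uses only the Python standard library; the `summarize` helper is hypothetical. Note that the reported `sd` values match the population standard deviation (dividing by n), which is an observation from these numbers rather than a documented guarantee.

```python
import statistics

feature_1 = [1, 2, 2, 1]
feature_2 = [0.1, 0.8, 0.9, 0.1]


def summarize(values):
    """Return the statistics shown for a NUMERICAL column (hypothetical helper)."""
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        # Population standard deviation: divides by n, not n - 1.
        "sd": statistics.pstdev(values),
    }


print(summarize(feature_1))
print(summarize(feature_2))
```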

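Sometimes, you may want to force the semantic of a feature to be numerical.

In the next example, "feature_1" and "feature_2" look like booleans. However, we want "feature_1" to be numerical.

In the model description, note that "feature_1" is numerical while "feature_2" is boolean.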

In [4]

dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [True, True, False, False],
    "feature_2": [True, False, True, False],
})

model = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.NUMERICAL)],
    # include_all_columns=True allows the model to use all the columns
    # as features, not just the ones listed in "features".
    include_all_columns=True,
).train(dataset)

model.describe()

Train model on 4 examples
Model trained in 0:00:00.004133

Out[4]

Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB

Number of records: 4
Number of columns: 3

Number of columns by type:
	BOOLEAN: 1 (33.3333%)
	CATEGORICAL: 1 (33.3333%)
	NUMERICAL: 1 (33.3333%)

Columns:

BOOLEAN: 1 (33.3333%)
	2: "feature_2" BOOLEAN true_count:2 false_count:2

CATEGORICAL: 1 (33.3333%)
	1: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

NUMERICAL: 1 (33.3333%)
	0: "feature_1" NUMERICAL mean:0.5 min:0 max:1 sd:0.5

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.

The following evaluation is computed on the validation or out-of-bag dataset.

Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: : 1.66759
ErrorRate: : 1

Default Accuracy: : 0.5
Default LogLoss: : 0.693147
Default ErrorRate: : 0.5

Confusion Table:
truth\prediction
       false  true
false      0     2
 true      2     0
Total: 4

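Variable importances measure the importance of an input feature for a model.

These variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.

Number of trees : 300

Only printing the first tree.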

Tree #0:
    val:"false" prob:[0.5, 0.5]
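The statistics reported for "feature_1" (mean 0.5, min 0, max 1, sd 0.5) are what you would get by reading its boolean values as numbers. The sketch below assumes True maps to 1 and False to 0, which is consistent with the dataspec above but is not taken from YDF's source.

```python
import statistics

feature_1 = [True, True, False, False]

# Assumption: a boolean forced to the numerical semantic is read as 1.0 / 0.0.
as_numbers = [float(v) for v in feature_1]

stats = {
    "mean": statistics.fmean(as_numbers),
    "min": min(as_numbers),
    "max": max(as_numbers),
    "sd": statistics.pstdev(as_numbers),  # population standard deviation
}
print(stats)
```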

Let's create some missing values.

In the Dataspec tab, note num-nas:2 (50%) for "feature_2". This means that "feature_2" contains two missing values, i.e. 50% of its values are missing.

In [6]

import math

dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [0.1, 0.8, math.nan, math.nan],
})

model = ydf.RandomForestLearner(label="label").train(dataset)
model.describe()

Train model on 4 examples
Model trained in 0:00:00.005587

Out[6]

Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB

Number of records: 4
Number of columns: 3

Number of columns by type:
	NUMERICAL: 2 (66.6667%)
	CATEGORICAL: 1 (33.3333%)

Columns:

NUMERICAL: 2 (66.6667%)
	1: "feature_1" NUMERICAL mean:1.5 min:1 max:2 sd:0.5
	2: "feature_2" NUMERICAL num-nas:2 (50%) mean:0.45 min:0.1 max:0.8 sd:0.35

CATEGORICAL: 1 (33.3333%)
	0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.

The following evaluation is computed on the validation or out-of-bag dataset.

Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: : 1.6165
ErrorRate: : 1

Default Accuracy: : 0.5
Default LogLoss: : 0.693147
Default ErrorRate: : 0.5

Confusion Table:
truth\prediction
       false  true
false      0     2
 true      2     0
Total: 4

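Variable importances measure the importance of an input feature for a model.

These variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.

Number of trees : 300

Only printing the first tree.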

Tree #0:
    val:"false" prob:[0.5, 0.5]
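The dataspec line for "feature_2" above reports num-nas:2 (50%) together with mean 0.45 and sd 0.35. Those numbers are consistent with the statistics being computed over the non-missing values only, as this standard-library sketch shows. This reading is inferred from the output, not from YDF's documentation.

```python
import math
import statistics

feature_2 = [0.1, 0.8, math.nan, math.nan]

# Separate missing values from the values that are present.
present = [v for v in feature_2 if not math.isnan(v)]
num_nas = len(feature_2) - len(present)
nas_percent = 100 * num_nas / len(feature_2)

mean = statistics.fmean(present)
sd = statistics.pstdev(present)  # population standard deviation

print(num_nas, nas_percent, round(mean, 6), round(sd, 6))
```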