In [2]
import ydf
import pandas as pd
In [3]
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [0.1, 0.8, 0.9, 0.1],
})
model = ydf.RandomForestLearner(label="label").train(dataset)
Train model on 4 examples
Model trained in 0:00:00.009728
In the Dataspec tab of the model description, we can see that the features were detected as numerical.
In [3]
model.describe()
Out[3]
Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB
Number of records: 4
Number of columns: 3
Number of columns by type:
    NUMERICAL: 2 (66.6667%)
    CATEGORICAL: 1 (33.3333%)
Columns:
NUMERICAL: 2 (66.6667%)
    1: "feature_1" NUMERICAL mean:1.5 min:1 max:2 sd:0.5
    2: "feature_2" NUMERICAL mean:0.475 min:0.1 max:0.9 sd:0.376663
CATEGORICAL: 1 (33.3333%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)
Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label
Accuracy: 0  CI95[W][0 0.527129]
LogLoss: : 1.59508
ErrorRate: : 1
Default Accuracy: : 0.5
Default LogLoss: : 0.693147
Default ErrorRate: : 0.5
Confusion Table:
truth\prediction
       false  true
false      0     2
 true      2     0
Total: 4
Number of trees : 300
Only the first tree is printed.
Tree #0: val:"false" prob:[0.5, 0.5]
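The numerical statistics reported in the Dataspec tab can be cross-checked with pandas alone. The following is a minimal sketch, not part of YDF: it assumes that the reported `sd` is the population standard deviation (`ddof=0`), which is the convention that reproduces the values printed above for this dataset. The names `numeric_columns` and `stats` are illustrative.

```python
import pandas as pd

dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [0.1, 0.8, 0.9, 0.1],
})

# pandas counts bool as a numeric dtype, so boolean columns are excluded
# explicitly to keep only the integer and float columns.
numeric_columns = [
    name
    for name in dataset.columns
    if pd.api.types.is_numeric_dtype(dataset[name])
    and not pd.api.types.is_bool_dtype(dataset[name])
]

# Assumption: "sd" in the dataspec is the population standard deviation.
stats = {
    name: {
        "mean": float(dataset[name].mean()),
        "min": float(dataset[name].min()),
        "max": float(dataset[name].max()),
        "sd": float(dataset[name].std(ddof=0)),
    }
    for name in numeric_columns
}

for name, values in stats.items():
    print(name, {key: round(value, 6) for key, value in values.items()})
```

With the default pandas convention (`ddof=1`) the standard deviations would come out larger, so the choice of `ddof` matters when comparing against the dataspec.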
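Sometimes you want to force the semantic of a feature to be numerical.
In the next example, both "feature_1" and "feature_2" look like booleans. However, we want "feature_1" to be treated as numerical.
In the model description, note that "feature_1" is numerical while "feature_2" is boolean.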
In [4]
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [True, True, False, False],
    "feature_2": [True, False, True, False],
})
model = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.NUMERICAL)],
    include_all_columns=True,
).train(dataset)
# Note: include_all_columns=True allows the model to use all the
# columns as features, not just the ones in "features".
model.describe()
Train model on 4 examples
Model trained in 0:00:00.004133
Out[4]
Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB
Number of records: 4
Number of columns: 3
Number of columns by type:
    BOOLEAN: 1 (33.3333%)
    CATEGORICAL: 1 (33.3333%)
    NUMERICAL: 1 (33.3333%)
Columns:
BOOLEAN: 1 (33.3333%)
    2: "feature_2" BOOLEAN true_count:2 false_count:2
CATEGORICAL: 1 (33.3333%)
    1: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)
NUMERICAL: 1 (33.3333%)
    0: "feature_1" NUMERICAL mean:0.5 min:0 max:1 sd:0.5
Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label
Accuracy: 0  CI95[W][0 0.527129]
LogLoss: : 1.66759
ErrorRate: : 1
Default Accuracy: : 0.5
Default LogLoss: : 0.693147
Default ErrorRate: : 0.5
Confusion Table:
truth\prediction
       false  true
false      0     2
 true      2     0
Total: 4
Number of trees : 300
Only the first tree is printed.
Tree #0: val:"false" prob:[0.5, 0.5]
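To see what a numerical reading of a boolean column looks like, the values can be cast with pandas. This is only an illustrative sketch and says nothing about how YDF converts the column internally; it assumes the usual mapping of False to 0 and True to 1 and the population standard deviation (`ddof=0`). Under those assumptions it reproduces the `mean:0.5 min:0 max:1 sd:0.5` entry shown for "feature_1", as well as the true and false counts shown for "feature_2". The variable names are illustrative.

```python
import pandas as pd

dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [True, True, False, False],
    "feature_2": [True, False, True, False],
})

# Casting booleans to float maps False to 0.0 and True to 1.0.
feature_1_numeric = dataset["feature_1"].astype(float)

summary = {
    "mean": float(feature_1_numeric.mean()),
    "min": float(feature_1_numeric.min()),
    "max": float(feature_1_numeric.max()),
    "sd": float(feature_1_numeric.std(ddof=0)),  # assumption: population sd
}
print(summary)

# feature_2 stays boolean, so it is summarized by counting each value.
true_count = int(dataset["feature_2"].sum())
false_count = int((~dataset["feature_2"]).sum())
print(true_count, false_count)
```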
Let's create some missing values.
In the Dataspec tab, note the num-nas:2 (50%) entry for "feature_2". It means that "feature_2" contains two missing values (i.e. 50% of its values are missing).
In [6]
import math
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [0.1, 0.8, math.nan, math.nan],
})
model = ydf.RandomForestLearner(label="label").train(dataset)
model.describe()
Train model on 4 examples
Model trained in 0:00:00.005587
Out[6]
Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB
Number of records: 4
Number of columns: 3
Number of columns by type:
    NUMERICAL: 2 (66.6667%)
    CATEGORICAL: 1 (33.3333%)
Columns:
NUMERICAL: 2 (66.6667%)
    1: "feature_1" NUMERICAL mean:1.5 min:1 max:2 sd:0.5
    2: "feature_2" NUMERICAL num-nas:2 (50%) mean:0.45 min:0.1 max:0.8 sd:0.35
CATEGORICAL: 1 (33.3333%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)
Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label
Accuracy: 0  CI95[W][0 0.527129]
LogLoss: : 1.6165
ErrorRate: : 1
Default Accuracy: : 0.5
Default LogLoss: : 0.693147
Default ErrorRate: : 0.5
Confusion Table:
truth\prediction
       false  true
false      0     2
 true      2     0
Total: 4
Number of trees : 300
Only the first tree is printed.
Tree #0: val:"false" prob:[0.5, 0.5]
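Before training, it can be useful to check how many values are missing in each column. The sketch below uses pandas only and is not part of YDF. It counts the NaN values in "feature_2" and computes the mean and standard deviation over the values that are present, assuming the population convention (`ddof=0`). Under that assumption the results agree with the `num-nas:2 (50%) mean:0.45 min:0.1 max:0.8 sd:0.35` entry in the dataspec. The variable names are illustrative.

```python
import math

import pandas as pd

dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [0.1, 0.8, math.nan, math.nan],
})

# NaN is what pandas treats as a missing numerical value.
missing_count = int(dataset["feature_2"].isna().sum())
missing_fraction = missing_count / len(dataset)
print(f"num-nas:{missing_count} ({missing_fraction:.0%})")

# Statistics are computed over the observed (non-missing) values only.
observed = dataset["feature_2"].dropna()
observed_mean = float(observed.mean())
observed_sd = float(observed.std(ddof=0))  # assumption: population sd
print(round(observed_mean, 6), round(observed_sd, 6))
```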