In [ ]
import ydf
import pandas as pd
In [ ]
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": ["red", "red", "blue", "green"],
    "feature_2": ["hot", "hot", "cold", ""],
})
model = ydf.RandomForestLearner(label="label").train(dataset)
Train model on 4 examples
Model trained in 0:00:00.008941
In the Dataspec tab, we can see that these features were detected as categorical.
In [ ]
model.describe()
Out[ ]
Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    CATEGORICAL: 3 (100%)

Columns:

CATEGORICAL: 3 (100%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)
    1: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
    2: "feature_2" CATEGORICAL num-nas:1 (25%) has-dict vocab-size:1 num-oods:3 (100%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0 CI95[W][0 0.527129]
LogLoss: : 1.57374
ErrorRate: : 1

Default Accuracy: : 0.5
Default LogLoss: : 0.693147
Default ErrorRate: : 0.5

Confusion Table:
truth\prediction
       false  true
false      0     2
true       2     0
Total: 4
Number of trees : 300
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
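In the dataspec above, both features report `vocab-size:1` with every non-missing value counted as out-of-dictionary. A plausible reading is that values seen fewer times than the learner's `min_vocab_frequency` threshold are folded into the OOD item; with only four examples, no value reaches the threshold. The sketch below illustrates that idea in plain Python. It is not YDF's implementation: `build_vocabulary` and the `<OOD>` token are hypothetical names, the default of 5 is an assumption to check against your YDF version, and treating the empty string as missing is inferred from the `num-nas:1` entry for "feature_2".

```python
from collections import Counter

OOD = "<OOD>"  # hypothetical placeholder for the out-of-dictionary item


def build_vocabulary(values, min_vocab_frequency=5):
    """Return a dictionary in which rare values are folded into OOD.

    Illustrative only: the threshold default (5) is an assumption.
    """
    # Empty strings are treated as missing values and never enter the dictionary.
    counts = Counter(v for v in values if v != "")
    kept = sorted(v for v, c in counts.items() if c >= min_vocab_frequency)
    return [OOD] + kept


def count_oods(values, vocab):
    """Count non-missing values that are not dictionary entries."""
    return sum(1 for v in values if v != "" and v not in vocab)


feature_1 = ["red", "red", "blue", "green"]
feature_2 = ["hot", "hot", "cold", ""]

for name, column in [("feature_1", feature_1), ("feature_2", feature_2)]:
    vocab = build_vocabulary(column)
    print(name, "vocab-size:", len(vocab), "num-oods:", count_oods(column, vocab))
```

Under these assumptions the sketch reproduces the counts shown in the dataspec: a dictionary holding only the OOD item, with 4 and 3 out-of-dictionary values respectively. Such tiny datasets are useful for demonstrating semantics, not for judging model quality.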
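Sometimes, you may want to force the semantic of a feature to be categorical.
In the next example, "feature_1" and "feature_2" are integers, so they would be automatically detected as numerical. However, we want "feature_1" to be detected as categorical.
In the model description, note that "feature_1" is CATEGORICAL while "feature_2" is NUMERICAL.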
In [ ]
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [5, 6, 7, 6],
})
model = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)],
    include_all_columns=True,
).train(dataset)
# Note: include_all_columns=True allows the model to use all the
# columns as features, not just the ones in "features".
model.describe()
Train model on 4 examples
Model trained in 0:00:00.004352
Out[ ]
Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    CATEGORICAL: 2 (66.6667%)
    NUMERICAL: 1 (33.3333%)

Columns:

CATEGORICAL: 2 (66.6667%)
    0: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
    1: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

NUMERICAL: 1 (33.3333%)
    2: "feature_2" NUMERICAL mean:6 min:5 max:7 sd:0.707107

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0 CI95[W][0 0.527129]
LogLoss: : 1.57805
ErrorRate: : 1

Default Accuracy: : 0.5
Default LogLoss: : 0.693147
Default ErrorRate: : 0.5

Confusion Table:
truth\prediction
       false  true
false      0     2
true       2     0
Total: 4
Number of trees : 300
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
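The two examples can be summarized as a simple rule: string columns are detected as categorical, integer columns as numerical, and an explicit semantic given by the user takes precedence over detection. The following plain-Python sketch mimics that rule for illustration only. `infer_semantic` and `resolve_semantics` are hypothetical helpers, not part of the YDF API, and the real detection logic covers more types and cases than shown here.

```python
def infer_semantic(values):
    """Guess a semantic from the Python type of the non-missing values."""
    non_missing = [v for v in values if v is not None and v != ""]
    if not non_missing:
        raise ValueError("Cannot infer a semantic from an empty column.")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in non_missing):
        return "NUMERICAL"
    if all(isinstance(v, str) for v in non_missing):
        return "CATEGORICAL"
    raise ValueError("This sketch only handles numeric and string columns.")


def resolve_semantics(columns, overrides=None):
    """Apply user overrides first, then fall back to automatic inference."""
    overrides = overrides or {}
    resolved = {}
    for name, values in columns.items():
        if name in overrides:
            resolved[name] = overrides[name]
        else:
            resolved[name] = infer_semantic(values)
    return resolved


columns = {"feature_1": [1, 2, 2, 1], "feature_2": [5, 6, 7, 6]}
print(resolve_semantics(columns))
print(resolve_semantics(columns, overrides={"feature_1": "CATEGORICAL"}))
```

The second call corresponds to passing `ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)` together with `include_all_columns=True`: the override applies to "feature_1" only, and "feature_2" keeps its automatically detected semantic.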