Classification¶

Setup¶

In [ ]

pip install ydf -U

What is classification?¶

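Classification is the task of predicting a categorical value, such as an enum, a type, or a class from a finite set of possible values. For instance, predicting a color from the set of possible colors RED, BLUE, GREEN is a classification task. The output of a classification model is a probability distribution over the possible classes. The predicted class is the one with the highest probability.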
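
As a minimal illustration of the last two sentences, the sketch below picks the predicted class from a probability distribution. It uses plain NumPy rather than YDF, and the probability values are made up for the example.

```python
import numpy as np

# Hypothetical class names and model output for a single example.
classes = ["RED", "BLUE", "GREEN"]
probabilities = np.array([0.2, 0.7, 0.1])  # One probability per class.

# The predicted class is the one with the highest probability.
predicted_class = classes[int(np.argmax(probabilities))]
print(predicted_class)
```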

When there are only two classes, we talk about binary classification. In this case, the model returns a single probability per example.

Classification labels can be strings, integers, or boolean values.

Training a classification model¶

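The task of a model (e.g., classification, regression) is determined by the task learner argument. The default value of this argument is ydf.Task.CLASSIFICATION, which means that, by default, YDF trains classification models.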

In [8]

# Load libraries
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets

# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples
train_ds.head(5)

Out[8]

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	44	Private	228057	7th-8th	4	Married-civ-spouse	Machine-op-inspct	Wife	White	Female	0	40	Dominican-Republic	<=50K
1	20	Private	299047	Some-college	10	Never-married	Other-service	Not-in-family	White	Female	0	20	United-States	<=50K
2	40	Private	342164	HS-grad	9	Separated	Adm-clerical	Unmarried	White	Female	0	37	United-States	<=50K
3	30	Private	361742	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	50	United-States	<=50K
4	67	Self-emp-inc	171564	HS-grad	9	Married-civ-spouse	Prof-specialty	Wife	White	Female	20051	30	England	>50K

The label column is:

In [9]

train_ds["income"]

Out[9]

	income
0	<=50K
1	<=50K
2	<=50K
3	<=50K
4	>50K
...	...
22787	<=50K
22788	>50K
22789	<=50K
22790	<=50K
22791	<=50K

22792 rows × 1 columns

dtype: object

We can train a classification model:

In [10]

# Note: ydf.Task.CLASSIFICATION is the default value of "task"
model = ydf.RandomForestLearner(label="income",
                                task=ydf.Task.CLASSIFICATION).train(train_ds)

Train model on 22792 examples
Model trained in 0:00:01.808830

Classification models are evaluated using accuracy, the confusion matrix, ROC-AUC, and PR-AUC. You can plot a rich evaluation with ROC and PR plots.

In [11]

evaluation = model.evaluate(test_ds)

evaluation

Out[11]

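Evaluation of classification models

Accuracy

The simplest metric. It is the percentage of predictions that are correct (i.e., that match the ground truth).
Example: If a model correctly identifies 90 out of 100 images as cat or dog, the accuracy is 90%.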

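Confusion Matrix

A table that shows the counts of:

True Positives (TP): The model correctly predicted positive.
True Negatives (TN): The model correctly predicted negative.
False Positives (FP): The model incorrectly predicted positive (a "false alarm").
False Negatives (FN): The model incorrectly predicted negative (a "miss").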
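
The four counts can be computed directly from labels and predictions. The following is a minimal NumPy sketch on toy data; the helper name confusion_counts is hypothetical and is not part of YDF.

```python
import numpy as np

def confusion_counts(labels, predictions):
    """Returns (TP, TN, FP, FN) for boolean labels and predictions."""
    labels = np.asarray(labels, dtype=bool)
    predictions = np.asarray(predictions, dtype=bool)
    tp = int(np.sum(labels & predictions))    # Positive, predicted positive.
    tn = int(np.sum(~labels & ~predictions))  # Negative, predicted negative.
    fp = int(np.sum(~labels & predictions))   # Negative, predicted positive.
    fn = int(np.sum(labels & ~predictions))   # Positive, predicted negative.
    return tp, tn, fp, fn

# Toy data: True means "positive".
labels = [True, True, False, False, True, False]
predictions = [True, False, False, True, True, False]
print(confusion_counts(labels, predictions))
```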

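Threshold

YDF classification models predict a probability for each class. A threshold determines the cutoff for classifying something as positive or negative.
Example: If the threshold is 0.5, any prediction above 0.5 might be classified as "spam", and any prediction below 0.5 as "not spam".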

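ROC Curve (Receiver Operating Characteristic Curve)

A plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) at various thresholds.

TPR (Sensitivity or Recall): TP / (TP + FN) - How many of the actual positives did the model catch?
FPR: FP / (FP + TN) - How many negatives were incorrectly classified as positive?

Interpretation: A good model has an ROC curve that hugs the top-left corner (high TPR, low FPR).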
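
To make the construction of the curve concrete, the sketch below computes one (FPR, TPR) point per threshold on toy data. It illustrates the definitions above only; the function name roc_points is hypothetical, and this is not how YDF computes its plots.

```python
import numpy as np

def roc_points(labels, scores, thresholds):
    """Returns one (FPR, TPR) pair per threshold."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    points = []
    for threshold in thresholds:
        predicted_positive = scores >= threshold
        tp = np.sum(predicted_positive & labels)
        fp = np.sum(predicted_positive & ~labels)
        tpr = tp / np.sum(labels)    # TP / (TP + FN)
        fpr = fp / np.sum(~labels)   # FP / (FP + TN)
        points.append((float(fpr), float(tpr)))
    return points

# Toy labels (1 = positive) and predicted probabilities of the positive class.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for fpr, tpr in roc_points(labels, scores, thresholds=[0.9, 0.5, 0.3, 0.0]):
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Lowering the threshold moves the point from the bottom-left corner (nothing predicted positive) to the top-right corner (everything predicted positive).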

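AUC (Area Under the ROC Curve)

A single number that summarizes the overall performance shown by the ROC curve. The AUC does not depend on the choice of a single threshold, which makes it a more stable metric than accuracy. For multi-class classification models, each class is evaluated against all the other classes.
Interpretation: Ranges from 0 to 1. A perfect model has an AUC of 1, while a random model has an AUC of 0.5. Higher is better.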
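
One way to build intuition for these values: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The sketch below computes the AUC this way on toy data. The function name roc_auc is hypothetical, and the pairwise comparison is chosen for clarity, not for efficiency.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    positives = scores[labels]
    negatives = scores[~labels]
    # Compare every positive score with every negative score.
    diff = positives[:, None] - negatives[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(positives) * len(negatives)))

labels = [0, 0, 1, 1]
print(roc_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # One pair is ranked incorrectly.
print(roc_auc(labels, [0.1, 0.2, 0.8, 0.9]))   # Perfect ranking.
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # Uninformative constant score.
```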

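Precision-Recall Curve

A plot of precision against recall at various thresholds.

Precision: TP / (TP + FP) - Out of all the examples the model labeled as positive, how many were actually positive?
Recall (same as TPR): TP / (TP + FN) - Out of all the actual positive examples, how many did the model correctly identify?

Interpretation: A good model has a curve that stays high (both high precision and high recall). It is especially useful when dealing with imbalanced datasets (e.g., when one class is much rarer than the other).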
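
The following toy example (made-up data, plain NumPy) shows why precision and recall are informative on an imbalanced dataset: the accuracy looks high because negatives dominate, while precision and recall reveal that the rare class is handled poorly.

```python
import numpy as np

# Toy imbalanced dataset: 5 positives followed by 95 negatives.
labels = np.array([True] * 5 + [False] * 95)

# Hypothetical predictions: 3 of the 5 positives are found, and 6 negatives
# are wrongly flagged as positive.
predictions = np.zeros(100, dtype=bool)
predictions[:3] = True    # 3 true positives
predictions[5:11] = True  # 6 false positives

tp = np.sum(labels & predictions)
fp = np.sum(~labels & predictions)
fn = np.sum(labels & ~predictions)

accuracy = np.mean(labels == predictions)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```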

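PR-AUC (Area Under the Precision-Recall Curve)

Similar to AUC, but for the Precision-Recall curve. A single number that summarizes performance. For multi-class classification models, each class is evaluated against all the other classes. Higher is better.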

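Threshold / Accuracy Curve

A plot that shows how the model's accuracy changes as the classification threshold varies.

Threshold / Volume Curve

A plot that shows how the number of data points classified as positive changes as the threshold varies.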
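
Both curves can be sketched by sweeping the threshold and recording, for each value, the accuracy and the number of examples classified as positive. The example below does this on toy data. The helper accuracy_and_volume is hypothetical, and "volume" is taken here to be a raw count.

```python
import numpy as np

# Toy labels (True = positive) and predicted probabilities of the positive class.
labels = np.array([0, 0, 1, 1, 1], dtype=bool)
scores = np.array([0.2, 0.6, 0.4, 0.7, 0.9])

def accuracy_and_volume(labels, scores, threshold):
    """Returns the accuracy and the number of predicted positives."""
    predicted_positive = scores >= threshold
    accuracy = float(np.mean(predicted_positive == labels))
    volume = int(np.sum(predicted_positive))
    return accuracy, volume

for threshold in (0.3, 0.5, 0.8):
    accuracy, volume = accuracy_and_volume(labels, scores, threshold)
    print(f"threshold={threshold}: accuracy={accuracy:.1f}, volume={volume}")
```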

accuracy

0.8658

AUC: '>50K' vs others

0.909324

PR-AUC: '>50K' vs others

0.790127

loss

0.386015

num examples

9769

num examples (weighted)

9769

Confusion matrix

Label \ Pred	<=50K	>50K
<=50K	6962	861
>50K	450	1496
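
As a sanity check, the reported accuracy can be recovered by hand from the confusion matrix above. The sketch below copies the four counts from the table and treats '>50K' as the positive class (an assumption made for this example, in line with the "'>50K' vs others" metrics). It also derives the recall, false positive rate, and precision of the predicted classes.

```python
# Counts copied from the confusion matrix above (rows are labels, columns are
# predictions), with '>50K' treated as the positive class.
tn, fp = 6962, 861   # Examples labeled '<=50K'.
fn, tp = 450, 1496   # Examples labeled '>50K'.

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)     # Also called TPR.
fpr = fp / (fp + tn)
precision = tp / (tp + fp)

print(f"examples={total} accuracy={accuracy:.4f}")
print(f"recall={recall:.4f} fpr={fpr:.4f} precision={precision:.4f}")
```

The total matches the reported number of examples (9769), and the accuracy matches the reported 0.8658.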

The evaluation metrics can be accessed directly on the evaluation object.

In [12]

print(evaluation.accuracy)

0.8657999795270754

Making predictions¶

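Classification models predict the probability of the label classes. A binary classification model outputs a single probability per example. In the outputs on this page, that value is the probability of the second class in model.label_classes() ('>50K'): the examples with a low predicted value are the ones that model.predict_class() assigns to '<=50K'.

In [13]

# Print the label classes.
print(model.label_classes())
# Predict one probability per example (binary classification).
print(model.predict(test_ds))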

['<=50K', '>50K']
[0.01333333 0.12999995 0.9499992  ... 0.06000001 0.02333334 0.        ]

We can also directly predict the most likely class.

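Warning: Always use model.predict_class(), or manually check the order of the classes with model.label_classes(). Note that the order of the label classes may change depending on the training dataset or if YDF is updated.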

In [14]

model.predict_class(test_ds)

Out[14]

array(['<=50K', '<=50K', '>50K', ..., '<=50K', '<=50K', '<=50K'],
      shape=(9769,), dtype='<U5')
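
For readers who want to check the class order manually, the sketch below maps probabilities to class names with plain NumPy. The values are copied from the outputs shown above. Two points are assumptions: the probability is taken to refer to the second entry of the class list (this matches the predict_class output on this page), and 0.5 is an illustrative threshold. In real code, calling model.predict_class() avoids relying on either.

```python
import numpy as np

# Values copied from the outputs above; in practice they would come from
# model.label_classes() and model.predict(test_ds).
label_classes = ["<=50K", ">50K"]
probabilities = np.array([0.01333333, 0.12999995, 0.9499992])

# Assumption: each probability refers to label_classes[1].
# Assumption: 0.5 is used as the decision threshold.
class_indices = (probabilities >= 0.5).astype(int)
predicted = np.take(label_classes, class_indices)
print(predicted)
```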