In [ ]
pip install ydf -U
What is a tf.data.Dataset?¶
A tf.data.Dataset is a runtime dataset format of the TensorFlow and JAX machine learning libraries. It makes it easy to load datasets from many different formats and to apply transformations to them. Yggdrasil Decision Forests (YDF) can consume tf.data.Datasets natively.
A tf.data.Dataset should not be confused with TensorFlow Datasets (TFDS), a collection of ready-to-use datasets for machine learning practitioners. Note that some TFDS datasets are also available as tf.data.Datasets.
When using a tf.data.Dataset with YDF:
- Make sure the dataset is finite, i.e. it does not repeat indefinitely, and do not shuffle the dataset (see the check below).
- Unlike for neural networks, the batch size of the dataset does not affect the YDF model. However, small batch sizes can be slow for TensorFlow, so large batch sizes are recommended; 1000 is a good rule of thumb.
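For example, here is a minimal sketch of such a finiteness check. The dataset below is a placeholder invented for illustration; substitute your own tf.data.Dataset.

import tensorflow as tf

# Placeholder dataset for illustration only.
my_dataset = tf.data.Dataset.range(10).batch(1000)

# cardinality() returns tf.data.INFINITE_CARDINALITY when a dataset repeats
# forever (e.g. after calling .repeat() without an argument), so this guards
# against accidentally passing an infinite dataset to YDF.
assert my_dataset.cardinality() != tf.data.INFINITE_CARDINALITY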
Creating a tf.data.Dataset¶
There are several ways to create a tf.data.Dataset. Here, we use tf.data.Dataset.from_tensor_slices to convert Python lists and arrays into a tf.data.Dataset. This is for demonstration purposes only, since it is more efficient to feed NumPy arrays directly to YDF.
In [1]
import ydf
import numpy as np
import tensorflow as tf
2023-11-19 18:08:44.092683: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-19 18:08:44.143396: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-19 18:08:44.144583: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-19 18:08:45.101126: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
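As a minimal sketch of the from_tensor_slices approach mentioned above, here is a tiny in-memory dataset, reusing the imports from the cell above. The feature names and values are invented for illustration; as noted, feeding the NumPy arrays directly to YDF would be more efficient.

# Build a small tf.data.Dataset from in-memory Python lists.
toy_ds = tf.data.Dataset.from_tensor_slices({
    "feature_1": [1.0, 2.0, 3.0],
    "feature_2": ["a", "b", "c"],
    "label": [0, 1, 0],
}).batch(1000)  # A large batch size; it does not affect the YDF model.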
Let's download a dataset stored in the TFRecord format. TFRecord is a common container format used to store serialized TensorFlow Example protos. TFRecord files are usually compressed with gzip. When opening a compressed TFRecord file, the `compression_type` must be specified to avoid an invalid-file error.
In [3]
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_train.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_test.recordio.gz -q
Unlike with `pandas.read_csv`, the features to load must be specified explicitly when reading TFRecords with a tf.data.Dataset.
In [18]
def create_tf_data_dataset(path):
serialized_examples = tf.data.TFRecordDataset(filenames=[path], compression_type="GZIP")
def parse_tf_example(serialized_example):
"""Parse a binary serialized tf.Example."""
return tf.io.parse_single_example(
serialized_example,
{
"age": tf.io.FixedLenFeature([], dtype=tf.int64),
"capital_gain": tf.io.FixedLenFeature([], dtype=tf.int64),
"hours_per_week": tf.io.FixedLenFeature([], dtype=tf.int64),
"workclass": tf.io.FixedLenFeature([], dtype=tf.string, default_value=""),
"education": tf.io.FixedLenFeature([], dtype=tf.string),
"income": tf.io.FixedLenFeature([], dtype=tf.string),
# Those are just a few features available in the dataset.
}
)
return serialized_examples.map(parse_tf_example)
non_batched_train_ds = create_tf_data_dataset("adult_train.recordio.gz")
non_batched_test_ds = create_tf_data_dataset("adult_test.recordio.gz")
It is easier to inspect the loaded examples before applying the batch operator.
In [13]
for example in non_batched_train_ds.take(5):
print(example)
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=44>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'7th-8th'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=37>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=50>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=67>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=20051>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'>50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Self-emp-inc'>}
As mentioned earlier, the batch size does not affect the model; 1000 is a good default value.
In [19]
train_ds = non_batched_train_ds.batch(1000)
test_ds = non_batched_test_ds.batch(1000)
Training a model¶
All YDF methods (e.g. training, evaluation, analysis) consume tf.data.Datasets natively.
In [20]
learner = ydf.GradientBoostedTreesLearner(label="income")
model = learner.train(train_ds)
learner = ydf.GradientBoostedTreesLearner(label="income") model = learner.train(train_ds)
Warning: Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.
WARNING:absl:Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.
Train model on 22792 examples
Model trained in 0:00:05.323891
We can then evaluate the model.
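A minimal sketch of that next step, reusing the model and the batched test_ds built above (accuracy is one of the metrics exposed by the returned evaluation object):

# Evaluate the model on the test dataset; evaluate() consumes the
# tf.data.Dataset directly, just like train() did.
evaluation = model.evaluate(test_ds)
print(evaluation.accuracy)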