In [ ]
pip install ydf -U
What is a tf.data.Dataset?¶
A tf.data.Dataset is a runtime dataset format of the TensorFlow and JAX machine learning libraries. It makes it easy to load datasets from many different formats and to apply transformations to them. Yggdrasil Decision Forests (YDF) can consume tf.data.Datasets natively.
A tf.data.Dataset should not be confused with TensorFlow Datasets (TFDS), a collection of ready-to-use datasets for machine learning practitioners. Note that some TFDS datasets are also available as tf.data.Datasets.
When using a tf.data.Dataset with YDF:
- Make sure the dataset is finite, i.e. it does not repeat indefinitely, and do not shuffle the dataset (see the check below).
- Unlike for neural networks, the batch size of the dataset does not affect the YDF model. However, small batch sizes can be slow for TensorFlow, so large batch sizes are recommended; 1000 is a good rule of thumb.
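For example, here is a minimal sketch of such a finiteness check. The dataset below is a placeholder invented for illustration; substitute your own tf.data.Dataset.

import tensorflow as tf

# Placeholder dataset for illustration only.
my_dataset = tf.data.Dataset.range(10).batch(1000)

# cardinality() returns tf.data.INFINITE_CARDINALITY when a dataset repeats
# forever (e.g. after calling .repeat() without an argument), so this guards
# against accidentally passing an infinite dataset to YDF.
assert my_dataset.cardinality() != tf.data.INFINITE_CARDINALITY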
Creating a tf.data.Dataset¶
There are several ways to create a tf.data.Dataset. Here, we use tf.data.Dataset.from_tensor_slices to convert Python lists and arrays into a tf.data.Dataset. This is for demonstration purposes only, since it is more efficient to feed NumPy arrays directly to YDF.
In [1]
import ydf
import numpy as np
import tensorflow as tf
2023-11-19 18:08:44.092683: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-19 18:08:44.143396: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-19 18:08:44.144583: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-19 18:08:45.101126: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
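As a minimal sketch of the from_tensor_slices approach mentioned above, here is a tiny in-memory dataset, reusing the imports from the cell above. The feature names and values are invented for illustration; as noted, feeding the NumPy arrays directly to YDF would be more efficient.

# Build a small tf.data.Dataset from in-memory Python lists.
toy_ds = tf.data.Dataset.from_tensor_slices({
    "feature_1": [1.0, 2.0, 3.0],
    "feature_2": ["a", "b", "c"],
    "label": [0, 1, 0],
}).batch(1000)  # A large batch size; it does not affect the YDF model.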
Let's download a dataset stored in the TFRecord format. TFRecord is a common container format used to store serialized TensorFlow Example protos. TFRecord files are usually compressed with gzip. When opening a compressed TFRecord file, the `compression_type` must be specified to avoid an invalid-file error.
In [3]
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_train.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_test.recordio.gz -q
Unlike with `pandas.read_csv`, the features to load must be specified explicitly when reading TFRecords with a tf.data.Dataset.
In [18]
def create_tf_data_dataset(path):
serialized_examples = tf.data.TFRecordDataset(filenames=[path], compression_type="GZIP")
def parse_tf_example(serialized_example):
"""Parse a binary serialized tf.Example."""
return tf.io.parse_single_example(
serialized_example,
{
"age": tf.io.FixedLenFeature([], dtype=tf.int64),
"capital_gain": tf.io.FixedLenFeature([], dtype=tf.int64),
"hours_per_week": tf.io.FixedLenFeature([], dtype=tf.int64),
"workclass": tf.io.FixedLenFeature([], dtype=tf.string, default_value=""),
"education": tf.io.FixedLenFeature([], dtype=tf.string),
"income": tf.io.FixedLenFeature([], dtype=tf.string),
# Those are just a few features available in the dataset.
}
)
return serialized_examples.map(parse_tf_example)
non_batched_train_ds = create_tf_data_dataset("adult_train.recordio.gz")
non_batched_test_ds = create_tf_data_dataset("adult_test.recordio.gz")
It is easier to inspect the loaded examples before applying the batch operator.
In [13]
for example in non_batched_train_ds.take(5):
print(example)
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=44>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'7th-8th'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=37>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=50>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=67>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=20051>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'>50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Self-emp-inc'>}
As mentioned earlier, the batch size does not affect the model; 1000 is a good default value.
In [19]
train_ds = non_batched_train_ds.batch(1000)
test_ds = non_batched_test_ds.batch(1000)
Training a model¶
All YDF methods (e.g. training, evaluation, analysis) consume tf.data.Datasets natively.
In [20]
learner = ydf.GradientBoostedTreesLearner(label="income")
model = learner.train(train_ds)
learner = ydf.GradientBoostedTreesLearner(label="income") model = learner.train(train_ds)
Warning: Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.
WARNING:absl:Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.
Train model on 22792 examples
Model trained in 0:00:05.323891
We can then evaluate the model.
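A minimal sketch of that next step, reusing the model and the batched test_ds built above (accuracy is one of the metrics exposed by the returned evaluation object):

# Evaluate the model on the test dataset; evaluate() consumes the
# tf.data.Dataset directly, just like train() did.
evaluation = model.evaluate(test_ds)
print(evaluation.accuracy)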