Utilities¶

verbose ¶

verbose(level: Union[int, bool] = 2) -> int

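Sets the verbosity level of YDF.

The verbosity levels are:

0 or False: Print no logs.
1 or True: Print a few logs in a Colab or notebook cell, and all logs in the console. This is the default verbosity level.
2: Print all logs on all surfaces.

Usage example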

import pandas as pd
import ydf

save_verbose = ydf.verbose(0)  # Hide all logs
learner = ydf.RandomForestLearner(label="label")
model = learner.train(pd.DataFrame({"feature": [0, 1], "label": [0, 1]}))
ydf.verbose(save_verbose)  # Restore verbose level

Parameters

Name	Type	Description	Default
`level`	`Union[int, bool]`	New verbosity level.	`2`

Returns

Type	Description
`int`	The previous verbosity level.
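
The save-and-restore pattern from the usage example can be wrapped in a context manager, so the previous level is restored even if training raises. The sketch below is a hypothetical helper and not part of YDF: `temporary_level` and the stand-in setter `fake_verbose` are names introduced here for illustration. It relies only on the documented contract that the setter returns the previous level.

```python
import contextlib
from typing import Callable, Iterator


@contextlib.contextmanager
def temporary_level(set_level: Callable[[int], int], level: int) -> Iterator[None]:
    """Temporarily applies `level`, then restores the previous level.

    `set_level` is any setter that returns the previous level, which is the
    contract documented for `ydf.verbose`.
    """
    previous = set_level(level)
    try:
        yield
    finally:
        set_level(previous)  # Runs even if the body raises.


# Stand-in setter so the sketch runs without YDF installed (hypothetical).
_state = {"level": 1}


def fake_verbose(level: int) -> int:
    previous = _state["level"]
    _state["level"] = int(level)
    return previous


with temporary_level(fake_verbose, 0):
    level_inside = _state["level"]  # Logs would be hidden here.
level_after = _state["level"]  # Back to the original level.
```

With YDF installed, the intended use would be `with temporary_level(ydf.verbose, 0): ...`.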

load_model ¶

load_model(
    directory: str,
    advanced_options: ModelIOOptions = ModelIOOptions(),
) -> ModelType

Loads a YDF model from disk.

Usage example

import pandas as pd
import ydf

# Create a model
dataset = pd.DataFrame({"feature": [0, 1], "label": [0, 1]})
learner = ydf.RandomForestLearner(label="label")
model = learner.train(dataset)

# Save model
model.save("/tmp/my_model")

# Load model
loaded_model = ydf.load_model("/tmp/my_model")

# Make predictions
model.predict(dataset)
loaded_model.predict(dataset)

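If the directory contains multiple YDF models, each model is uniquely identified by its prefix. The prefix to use can be specified in the advanced options. If the directory contains a single model, the correct prefix is detected automatically.

Parameters

Name	Type	Description	Default
`directory`	`str`	Directory containing the model.	required
`advanced_options`	`ModelIOOptions`	Advanced options for model loading.	`ModelIOOptions()`

Returns

Type	Description
`ModelType`	Model to use for inference, evaluation or inspection.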

deserialize_model ¶

deserialize_model(data: bytes) -> ModelType

Loads a serialized YDF model.

Usage example

import pandas as pd
import ydf

# Create a model
dataset = pd.DataFrame({"feature": [0, 1], "label": [0, 1]})
learner = ydf.RandomForestLearner(label="label")
model = learner.train(dataset)

# Serialize model
# Note: serialized_model is a bytes.
serialized_model = model.serialize()

# Deserialize model
deserialized_model = ydf.deserialize_model(serialized_model)

# Make predictions
model.predict(dataset)
deserialized_model.predict(dataset)

Parameters

Name	Type	Description	Default
`data`	`bytes`	Serialized model.	required

Returns

Type	Description
`ModelType`	Model to use for inference, evaluation or inspection.

Feature `dataclass` ¶

Feature(
    name: str,
    semantic: Optional[Semantic] = None,
    max_vocab_count: Optional[int] = None,
    min_vocab_frequency: Optional[int] = None,
    num_discretized_numerical_bins: Optional[int] = None,
    monotonic: MonotonicConstraint = None,
)

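Bases: object

Semantic and parameters of a single column.

This class is used to:

Limit the input features of the model.
Manually specify the semantic of a feature.
Specify feature-specific hyperparameters.

Attributes

Name	Type	Description
`name`	`str`	Name of the column or feature.
`semantic`	`Optional[Semantic]`	Semantic of the column. If None, the semantic is determined automatically. The semantic controls how the model interprets the column. Using the wrong semantic (e.g. numerical instead of categorical) hurts model quality.
`max_vocab_count`	`Optional[int]`	For CATEGORICAL and CATEGORICAL_SET columns only. Number of unique categorical values stored as strings. If more categorical values are present, the least frequent values are grouped into an out-of-vocabulary item. Reducing this value can improve or hurt the model. If max_vocab_count = -1, the number of values in the column is not limited.
`min_vocab_frequency`	`Optional[int]`	For CATEGORICAL and CATEGORICAL_SET columns only. Minimum number of occurrences of a categorical value. Values that appear fewer than `min_vocab_frequency` times in the training dataset are treated as out-of-vocabulary.
`num_discretized_numerical_bins`	`Optional[int]`	For DISCRETIZED_NUMERICAL columns only. Number of bins used to discretize the column. Defaults to 255 bins, i.e. 254 boundaries.
`monotonic`	`MonotonicConstraint`	Monotonic constraint between the feature and the model output. Use `None` (default; or 0) for an unconstrained feature. Use `Monotonic.INCREASING` (or +1) to ensure the model is monotonically increasing with the feature. Use `Monotonic.DECREASING` (or -1) to ensure the model is monotonically decreasing with the feature.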
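
To make the interaction of `max_vocab_count` and `min_vocab_frequency` concrete, here is a minimal pure-Python sketch of the rule described in the table. It is an illustration under assumptions, not YDF's implementation: `build_vocabulary`, `encode` and the `"<OOV>"` marker are hypothetical names, the tie-breaking is arbitrary, and whether YDF counts the out-of-vocabulary item toward the limit is not stated here.

```python
from collections import Counter
from typing import Dict, Iterable

OOV = "<OOV>"  # Hypothetical marker for the out-of-vocabulary item.


def build_vocabulary(
    values: Iterable[str],
    max_vocab_count: int = 2000,
    min_vocab_frequency: int = 5,
) -> Dict[str, int]:
    """Returns the retained values and their counts; everything else is OOV."""
    counts = Counter(values)
    # Drop values that appear fewer than `min_vocab_frequency` times.
    frequent = [(v, c) for v, c in counts.items() if c >= min_vocab_frequency]
    # Most frequent first; ties broken alphabetically for determinism.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    if max_vocab_count >= 0:  # -1 means "no limit".
        frequent = frequent[:max_vocab_count]
    return dict(frequent)


def encode(value: str, vocabulary: Dict[str, int]) -> str:
    """Maps a value to itself if retained, otherwise to the OOV item."""
    return value if value in vocabulary else OOV


column = ["red"] * 6 + ["blue"] * 5 + ["green"] * 5 + ["teal"] * 2
vocab = build_vocabulary(column, max_vocab_count=2, min_vocab_frequency=3)
```

In this toy column, "teal" is removed by the frequency threshold and "green" by the size limit, so both would be encoded as out-of-vocabulary.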

max_vocab_count `class-attribute` `instance-attribute` ¶

max_vocab_count: Optional[int] = None

min_vocab_frequency `class-attribute` `instance-attribute` ¶

min_vocab_frequency: Optional[int] = None

monotonic `class-attribute` `instance-attribute` ¶

monotonic: MonotonicConstraint = None

name `instance-attribute` ¶

name: str

normalized_monotonic `property` ¶

normalized_monotonic: Optional[Monotonic]

Returns the normalized version of the "monotonic" attribute.

num_discretized_numerical_bins `class-attribute` `instance-attribute` ¶

num_discretized_numerical_bins: Optional[int] = None

semantic `class-attribute` `instance-attribute` ¶

semantic: Optional[Semantic] = None

from_column_def `classmethod` ¶

from_column_def(column_def: ColumnDef)

Converts a ColumnDef to a Column.

to_proto_column_guide ¶

to_proto_column_guide() -> ColumnGuide

Creates a proto ColumnGuide from the given specification.

Column `dataclass` ¶

Column(
    name: str,
    semantic: Optional[Semantic] = None,
    max_vocab_count: Optional[int] = None,
    min_vocab_frequency: Optional[int] = None,
    num_discretized_numerical_bins: Optional[int] = None,
    monotonic: MonotonicConstraint = None,
)

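Bases: object

Semantic and parameters of a single column.

This class is used to:

Limit the input features of the model.
Manually specify the semantic of a feature.
Specify feature-specific hyperparameters.

Attributes

Name	Type	Description
`name`	`str`	Name of the column or feature.
`semantic`	`Optional[Semantic]`	Semantic of the column. If None, the semantic is determined automatically. The semantic controls how the model interprets the column. Using the wrong semantic (e.g. numerical instead of categorical) hurts model quality.
`max_vocab_count`	`Optional[int]`	For CATEGORICAL and CATEGORICAL_SET columns only. Number of unique categorical values stored as strings. If more categorical values are present, the least frequent values are grouped into an out-of-vocabulary item. Reducing this value can improve or hurt the model. If max_vocab_count = -1, the number of values in the column is not limited.
`min_vocab_frequency`	`Optional[int]`	For CATEGORICAL and CATEGORICAL_SET columns only. Minimum number of occurrences of a categorical value. Values that appear fewer than `min_vocab_frequency` times in the training dataset are treated as out-of-vocabulary.
`num_discretized_numerical_bins`	`Optional[int]`	For DISCRETIZED_NUMERICAL columns only. Number of bins used to discretize the column. Defaults to 255 bins, i.e. 254 boundaries.
`monotonic`	`MonotonicConstraint`	Monotonic constraint between the feature and the model output. Use `None` (default; or 0) for an unconstrained feature. Use `Monotonic.INCREASING` (or +1) to ensure the model is monotonically increasing with the feature. Use `Monotonic.DECREASING` (or -1) to ensure the model is monotonically decreasing with the feature.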
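
The relation "N bins, i.e. N - 1 boundaries" stated for `num_discretized_numerical_bins` can be illustrated with a small sketch. The quantile-based choice of boundaries below is an assumption made for the example (how YDF selects boundaries is not described here), and `quantile_boundaries` and `bin_index` are hypothetical helper names.

```python
import bisect
from typing import List, Sequence


def quantile_boundaries(values: Sequence[float], num_bins: int) -> List[float]:
    """Picks `num_bins - 1` boundaries from the sorted values (simple heuristic)."""
    ordered = sorted(values)
    boundaries = []
    for i in range(1, num_bins):
        # Position of the i-th quantile among the sorted values.
        index = min(len(ordered) - 1, (i * len(ordered)) // num_bins)
        boundaries.append(ordered[index])
    return boundaries


def bin_index(value: float, boundaries: Sequence[float]) -> int:
    """Returns the bin of `value`: the number of boundaries that are <= value."""
    return bisect.bisect_right(boundaries, value)


values = [float(v) for v in range(100)]
boundaries = quantile_boundaries(values, num_bins=4)  # 4 bins -> 3 boundaries.
```

A model trained on the discretized column only sees the bin index, which is why training is faster but some resolution is lost.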

max_vocab_count `class-attribute` `instance-attribute` ¶

max_vocab_count: Optional[int] = None

min_vocab_frequency `class-attribute` `instance-attribute` ¶

min_vocab_frequency: Optional[int] = None

monotonic `class-attribute` `instance-attribute` ¶

monotonic: MonotonicConstraint = None

name `instance-attribute` ¶

name: str

normalized_monotonic `property` ¶

normalized_monotonic: Optional[Monotonic]

Returns the normalized version of the "monotonic" attribute.

num_discretized_numerical_bins `class-attribute` `instance-attribute` ¶

num_discretized_numerical_bins: Optional[int] = None

semantic `class-attribute` `instance-attribute` ¶

semantic: Optional[Semantic] = None

from_column_def `classmethod` ¶

from_column_def(column_def: ColumnDef)

Converts a ColumnDef to a Column.

to_proto_column_guide ¶

to_proto_column_guide() -> ColumnGuide

Creates a proto ColumnGuide from the given specification.

Task ¶

Bases: Enum

Task solved by a model.

Usage example

learner = ydf.RandomForestLearner(label="income",
                                  task=ydf.Task.CLASSIFICATION)
model = learner.train(dataset)
assert model.task() == ydf.Task.CLASSIFICATION

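Not all tasks are compatible with all learners and/or hyperparameters. For more information, see the tutorials on the individual tasks.

Attributes

Name	Type	Description
`CLASSIFICATION`		Predict a categorical label, i.e. an item of an enumeration.
`REGRESSION`		Predict a numerical label, i.e. a quantity.
`RANKING`		Rank items by label values. With the default NDCG settings, the label is expected to be between 0 and 4 with NDCG semantic (0: completely unrelated, 4: perfect match).
`CATEGORICAL_UPLIFT`		Predict the incremental impact of a treatment on a categorical outcome.
`NUMERICAL_UPLIFT`		Predict the incremental impact of a treatment on a numerical outcome.
`ANOMALY_DETECTION`		Predict whether an instance is similar to the majority of the training data or anomalous (a.k.a. an outlier). An anomaly detection prediction is a value between 0 and 1, where 0 indicates the most likely normal instance and 1 the most likely anomalous instance.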

ANOMALY_DETECTION `class-attribute` `instance-attribute` ¶

ANOMALY_DETECTION = 'ANOMALY_DETECTION'

CATEGORICAL_UPLIFT `class-attribute` `instance-attribute` ¶

CATEGORICAL_UPLIFT = 'CATEGORICAL_UPLIFT'

CLASSIFICATION `class-attribute` `instance-attribute` ¶

CLASSIFICATION = 'CLASSIFICATION'

NUMERICAL_UPLIFT `class-attribute` `instance-attribute` ¶

NUMERICAL_UPLIFT = 'NUMERICAL_UPLIFT'

RANKING `class-attribute` `instance-attribute` ¶

RANKING = 'RANKING'

REGRESSION `class-attribute` `instance-attribute` ¶

REGRESSION = 'REGRESSION'

Semantic ¶

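Bases: Enum

Semantic (e.g. numerical, categorical) of a column.

Determines how a column is interpreted by the model. Similar to the "ColumnType" of YDF's DataSpecification.

Attributes

Name	Type	Description
`NUMERICAL`		Numerical value. Generally for quantities or counts with full ordering. For example, the age of a person, or the number of items in a bag. Can be a float or an integer. Missing values are represented by math.nan.
`CATEGORICAL`		A categorical value. Generally for a type/class in a finite set of possible values without ordering. For example, the color RED in the set {RED, BLUE, GREEN}. Can be a string or an integer. Missing values are represented by "" (empty string) or the value -2. An out-of-vocabulary value (i.e. a value never seen in training) is represented by any new string value or the value -1. Integer categorical values: (1) The training logic and model representation are optimized under the assumption that values are dense. (2) Values are stored internally as int32 and should be <~2B. (3) The number of possible values is computed automatically from the training dataset. During inference, integer values greater than any value seen during training are treated as out-of-vocabulary. (4) Minimum frequency and maximum vocabulary size constraints do not apply.
`HASH`		The hash of a string value. Used when only the equality between values matters (not the value itself). Currently only used for groups in ranking problems, e.g. the query in a query/document problem. The hash is computed with Google's farmhash and stored as a uint64.
`CATEGORICAL_SET`		Set of categorical values. Well suited to represent tokenized text. Can be a string. Unlike CATEGORICAL, the number of items in a CATEGORICAL_SET can change between examples. The order of values inside a feature value does not matter.
`BOOLEAN`		Boolean value. Can be a float or an integer. Missing values are represented by math.nan. If a numerical tensor contains multiple values, its size should be constant, and each dimension is treated independently (each dimension should always have the same "meaning").
`DISCRETIZED_NUMERICAL`		Numerical values automatically discretized into bins. Discretized numerical columns are faster to train than (non-discretized) numerical columns, but they can negatively impact model quality. Using `discretize_numerical_columns=True` is equivalent to setting the column semantic to DISCRETIZED_NUMERICAL in the `columns` argument. See the definition of DISCRETIZED_NUMERICAL for more details.
`NUMERICAL_VECTOR_SEQUENCE`		Each value of a vector sequence feature is a sequence (i.e. an ordered list) of fixed-size numerical vectors. All the vectors in a vector sequence should have the same size, but each vector sequence can have a different number of vectors. Vector sequences are suited, for example, to represent multivariate time series or lists of LLM tokens.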

BOOLEAN `class-attribute` `instance-attribute` ¶

BOOLEAN = 5

CATEGORICAL `class-attribute` `instance-attribute` ¶

CATEGORICAL = 2

CATEGORICAL_SET `class-attribute` `instance-attribute` ¶

CATEGORICAL_SET = 4

DISCRETIZED_NUMERICAL `class-attribute` `instance-attribute` ¶

DISCRETIZED_NUMERICAL = 6

HASH `class-attribute` `instance-attribute` ¶

HASH = 3

NUMERICAL `class-attribute` `instance-attribute` ¶

NUMERICAL = 1

NUMERICAL_VECTOR_SEQUENCE `class-attribute` `instance-attribute` ¶

NUMERICAL_VECTOR_SEQUENCE = 7

from_proto_type `classmethod` ¶

from_proto_type(column_type: ColumnType)

to_proto_type ¶

to_proto_type() -> ColumnType

evaluate_predictions ¶

evaluate_predictions(
    predictions: ndarray,
    labels: ndarray,
    task: Task,
    *,
    weights: Optional[ndarray] = None,
    label_classes: Optional[List[str]] = None,
    ranking_groups: Optional[ndarray] = None,
    bootstrapping: Union[bool, int] = False,
    ndcg_truncation: int = 5,
    mrr_truncation: int = 5,
    random_seed: int = 1234,
    num_threads: Optional[int] = None
) -> Evaluation

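Evaluates predictions against labels.

This function evaluates the predictions of any model (possibly not a YDF model) against labels, using YDF's evaluation format.

YDF models should be evaluated directly with model.evaluate, which is more efficient and convenient.

For binary classification tasks, predictions should contain the predicted probabilities and have shape [n] or [n,2], where n is the number of examples. If predictions has shape [n], it should contain the probability of the "positive" class. If it has shape [n,2], predictions[:,0] and predictions[:,1] should contain the probabilities of the "negative" and "positive" class, respectively. Labels should be a one-dimensional array of shape [n] containing the integers 0 and 1, or strings. If the labels are strings, label_classes must be provided with the "negative" class first. For integer labels, label_classes is optional and only used for display.

For multi-class classification tasks, predictions should contain the predicted probabilities and have shape [n,k], where n is the number of examples and k the number of classes. predictions[:,i] should contain the probability of the i-th class. Labels should be a one-dimensional array of integers or strings of shape [n]. The class names are given by label_classes, in the same order as the predictions. If the labels are integers, they should be in the range 0, .., num_classes - 1. If the labels are strings, label_classes must be provided. For integer labels, label_classes is optional and only used for display.

For regression tasks, predictions should contain the predicted values as a one-dimensional float array of shape [n], where n is the number of examples. Labels should also be a one-dimensional float array of shape [n].

For ranking tasks, predictions should contain the predicted values as a one-dimensional float array of shape [n], where n is the number of examples. Labels should also be a one-dimensional float array of shape [n]. The ranking groups should be an integer array of shape [n].

Uplift evaluations and anomaly detection evaluations are not supported.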
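
The binary-classification conventions above (shape [n] versus [n,2], and string labels ordered by label_classes with the negative class first) can be sketched with plain Python lists. The helpers `to_two_column` and `encode_labels` are hypothetical and only illustrate the layout; in practice the inputs would be NumPy arrays.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def to_two_column(
    predictions: Sequence[Union[Number, Sequence[Number]]]
) -> List[List[float]]:
    """Normalizes binary predictions of shape [n] or [n, 2] to shape [n, 2].

    Column 0 holds the negative-class probability, column 1 the positive one.
    """
    rows = []
    for p in predictions:
        if isinstance(p, (int, float)):
            # Shape [n]: `p` is the probability of the positive class.
            rows.append([1.0 - float(p), float(p)])
        else:
            if len(p) != 2:
                raise ValueError(f"Expected 2 probabilities, got {len(p)}")
            rows.append([float(p[0]), float(p[1])])
    return rows


def encode_labels(labels: Sequence[str], label_classes: Sequence[str]) -> List[int]:
    """Maps string labels to integers using the order of `label_classes`."""
    index = {name: i for i, name in enumerate(label_classes)}
    return [index[label] for label in labels]


flat = to_two_column([0.25, 0.75])
encoded = encode_labels(["yes", "no", "yes"], label_classes=["no", "yes"])
```

Here "no" is listed first in `label_classes`, so it plays the role of the negative class and maps to 0.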

Usage example

import numpy as np
from sklearn.linear_model import LogisticRegression
import ydf

X_train, X_test, y_train, y_test = ...  # Load data

model = LogisticRegression()
model.fit(X_train, y_train)
predictions: np.ndarray = model.predict_proba(X_test)
evaluation = ydf.evaluate.evaluate_predictions(
    predictions, y_test, ydf.Task.CLASSIFICATION
)
print(evaluation)
evaluation  # Prints an interactive report in IPython / Colab notebooks.

import numpy as np
import ydf

predictions = np.linspace(0, 1, 100)
labels = np.concatenate([np.ones(50), np.zeros(50)]).astype(float)
evaluation = ydf.evaluate.evaluate_predictions(
    predictions, labels, ydf.Task.REGRESSION
)
print(evaluation)
evaluation  # Prints an interactive report in IPython / Colab notebooks.

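Parameters

Name	Type	Description	Default
`predictions`	`ndarray`	Array of predictions to evaluate. The "task" argument defines the expected shape of the prediction array.	required
`labels`	`ndarray`	Label values. The "task" argument defines the expected shape of the label array.	required
`task`	`Task`	Task of the model.	required
`weights`	`Optional[ndarray]`	Weights of the examples, as a one-dimensional float array of shape [n]. If not provided, all examples have the same weight.	`None`
`label_classes`	`Optional[List[str]]`	Names of the labels. Only used for classification tasks.	`None`
`ranking_groups`	`Optional[ndarray]`	Ranking groups, as a one-dimensional integer array of shape [n]. Only used for ranking tasks.	`None`
`bootstrapping`	`Union[bool, int]`	Controls whether bootstrapping is used to evaluate confidence intervals and statistical tests (i.e. all the metrics ending with "[B]"). If set to false, bootstrapping is disabled. If set to true, bootstrapping is enabled and 2000 bootstrap samples are used. If set to an integer, it specifies the number of bootstrap samples to use. In this case, an error is raised if the number is less than 100, since bootstrapping would not yield useful results.	`False`
`ndcg_truncation`	`int`	Rank position at which the NDCG metric is truncated. Defaults to 5. Ignored for non-ranking models.	`5`
`mrr_truncation`	`int`	Rank position at which the MRR metric is truncated. Defaults to 5. Ignored for non-ranking models.	`5`
`random_seed`	`int`	Random seed used for sampling.	`1234`
`num_threads`	`Optional[int]`	Number of threads used to run the model.	`None`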
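
To show what truncating a ranking metric at a position means, the following sketch computes NDCG@k for a single ranking group. The gain `2**rel - 1` and the `log2` discount are a common convention assumed here for illustration; this text does not state which exact formula YDF uses, and `dcg_at_k` / `ndcg_at_k` are hypothetical helper names.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first `k` positions."""
    return sum(
        (2.0 ** rel - 1.0) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(
    predictions: Sequence[float], labels: Sequence[float], k: int = 5
) -> float:
    """NDCG@k for one ranking group; items beyond position k are ignored."""
    # Order the labels by decreasing predicted score.
    pairs = sorted(zip(predictions, labels), key=lambda pair: -pair[0])
    ranked = [label for _, label in pairs]
    ideal = sorted(labels, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg_at_k([0.9, 0.5, 0.1], [4, 2, 0], k=5)
reversed_order = ndcg_at_k([0.1, 0.5, 0.9], [4, 2, 0], k=5)
top1_only = ndcg_at_k([0.1, 0.5, 0.9], [4, 2, 0], k=1)
```

With `k=1` only the top-ranked item counts, so the reversed ranking, which puts the irrelevant item first, scores zero.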

Returns

Type	Description
`Evaluation`	Evaluation metrics.

start_worker ¶

start_worker(
    port: int, blocking: bool = True
) -> Optional[Callable[[], None]]

Starts a worker locally on the given port.

The addresses of the workers are passed to the learner with the workers argument.

Usage example

# On worker machine #0 at address 192.168.0.1
ydf.start_worker(9000)

# On worker machine #1 at address 192.168.0.2
ydf.start_worker(9000)

# On manager
learner = ydf.DistributedGradientBoostedTreesLearner(
    label="my_label",
    working_dir="/shared/working_dir",
    resume_training=True,
    workers=["192.168.0.1:9000", "192.168.0.2:9000"],
)
model = learner.train(dataset)

Example with a non-blocking call

# On worker machine
stop_worker = ydf.start_worker(9000, blocking=False)
# Do some work with the worker
stop_worker() # Stops the worker

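Parameters

Name	Type	Description	Default
`port`	`int`	TCP port of the worker.	required
`blocking`	`bool`	If true (default), the function blocks until the worker is stopped (e.g. error, interruption by the manager). If false, the function is non-blocking and returns a callable that, when called, stops the worker.	`True`

Returns

Type	Description
`Optional[Callable[[], None]]`	Callable to stop the worker. Only returned if `blocking=False`.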
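
The blocking / non-blocking contract described above (return nothing after the work ends, or return a callable that stops it) is a general pattern. The sketch below reproduces it with a background thread from the standard library; `start_background_task` is a hypothetical stand-in and says nothing about how YDF implements its workers.

```python
import threading
from typing import Callable, Optional


def start_background_task(blocking: bool = True) -> Optional[Callable[[], None]]:
    """Runs a task until stopped, mirroring the blocking / non-blocking contract."""
    stop_event = threading.Event()

    def run() -> None:
        # Stand-in for serving requests: wait until asked to stop.
        stop_event.wait()

    if blocking:
        run()  # Returns only once the task ends (here: never, without a stop).
        return None

    thread = threading.Thread(target=run, daemon=True)
    thread.start()

    def stop() -> None:
        stop_event.set()  # Ask the task to finish...
        thread.join()  # ...and wait until it has.

    return stop


stop_task = start_background_task(blocking=False)
# ... other work can happen here while the task runs ...
stop_task()  # Stops the task.
```

Returning the stop function, rather than exposing the thread, keeps the caller's side as small as in the `stop_worker()` example.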

strict ¶

strict(value: bool = True) -> None

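Sets the strict mode.

When strict mode is enabled, more warnings are displayed.

Parameters

Name	Type	Description	Default
`value`	`bool`	New value for the strict mode.	`True`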

ModelIOOptions `dataclass` ¶

ModelIOOptions(file_prefix: Optional[str] = None)

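Advanced options for saving and loading YDF models.

Attributes

Name	Type	Description
`file_prefix`	`Optional[str]`	Optional prefix for the model. File prefixes allow multiple models to exist in the same folder. Doing this is strongly discouraged outside of edge cases. When loading a model, the prefix is auto-detected where possible if none is specified. When saving a model, the empty string is used as file prefix unless one is explicitly specified.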

file_prefix `class-attribute` `instance-attribute` ¶

file_prefix: Optional[str] = None

create_vertical_dataset ¶

create_vertical_dataset(
    data: InputDataset,
    columns: ColumnDefs = None,
    include_all_columns: bool = False,
    max_vocab_count: int = 2000,
    min_vocab_frequency: int = 5,
    discretize_numerical_columns: bool = False,
    num_discretized_numerical_bins: int = 255,
    max_num_scanned_rows_to_infer_semantic: int = 100000,
    max_num_scanned_rows_to_compute_statistics: int = 100000,
    data_spec: Optional[DataSpecification] = None,
    required_columns: Optional[Sequence[str]] = None,
    dont_unroll_columns: Optional[Sequence[str]] = None,
    label: Optional[str] = None,
) -> VerticalDataset

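Creates a VerticalDataset from various sources of data.

The feature semantics are determined automatically and can be set explicitly with the columns argument. The semantics of a dataset (or model) are available in its data_spec.

Note that the CATEGORICAL_SET semantic is not inferred automatically when reading from file. When reading from a CSV file, setting the CATEGORICAL_SET semantic for a feature makes YDF tokenize that feature. When reading from an in-memory dataset (e.g. pandas), YDF only accepts lists of lists for CATEGORICAL_SET features.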
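
Since in-memory CATEGORICAL_SET features must be given as lists of lists, a text column has to be tokenized before it is passed in. The sketch below shows one way to prepare such a column; splitting on whitespace and the helper name `tokenize_column` are assumptions made for the example, not YDF's tokenizer.

```python
from typing import Dict, List, Sequence


def tokenize_column(texts: Sequence[str]) -> List[List[str]]:
    """Splits each text on whitespace, yielding one list of tokens per row."""
    return [text.split() for text in texts]


raw = {"review": ["great fast service", "slow", ""], "label": [1, 0, 0]}

# Copy the dictionary and replace the text column by its tokenized version.
prepared: Dict[str, list] = dict(raw)
prepared["review"] = tokenize_column(raw["review"])
```

The `review` column would then be declared with the CATEGORICAL_SET semantic through the `columns` argument.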

Usage example

import pandas as pd
import ydf

df = pd.read_csv("my_dataset.csv")

# Loads all the columns
ds = ydf.create_vertical_dataset(df)

# Only load columns "a" and "b". Ensure "b" is interpreted as a categorical
# feature.
ds = ydf.create_vertical_dataset(
    df,
    columns=[
        "a",
        ("b", ydf.Semantic.CATEGORICAL),
    ],
)

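Parameters

Name	Type	Description	Default
`data`	`InputDataset`	Source dataset. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or lists. If the data is already a VerticalDataset, it is returned unchanged.	required
`columns`	`ColumnDefs`	If None, all columns are imported and their semantics are determined automatically. Otherwise, if include_all_columns=False (default), only the columns listed in columns are imported. If include_all_columns=True, all columns are imported, and the semantic is determined automatically only for the columns not listed in columns. If specified, "columns" defines the order of the columns: any non-listed columns (if include_all_columns=True) are appended, in order, after the specified columns.	`None`
`include_all_columns`	`bool`	See `columns`.	`False`
`max_vocab_count`	`int`	Maximum size of the vocabulary of CATEGORICAL and CATEGORICAL_SET columns stored as strings. If more unique values exist, only the most frequent values are kept, and the remaining values are considered out-of-vocabulary. If max_vocab_count = -1, the number of values in the column is not limited (not recommended).	`2000`
`min_vocab_frequency`	`int`	Minimum number of occurrences of a value in CATEGORICAL and CATEGORICAL_SET columns. Values present fewer than `min_vocab_frequency` times are considered out-of-vocabulary.	`5`
`discretize_numerical_columns`	`bool`	If true, discretize all the numerical columns before training. Discretized numerical columns are faster to train with, but they can negatively impact model quality. Using `discretize_numerical_columns=True` is equivalent to setting the column semantic to DISCRETIZED_NUMERICAL in the `columns` argument. See the definition of DISCRETIZED_NUMERICAL for more details.	`False`
`num_discretized_numerical_bins`	`int`	Number of bins used when discretizing numerical columns.	`255`
`max_num_scanned_rows_to_infer_semantic`	`int`	Number of rows to scan when inferring a column's semantic if it is not explicitly specified. Only used when reading from file; in-memory datasets are always read in full. Setting this to a lower number speeds up dataset reading, but might result in incorrect column semantics. Set to -1 to scan the entire dataset.	`100000`
`max_num_scanned_rows_to_compute_statistics`	`int`	Number of rows to scan when computing a column's statistics. Only used when reading from file; in-memory datasets are always read in full. The statistics of a column include the dictionary for categorical features and the mean / min / max for numerical features. Setting this to a lower number speeds up dataset reading, but skews the statistics in the dataspec, which can hurt model quality (e.g. if an important category of a categorical feature is considered OOV). Set to -1 to scan the entire dataset.	`100000`
`data_spec`	`Optional[DataSpecification]`	Dataspec to use for this dataset. If a data spec is given, all other arguments except `data` and `required_columns` should not be provided.	`None`
`required_columns`	`Optional[Sequence[str]]`	List of columns required in the data. If None, all columns mentioned in the data spec or `columns` are required.	`None`
`dont_unroll_columns`	`Optional[Sequence[str]]`	List of columns that cannot be unrolled. If one such column needs to be unrolled, an error is raised.	`None`
`label`	`Optional[str]`	Name of the label column, if any.	`None`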
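
The interplay between `columns` and `include_all_columns` is easier to follow as code. The sketch below resolves which columns are imported, and in which order, from the rules in the table; `resolve_columns` is a hypothetical helper that works on names only and ignores semantics.

```python
from typing import List, Optional, Sequence


def resolve_columns(
    available: Sequence[str],
    columns: Optional[Sequence[str]] = None,
    include_all_columns: bool = False,
) -> List[str]:
    """Returns the imported column names, in order, under the documented rules."""
    if columns is None:
        return list(available)  # Import every column.
    selected = list(columns)  # Listed columns come first, in the given order.
    if include_all_columns:
        # Non-listed columns are appended after the listed ones, in order.
        selected += [name for name in available if name not in columns]
    return selected


all_columns = resolve_columns(["a", "b", "c"])
listed_only = resolve_columns(["a", "b", "c"], columns=["c", "a"])
listed_first = resolve_columns(
    ["a", "b", "c"], columns=["c", "a"], include_all_columns=True
)
```

The third call is the case where "b" is still imported, with its semantic inferred automatically, but placed after the explicitly listed columns.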

Returns

Type	Description
`VerticalDataset`	Dataset to be ingested by the learner algorithms.

Raises

Type	Description
`ValueError`	If the dataset has an unsupported type.

ModelMetadata `dataclass` ¶

ModelMetadata(
    owner: Optional[str] = None,
    created_date: Optional[int] = None,
    uid: Optional[int] = None,
    framework: Optional[str] = None,
)

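Metadata information stored in the model.

Attributes

Name	Type	Description
`owner`	`Optional[str]`	Owner of the model. Defaults to the empty string for the open-source build of YDF.
`created_date`	`Optional[int]`	Unix timestamp of the model training (in seconds).
`uid`	`Optional[int]`	Unique identifier of the model.
`framework`	`Optional[str]`	Framework used to create the model. Defaults to "Python YDF" for models trained with the Python API.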
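
Because `created_date` is a Unix timestamp in seconds, it needs a conversion before it is readable. A minimal sketch using only the standard library follows; `format_created_date` is a hypothetical helper and the timestamp is an arbitrary example value.

```python
from datetime import datetime, timezone


def format_created_date(created_date: int) -> str:
    """Formats a Unix timestamp in seconds as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_date, tz=timezone.utc).isoformat()


formatted = format_created_date(1700000000)  # Arbitrary example timestamp.
```

Passing an explicit UTC timezone avoids results that depend on the machine's local time settings.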

created_date `class-attribute` `instance-attribute` ¶

created_date: Optional[int] = None

framework `class-attribute` `instance-attribute` ¶

framework: Optional[str] = None

owner `class-attribute` `instance-attribute` ¶

owner: Optional[str] = None

uid `class-attribute` `instance-attribute` ¶

uid: Optional[int] = None

from_tensorflow_decision_forests ¶

from_tensorflow_decision_forests(
    directory: str,
) -> ModelType

Loads a TensorFlow Decision Forests model from disk.

Usage example

import pandas as pd
import ydf

# Import TF-DF model
loaded_model = ydf.from_tensorflow_decision_forests("/tmp/my_tfdf_model")

# Make predictions
dataset = pd.read_csv("my_dataset.csv")
loaded_model.predict(dataset)

# Show details about the model
loaded_model.describe()

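The imported model produces the same predictions as the original TF-DF model.

Only TensorFlow Decision Forests models containing a single decision forest and nothing else are supported. That is, combined neural network / decision forest models cannot be imported. Unfortunately, importing such a model may succeed but produce incorrect predictions, so check that the predictions match after importing.

Parameters

Name	Type	Description	Default
`directory`	`str`	Directory containing the TF-DF model.	required

Returns

Type	Description
`ModelType`	Model to use for inference, evaluation or inspection.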

from_sklearn ¶

from_sklearn(
    sklearn_model: Any,
    label_name: str = "label",
    feature_name: str = "features",
) -> GenericModel

Converts a tree-based scikit-learn model to a YDF model.

Usage example

import ydf
from sklearn import datasets
from sklearn import tree

# Train a SKLearn model
X, y = datasets.make_classification()
skl_model = tree.DecisionTreeClassifier().fit(X, y)

# Convert the SKLearn model to a YDF model
ydf_model = ydf.from_sklearn(skl_model)

# Make predictions with the YDF model
ydf_predictions = ydf_model.predict({"features": X})

# Analyse the YDF model
ydf_model.analyze({"features": X})

Currently supported models are:

sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor
sklearn.tree.ExtraTreeClassifier
sklearn.tree.ExtraTreeRegressor
sklearn.ensemble.RandomForestClassifier
sklearn.ensemble.RandomForestRegressor
sklearn.ensemble.ExtraTreesClassifier
sklearn.ensemble.ExtraTreesRegressor
sklearn.ensemble.GradientBoostingRegressor
sklearn.ensemble.IsolationForest

Unlike YDF, scikit-learn does not name features and labels. Use the fields label_name and feature_name to specify the names of the columns in the YDF model.

Additionally, only single-label classification and scalar regression are supported (e.g. multivariate regression models will not convert).

Parameters

Name	Type	Description	Default
`sklearn_model`	`Any`	Tree-based scikit-learn model to convert.	required
`label_name`	`str`	Name of the label in the output YDF model.	`'label'`
`feature_name`	`str`	Name of the multi-dimensional feature in the output YDF model.	`'features'`

Returns

Type	Description
`GenericModel`	A YDF model that emulates the provided scikit-learn model.

NodeFormat ¶

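Bases: Enum

Serialization format of a model.

Determines the storage format of the nodes.

Attributes

Name	Type	Description
`BLOB_SEQUENCE`		Default format for the public version of YDF.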

BLOB_SEQUENCE `class-attribute` `instance-attribute` ¶

BLOB_SEQUENCE = auto()

BLOB_SEQUENCE_GZIP `class-attribute` `instance-attribute` ¶

BLOB_SEQUENCE_GZIP = auto()

DataSpecification `module-attribute` ¶

DataSpecification = DataSpecification

TrainingConfig `module-attribute` ¶

TrainingConfig = TrainingConfig

RegressionLoss `dataclass` ¶

RegressionLoss(
    activation: Activation,
    initial_predictions: Callable[
        [NDArray[float32], NDArray[float32]], float32
    ],
    loss: Callable[
        [
            NDArray[float32],
            NDArray[float32],
            NDArray[float32],
        ],
        float32,
    ],
    gradient_and_hessian: Callable[
        [NDArray[float32], NDArray[float32]],
        Tuple[NDArray[float32], NDArray[float32]],
    ],
    may_trigger_gc: bool = True,
)

Bases: AbstractCustomLoss

A user-provided loss function for regression problems.

A loss function must not keep references to its arguments after it returns.

Bad example

mylabels = None
def initial_predictions(labels, weights):
  global mylabels
  mylabels = labels  # labels is now referenced outside the function

Good example

import numpy as np

mylabels = None
def initial_predictions(labels, weights):
  global mylabels
  mylabels = np.copy(labels)  # mylabels is a copy, not a reference.

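Attributes

Name	Type	Description
`initial_predictions`	`Callable[[NDArray[float32], NDArray[float32]], float32]`	The bias / initial predictions of the GBT model. Receives the label values and the weights, and outputs the initial prediction as a float.
`loss`	`Callable[[NDArray[float32], NDArray[float32], NDArray[float32]], float32]`	The loss function controls early stopping. It receives the labels, the current predictions and the current weights, and must output the loss as a float. Note that the predictions provided to the loss function have not yet had the activation function applied.
`gradient_and_hessian`	`Callable[[NDArray[float32], NDArray[float32]], Tuple[NDArray[float32], NDArray[float32]]]`	Gradient and Hessian of the current predictions. Note that only the diagonal of the Hessian must be provided. Receives the labels and the current predictions (without activation) and returns a tuple of the gradient and the Hessian.
`activation`	`Activation`	Activation function applied to the model. Regression models are expected to return a value in the same space as the labels after the activation function is applied.
`may_trigger_gc`	`bool`	If True (default), YDF may trigger Python's garbage collection to determine whether a NumPy array backed by YDF-internal data is used after its lifetime has ended. If False, checks for illegal memory accesses are disabled. This can be useful when training many small models or when the observed impact of triggering garbage collection is large. If `may_trigger_gc=False`, it is very important that the user manually validates that no memory leakage occurs.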
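
As a worked case for the three callables above, consider the per-example squared error with identity activation. The derivation below is standard calculus and is given only as an illustration; whether YDF expects the gradient with this sign, and how weights enter the loss, should be checked against the custom-loss tutorial rather than assumed from this sketch. Here $y_i$ is the label, $p_i$ the raw prediction, and $w_i$ the weight of example $i$.

```latex
\ell(y_i, p_i) = \tfrac{1}{2}\,(p_i - y_i)^2,
\qquad
g_i = \frac{\partial \ell}{\partial p_i} = p_i - y_i,
\qquad
h_i = \frac{\partial^2 \ell}{\partial p_i^2} = 1,
\qquad
p^{(0)} = \arg\min_{c} \sum_i w_i\,\ell(y_i, c) = \frac{\sum_i w_i\, y_i}{\sum_i w_i}
```

Under these assumptions, `gradient_and_hessian` would return the arrays $(g_i)$ and $(h_i)$, and `initial_predictions` the weighted mean $p^{(0)}$.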

activation `instance-attribute` ¶

activation: Activation

gradient_and_hessian `instance-attribute` ¶

gradient_and_hessian: Callable[
    [NDArray[float32], NDArray[float32]],
    Tuple[NDArray[float32], NDArray[float32]],
]

initial_predictions `instance-attribute` ¶

initial_predictions: Callable[
    [NDArray[float32], NDArray[float32]], float32
]

loss `instance-attribute` ¶

loss: Callable[
    [NDArray[float32], NDArray[float32], NDArray[float32]],
    float32,
]

may_trigger_gc `class-attribute` `instance-attribute` ¶

may_trigger_gc: bool = True

check_is_compatible_task ¶

check_is_compatible_task(task: Task) -> None

Raises an error if the given task is incompatible with this type of loss.

BinaryClassificationLoss `dataclass` ¶

BinaryClassificationLoss(
    activation: Activation,
    initial_predictions: Callable[
        [NDArray[int32], NDArray[float32]], float32
    ],
    loss: Callable[
        [
            NDArray[int32],
            NDArray[float32],
            NDArray[float32],
        ],
        float32,
    ],
    gradient_and_hessian: Callable[
        [NDArray[int32], NDArray[float32]],
        Tuple[NDArray[float32], NDArray[float32]],
    ],
    may_trigger_gc: bool = True,
)

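Bases: AbstractCustomLoss

A user-provided loss function for binary classification problems.

Note that the labels are binary but 1-based, i.e. the positive class is 2 and the negative class is 1.

A loss function must not keep references to its arguments after it returns.

Bad example

mylabels = None
def initial_predictions(labels, weights):
  global mylabels
  mylabels = labels  # labels is now referenced outside the function

Good example

import numpy as np

mylabels = None
def initial_predictions(labels, weights):
  global mylabels
  mylabels = np.copy(labels)  # mylabels is a copy, not a reference.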

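Attributes

Name	Type	Description
`initial_predictions`	`Callable[[NDArray[int32], NDArray[float32]], float32]`	The bias / initial predictions of the GBT model. Receives the label values and the weights, and outputs the initial prediction as a float.
`loss`	`Callable[[NDArray[int32], NDArray[float32], NDArray[float32]], float32]`	The loss function controls early stopping. It receives the labels, the current predictions and the current weights, and must output the loss as a float. Note that the predictions provided to the loss function have not yet had the activation function applied.
`gradient_and_hessian`	`Callable[[NDArray[int32], NDArray[float32]], Tuple[NDArray[float32], NDArray[float32]]]`	Gradient and Hessian of the current predictions. Note that only the diagonal of the Hessian must be provided. Receives the labels and the current predictions (without activation) and returns a tuple of the gradient and the Hessian.
`activation`	`Activation`	Activation function applied to the model. Binary classification models are expected to return a probability after the activation function is applied.
`may_trigger_gc`	`bool`	If True (default), YDF may trigger Python's garbage collection to determine whether a NumPy array backed by YDF-internal data is used after its lifetime has ended. If False, checks for illegal memory accesses are disabled. Setting this parameter to False is dangerous, since illegal memory accesses will no longer be detected.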
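
The two points that are easiest to get wrong here are the 1-based labels and the fact that predictions arrive before the activation. The sketch below computes the gradient and Hessian diagonal of the binomial log loss under a sigmoid activation, using plain Python lists so it runs without NumPy. It is an illustration only: real callables must accept and return NumPy arrays, the function names are hypothetical, and the sign convention YDF expects is not confirmed by this text.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def binomial_gradient_and_hessian(
    labels: Sequence[int], predictions: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Gradient and Hessian diagonal of the log loss w.r.t. raw predictions."""
    gradient, hessian = [], []
    for label, raw in zip(labels, predictions):
        target = label - 1  # 1-based labels: 1 -> 0 (negative), 2 -> 1 (positive).
        prob = sigmoid(raw)  # Predictions arrive before the activation.
        gradient.append(prob - target)
        hessian.append(prob * (1.0 - prob))
    return gradient, hessian


grad, hess = binomial_gradient_and_hessian([1, 2], [0.0, 0.0])
```

Forgetting the `label - 1` shift would treat the negative class as a target of 1 and the positive class as a target of 2, which silently produces wrong gradients.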

activation `instance-attribute` ¶

activation: Activation

gradient_and_hessian `instance-attribute` ¶

gradient_and_hessian: Callable[
    [NDArray[int32], NDArray[float32]],
    Tuple[NDArray[float32], NDArray[float32]],
]

initial_predictions `instance-attribute` ¶

initial_predictions: Callable[
    [NDArray[int32], NDArray[float32]], float32
]

loss `instance-attribute` ¶

loss: Callable[
    [NDArray[int32], NDArray[float32], NDArray[float32]],
    float32,
]

may_trigger_gc `class-attribute` `instance-attribute` ¶

may_trigger_gc: bool = True

check_is_compatible_task ¶

check_is_compatible_task(task: Task) -> None

Raises an error if the given task is incompatible with this type of loss.

MultiClassificationLoss `dataclass` ¶

MultiClassificationLoss(
    activation: Activation,
    initial_predictions: Callable[
        [NDArray[int32], NDArray[float32]], NDArray[float32]
    ],
    loss: Callable[
        [
            NDArray[int32],
            NDArray[float32],
            NDArray[float32],
        ],
        float32,
    ],
    gradient_and_hessian: Callable[
        [NDArray[int32], NDArray[float32]],
        Tuple[NDArray[float32], NDArray[float32]],
    ],
    may_trigger_gc: bool = True,
)

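Bases: AbstractCustomLoss

A user-provided loss function for multi-class classification problems.

Note that the labels are 1-based. Predictions are given as a two-dimensional array with one row per example. Initial predictions, gradient and Hessian are expected for each class; for example, a 3-class classification problem outputs 3 gradients and Hessians per example, one per class.

A loss function must not keep references to its arguments after it returns.

Bad example

mylabels = None
def initial_predictions(labels, weights):
  global mylabels
  mylabels = labels  # labels is now referenced outside the function

Good example

import numpy as np

mylabels = None
def initial_predictions(labels, weights):
  global mylabels
  mylabels = np.copy(labels)  # mylabels is a copy, not a reference.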

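Attributes

Name	Type	Description
`initial_predictions`	`Callable[[NDArray[int32], NDArray[float32]], NDArray[float32]]`	The bias / initial predictions of the GBT model. Receives the label values and the weights, and outputs the initial predictions as an array of floats (one initial prediction per class).
`loss`	`Callable[[NDArray[int32], NDArray[float32], NDArray[float32]], float32]`	The loss function controls early stopping. It receives the labels, the current predictions and the current weights, and must output the loss as a float. Note that the predictions provided to the loss function have not yet had the activation function applied.
`gradient_and_hessian`	`Callable[[NDArray[int32], NDArray[float32]], Tuple[NDArray[float32], NDArray[float32]]]`	Gradient and Hessian of the current predictions with respect to each class. Note that only the diagonal of the Hessian must be provided. Receives the labels and the current predictions (without activation) and returns a tuple of the gradient and the Hessian. Both gradient and Hessian must be arrays of shape (num_classes, num_examples).
`activation`	`Activation`	Activation function applied to the model. Multi-class classification models are expected to return a probability distribution over the classes after the activation function is applied.
`may_trigger_gc`	`bool`	If True (default), YDF may trigger Python's garbage collection to determine whether a NumPy array backed by YDF-internal data is used after its lifetime has ended. If False, checks for illegal memory accesses are disabled. Setting this parameter to False is dangerous, since illegal memory accesses will no longer be detected.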
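
The shape requirement is the main trap in the multi-class case: predictions come with one row per example, while gradient and Hessian must be shaped (num_classes, num_examples). The sketch below shows that transposition for a softmax cross-entropy loss, again with plain Python lists instead of NumPy arrays. The function names are hypothetical, and the loss and sign convention are assumptions made for the illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(row: Sequence[float]) -> List[float]:
    shift = max(row)  # Subtract the max for numerical stability.
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multinomial_gradient_and_hessian(
    labels: Sequence[int], predictions: Sequence[Sequence[float]]
) -> Tuple[List[List[float]], List[List[float]]]:
    """Returns gradient and Hessian diagonal, shaped (num_classes, num_examples)."""
    num_classes = len(predictions[0])
    gradient: List[List[float]] = [[] for _ in range(num_classes)]
    hessian: List[List[float]] = [[] for _ in range(num_classes)]
    for label, row in zip(labels, predictions):  # One row per example.
        probs = softmax(row)
        for k in range(num_classes):
            target = 1.0 if label - 1 == k else 0.0  # Labels are 1-based.
            gradient[k].append(probs[k] - target)
            hessian[k].append(probs[k] * (1.0 - probs[k]))
    return gradient, hessian


grad, hess = multinomial_gradient_and_hessian(
    [1, 3], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
)
```

With two examples and three classes, `grad` has three rows of two entries each, i.e. the outer index is the class and the inner index is the example.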

activation `instance-attribute` ¶

activation: Activation

gradient_and_hessian `instance-attribute` ¶

gradient_and_hessian: Callable[
    [NDArray[int32], NDArray[float32]],
    Tuple[NDArray[float32], NDArray[float32]],
]

initial_predictions `instance-attribute` ¶

initial_predictions: Callable[
    [NDArray[int32], NDArray[float32]], NDArray[float32]
]

loss `instance-attribute` ¶

loss: Callable[
    [NDArray[int32], NDArray[float32], NDArray[float32]],
    float32,
]

may_trigger_gc `class-attribute` `instance-attribute` ¶

may_trigger_gc: bool = True

check_is_compatible_task ¶

check_is_compatible_task(task: Task) -> None

Raises an error if the given task is incompatible with this type of loss.

Activation ¶

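Bases: Enum

Activation functions for custom losses.

Not all activation functions are supported for all custom losses. The activation function IDENTITY (i.e. no activation function applied) is always supported.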

IDENTITY `class-attribute` `instance-attribute` ¶

IDENTITY = 'IDENTITY'

SIGMOID `class-attribute` `instance-attribute` ¶

SIGMOID = 'SIGMOID'

SOFTMAX `class-attribute` `instance-attribute` ¶

SOFTMAX = 'SOFTMAX'
