RandomForestModel

RandomForestModel ¶

RandomForestModel(raw_model: GenericCCModel)

Bases: DecisionForestModel

A Random Forest model for prediction and inspection.

add_tree ¶

add_tree(tree: Tree) -> None

Adds a single tree to the model.

Parameters:

Name	Type	Description	Default
`tree`	`Tree`	New tree.	required

analyze ¶

analyze(
    data: InputDataset,
    sampling: float = 1.0,
    num_bins: int = 50,
    partial_dependence_plot: bool = True,
    conditional_expectation_plot: bool = True,
    permutation_variable_importance_rounds: int = 1,
    num_threads: Optional[int] = None,
    maximum_duration: Optional[float] = 20,
) -> Analysis

analyze_prediction ¶

analyze_prediction(
    single_example: InputDataset,
) -> PredictionAnalysis

benchmark ¶

benchmark(
    ds: InputDataset,
    benchmark_duration: float = 3,
    warmup_duration: float = 1,
    batch_size: int = 100,
    num_threads: Optional[int] = None,
) -> BenchmarkInferenceCCResult

data_spec ¶

data_spec() -> DataSpecification

describe ¶

describe(
    output_format: Literal[
        "auto", "text", "notebook", "html"
    ] = "auto",
    full_details: bool = False,
) -> Union[str, HtmlNotebookDisplay]

distance ¶

distance(
    data1: InputDataset,
    data2: Optional[InputDataset] = None,
) -> ndarray

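Computes the pairwise distance between the examples in "data1" and "data2".

If "data2" is not provided, computes the pairwise distance between the examples in "data1".

Usage example: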

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
distances = model.distance(test_ds, train_ds)
# "distances[i,j]" is the distance between the i-th test example and the
# j-th train example.

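Different models are free to implement distances with different definitions. For this reason, unless the model states otherwise, distances from different models are not comparable.

The distance is not guaranteed to satisfy the triangle inequality of metric distances.

Not all models can compute distances. In that case, this function raises an exception.

Parameters:

Name	Type	Description	Default
`data1`	`InputDataset`	Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset.	required
`data2`	`Optional[InputDataset]`	Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset.	`None`

Returns:

Type	Description
`ndarray`	Pairwise distances.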
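
To make the "distances[i,j]" indexing convention concrete, here is a minimal sketch of one common forest distance, the Breiman proximity: one minus the fraction of trees in which two examples reach the same leaf. Whether a given YDF model uses exactly this definition is an assumption. The helper `pairwise_leaf_distance` and its leaf-index inputs are hypothetical and not part of the YDF API.

```python
from typing import List, Optional


def pairwise_leaf_distance(
    leaves1: List[List[int]],
    leaves2: Optional[List[List[int]]] = None,
) -> List[List[float]]:
    """Hypothetical helper: 1 - fraction of trees where two examples share a leaf.

    leaves1[i][t] is the leaf reached by example i in tree t.
    If leaves2 is None, distances are computed among the rows of leaves1.
    """
    if leaves2 is None:
        leaves2 = leaves1
    distances = []
    for row1 in leaves1:
        out_row = []
        for row2 in leaves2:
            shared = sum(a == b for a, b in zip(row1, row2))
            out_row.append(1.0 - shared / len(row1))
        distances.append(out_row)
    return distances


# Two "test" examples and three "train" examples in a 4-tree forest.
test_leaves = [[0, 2, 1, 3], [1, 2, 0, 3]]
train_leaves = [[0, 2, 1, 3], [1, 1, 1, 1], [1, 2, 0, 0]]

# distances[i][j] compares the i-th test example with the j-th train example.
distances = pairwise_leaf_distance(test_leaves, train_leaves)

# Omitting the second argument compares the test examples with each other.
self_distances = pairwise_leaf_distance(test_leaves)
```

The result has one row per example of the first dataset and one column per example of the second, which mirrors the shape described for `model.distance`.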

evaluate ¶

evaluate(
    data: InputDataset,
    *,
    weighted: Optional[bool] = None,
    task: Optional[Task] = None,
    label: Optional[str] = None,
    group: Optional[str] = None,
    bootstrapping: Union[bool, int] = False,
    ndcg_truncation: int = 5,
    mrr_truncation: int = 5,
    evaluation_task: Optional[Task] = None,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> Evaluation

feature_selection_logs ¶

feature_selection_logs() -> Optional[FeatureSelectorLogs]

force_engine ¶

force_engine(engine_name: Optional[str]) -> None

get_all_trees ¶

get_all_trees() -> Sequence[Tree]

Returns all the trees in the model.

get_tree ¶

get_tree(tree_idx: int) -> Tree

Gets a single tree of the model.

Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, num_trees()).	required

Returns:

Type	Description
`Tree`	The tree.

hyperparameter_optimizer_logs ¶

hyperparameter_optimizer_logs() -> Optional[OptimizerLogs]

input_feature_names ¶

input_feature_names() -> List[str]

Returns the names of the input features.

The features are sorted in increasing order of column_idx.

input_features ¶

input_features() -> Sequence[InputFeature]

Returns the input features of the model.

The features are sorted in increasing order of column_idx.

input_features_col_idxs ¶

input_features_col_idxs() -> Sequence[int]

iter_trees ¶

iter_trees() -> Iterator[Tree]

Returns an iterator over all the trees in the model.

label ¶

label() -> str

Name of the label column.

label_classes ¶

label_classes() -> List[str]

Returns the label classes for a classification model; fails otherwise.

label_col_idx ¶

label_col_idx() -> int

list_compatible_engines ¶

list_compatible_engines() -> Sequence[str]

metadata ¶

metadata() -> ModelMetadata

name ¶

name() -> str

num_trees ¶

num_trees()

Returns the number of trees in the decision forest.

out_of_bag_evaluations ¶

out_of_bag_evaluations() -> Sequence[OutOfBagEvaluation]

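Returns the out-of-bag (OOB) evaluations of the model, if available.

Each tree in a random forest is trained on only a fraction of the training examples. An out-of-bag evaluation scores each training example using only the trees that did not see that example during training. This gives a self-evaluation method that does not require a separate validation dataset. See https://developers.google.com/machine-learning/decision-forests/out-of-bag for details.

Computing the OOB metrics slows down training and requires setting the hyperparameter compute_oob_performances. The learner then computes the OOB evaluation at regular intervals during training. The returned list of evaluations is sorted by number of trees, and its last element is the OOB evaluation of the full model.

If the OOB evaluations were not computed, an empty list is returned.

Usage example: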

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
learner = ydf.RandomForestLearner(label="label",
                                  compute_oob_performances=True)
model = learner.train(train_ds)

oob_evaluations = model.out_of_bag_evaluations()
# In an interactive Python environment, print a rich evaluation report.
oob_evaluations[-1].evaluation
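
The out-of-bag procedure described above can also be sketched in plain Python. This is an illustration of the idea only, not YDF's implementation: the "trees" are one-threshold stumps, and the names `train_stump` and `oob_accuracy` are hypothetical.

```python
import random
from collections import Counter


def train_stump(xs, ys):
    """Fits a one-threshold classifier predicting a majority label per side."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_label = Counter(left).most_common(1)[0][0] if left else ys[0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x < threshold else right_label


def oob_accuracy(xs, ys, num_trees, seed=0):
    """Returns (OOB accuracy, number of examples that received a vote)."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [Counter() for _ in range(n)]
    for _ in range(num_trees):
        # Bootstrap: sample n training examples with replacement.
        in_bag = [rng.randrange(n) for _ in range(n)]
        stump = train_stump([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        # Only the examples this tree did not see contribute a vote.
        for i in set(range(n)) - set(in_bag):
            votes[i][stump(xs[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    if not scored:
        raise ValueError("No example was out-of-bag; use more trees.")
    correct = sum(votes[i].most_common(1)[0][0] == ys[i] for i in scored)
    return correct / len(scored), len(scored)


xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.7, 2.0, 2.3, 2.8]
ys = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]
accuracy, num_scored = oob_accuracy(xs, ys, num_trees=50)
print(f"OOB accuracy over {num_scored} examples: {accuracy:.2f}")
```

No held-out data is involved: every prediction that is scored comes from trees whose bootstrap sample did not contain the example.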

plot_tree ¶

plot_tree(
    tree_idx: int = 0,
    max_depth: Optional[int] = None,
    options: Optional[PlotOptions] = None,
    d3js_url: str = "https://d3js.cn/d3.v6.min.js",
) -> TreePlot

Plots an interactive HTML rendering of the tree.

Usage example:

import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
    "c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
    "label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Plot the tree in Colab
model.plot_tree()

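Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, self.num_trees()).	`0`
`max_depth`	`Optional[int]`	Maximum tree depth of the plot. Set to None for full depth.	`None`
`options`	`Optional[PlotOptions]`	Advanced options for plotting. Set to None for the default style.	`None`
`d3js_url`	`str`	URL to load the d3.js library from.	`'https://d3js.cn/d3.v6.min.js'`

Returns:

Type	Description
`TreePlot`	In interactive environments, an interactive plot. The HTML source can also be exported to a file.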

predict ¶

predict(
    data: InputDataset,
    *,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> ndarray

predict_class ¶

predict_class(
    data: InputDataset,
    *,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> ndarray

Returns the most likely predicted class for a classification model.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
predictions = model.predict_class(test_ds)

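This method returns a NumPy array of strings of shape [num_examples]. Each value is the most likely class for the corresponding example. This method is only available for classification models.

In case of a tie, the first class in model.label_classes() is returned.

See model.predict to generate the full prediction probabilities.

Parameters:

Name	Type	Description	Default
`data`	`InputDataset`	Dataset. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or list. If the dataset contains the label column, that column is ignored.	required
`use_slow_engine`	`bool`	If true, uses the slow engine for making predictions. The slow engine of YDF is an order of magnitude slower than the other prediction engines. In rare edge cases, predictions with the regular engines may fail, for example for models with a very large number of categorical conditions. Only in those cases should users use the slow engine and report the issue to the YDF developers.	`False`
`num_threads`	`Optional[int]`	Number of threads used to run the model.	`None`

Returns:

Type	Description
`ndarray`	The most likely predicted class for each example.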
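
The tie-breaking rule can be illustrated without a model. The sketch below assumes per-class probabilities listed in the same order as model.label_classes(); the helper `most_likely_class` is hypothetical and not part of YDF.

```python
from typing import Sequence


def most_likely_class(
    probabilities: Sequence[float], label_classes: Sequence[str]
) -> str:
    """Returns the class with the highest probability; ties go to the first."""
    best_idx = 0
    for idx, probability in enumerate(probabilities):
        # The strict ">" keeps the earliest class among equal maxima.
        if probability > probabilities[best_idx]:
            best_idx = idx
    return label_classes[best_idx]


label_classes = ["a", "b", "c"]
clear_winner = most_likely_class([0.2, 0.5, 0.3], label_classes)
tie = most_likely_class([0.4, 0.4, 0.2], label_classes)
print(clear_winner, tie)  # b a
```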

predict_leaves ¶

predict_leaves(data: InputDataset) -> ndarray

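Gets the index of the active leaf in each tree.

The active leaf is the leaf that receives the example during inference.

The returned value "leaves[i,j]" is the index of the active leaf for the i-th example and the j-th tree. Leaves are indexed in depth-first order, with the negative child visited before the positive child.

Parameters:

Name	Type	Description	Default
`data`	`InputDataset`	Dataset. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or list. If the dataset contains the label column, that column is ignored.	required

Returns:

Type	Description
`ndarray`	Index of the active leaf for each tree in the model.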
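
The leaf numbering can be sketched on a toy tree. The classes `Leaf` and `Split` below are hypothetical stand-ins, not YDF's tree types, and the routing rule (the positive child is taken when the feature value is at or above the threshold) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: str


@dataclass
class Split:
    feature: str
    threshold: float
    neg_child: Union["Split", Leaf]  # taken when feature < threshold
    pos_child: Union["Split", Leaf]  # taken when feature >= threshold


def index_leaves(node, next_idx=0, out=None):
    """Assigns leaf indices depth-first, negative child before positive."""
    if out is None:
        out = {}
    if isinstance(node, Leaf):
        out[id(node)] = next_idx
        return out, next_idx + 1
    out, next_idx = index_leaves(node.neg_child, next_idx, out)
    out, next_idx = index_leaves(node.pos_child, next_idx, out)
    return out, next_idx


def active_leaf(node, example):
    """Routes an example from the root down to the leaf that receives it."""
    while isinstance(node, Split):
        goes_pos = example[node.feature] >= node.threshold
        node = node.pos_child if goes_pos else node.neg_child
    return node


tree = Split(
    "f1", 2.0,
    neg_child=Split("f2", 0.5, neg_child=Leaf("a"), pos_child=Leaf("b")),
    pos_child=Leaf("c"),
)
leaf_index, num_leaves = index_leaves(tree)  # a -> 0, b -> 1, c -> 2

example = {"f1": 1.0, "f2": 0.7}
active = leaf_index[id(active_leaf(tree, example))]
print(active)  # 1
```

Repeating this for every example and every tree gives a matrix laid out like the "leaves[i,j]" described above.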

print_tree ¶

print_tree(
    tree_idx: int = 0,
    max_depth: Optional[int] = 6,
    file: Any = stdout,
) -> None

Prints a tree in the terminal.

Usage example:

import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
    "c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
    "label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Print the tree
model.print_tree()

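Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, self.num_trees()).	`0`
`max_depth`	`Optional[int]`	Maximum tree depth to print. Set to None for full depth.	`6`
`file`	`Any`	Where to print the tree. By default, prints to the terminal standard output.	`stdout`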

remove_tree ¶

remove_tree(tree_idx: int) -> None

Removes a single tree of the model.

Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, num_trees()).	required

save ¶

save(
    path: str,
    advanced_options=ModelIOOptions(),
    *,
    pure_serving=False
) -> None

self_evaluation ¶

self_evaluation() -> Optional[Evaluation]

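Returns the model's self-evaluation.

For Random Forest models, the self-evaluation is the out-of-bag evaluation of the full model. Note that Random Forest models do not use a validation dataset. If out-of-bag evaluation is not enabled, no self-evaluation is computed.

Different models use different methods for self-evaluation. Notably, Gradient Boosted Trees use the evaluation on the validation dataset. Therefore, self-evaluations are not comparable between different model types.

Returns None if no self-evaluation has been computed.

Usage example: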

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
learner = ydf.RandomForestLearner(label="label",
                                  compute_oob_performances=True)
model = learner.train(train_ds)

self_evaluation = model.self_evaluation()
# In an interactive Python environment, print a rich evaluation report.
self_evaluation

serialize ¶

serialize() -> bytes

set_data_spec ¶

set_data_spec(data_spec: DataSpecification) -> None

set_feature_selection_logs ¶

set_feature_selection_logs(
    value: Optional[FeatureSelectorLogs],
) -> None

set_metadata ¶

set_metadata(metadata: ModelMetadata)

set_node_format ¶

set_node_format(node_format: NodeFormat) -> None

Sets the serialization format of the nodes.

Parameters:

Name	Type	Description	Default
`node_format`	`NodeFormat`	Node format to use when saving the model.	required

set_tree ¶

set_tree(tree_idx: int, tree: Tree) -> None

Overrides a single tree of the model.

Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, num_trees()).	required
`tree`	`Tree`	New tree.	required

task ¶

task() -> Task

to_cpp ¶

to_cpp(key: str = 'my_model') -> str

to_docker ¶

to_docker(path: str, exist_ok: bool = False) -> None

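Exports the model to a Docker endpoint deployable on the cloud.

This function creates a directory containing a Dockerfile, the model, and supporting files.

Usage example:

import numpy as np
import ydf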

# Train a model.
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100),
    "l": np.random.randint(2, size=100),
})

# Export the model to a Docker endpoint.
model.to_docker(path="/tmp/my_model")

# Print instructions on how to use the model
!cat /tmp/my_model/readme.md

# Test the end-point locally (run the following commands in a shell)
docker build --platform linux/amd64 -t ydf_predict_image /tmp/my_model
docker run --rm -p 8080:8080 -d ydf_predict_image

# Deploy the model on Google Cloud (shell command)
gcloud run deploy ydf-predict --source /tmp/my_model

# Check the automatically created utility scripts "test_locally.sh" and
# "deploy_in_google_cloud.sh" for more examples.

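Parameters:

Name	Type	Description	Default
`path`	`str`	Directory where the Docker endpoint is created.	required
`exist_ok`	`bool`	If false (default), fails if the directory already exists. If true, overwrites the existing content of the directory.	`False`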

to_jax_function ¶

to_jax_function(
    jit: bool = True,
    apply_activation: bool = True,
    leaves_as_params: bool = False,
    compatibility: Union[str, Compatibility] = "XLA",
) -> JaxModel

to_tensorflow_function ¶

to_tensorflow_function(
    temp_dir: Optional[str] = None,
    can_be_saved: bool = True,
    squeeze_binary_classification: bool = True,
    force: bool = False,
) -> Module

to_tensorflow_saved_model ¶

to_tensorflow_saved_model(
    path: str,
    input_model_signature_fn: Any = None,
    *,
    mode: Literal["keras", "tf"] = "keras",
    feature_dtypes: Dict[str, TFDType] = {},
    servo_api: bool = False,
    feed_example_proto: bool = False,
    pre_processing: Optional[Callable] = None,
    post_processing: Optional[Callable] = None,
    temp_dir: Optional[str] = None,
    tensor_specs: Optional[Dict[str, Any]] = None,
    feature_specs: Optional[Dict[str, Any]] = None,
    force: bool = False
) -> None

update_with_jax_params ¶

update_with_jax_params(params: Dict[str, Any])

variable_importances ¶

variable_importances() -> (
    Dict[str, List[Tuple[float, str]]]
)

winner_takes_all ¶

winner_takes_all() -> bool

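Returns whether the model uses a winner-takes-all strategy for classification.

This parameter determines how the votes of individual trees are aggregated during inference in a classification random forest. It is defined by the winner_take_all hyperparameter of the Random Forest learner.

If true, each tree votes for a single class, which is the traditional random forest inference method. If false, each tree outputs a probability distribution over all classes.

If the model is not a classification model, the return value of this function is arbitrary and does not influence model inference.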
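
The two aggregation modes can be compared with a short sketch. It assumes the forest output is the average over trees, and that a tree with tied class probabilities votes for the first class. Both are assumptions made for illustration, and `aggregate` is a hypothetical helper rather than a YDF function.

```python
from typing import List


def aggregate(
    tree_distributions: List[List[float]], winner_takes_all: bool
) -> List[float]:
    """tree_distributions[t][c] is tree t's probability for class c."""
    num_classes = len(tree_distributions[0])
    totals = [0.0] * num_classes
    for distribution in tree_distributions:
        if winner_takes_all:
            # Each tree casts a single vote for its top class.
            totals[distribution.index(max(distribution))] += 1.0
        else:
            # Each tree contributes its full probability distribution.
            for class_idx, probability in enumerate(distribution):
                totals[class_idx] += probability
    num_trees = len(tree_distributions)
    return [total / num_trees for total in totals]


# Two weakly confident trees favor class 0; one confident tree favors class 1.
trees = [[0.55, 0.45], [0.55, 0.45], [0.1, 0.9]]

voted = aggregate(trees, winner_takes_all=True)      # about [0.67, 0.33]
averaged = aggregate(trees, winner_takes_all=False)  # about [0.40, 0.60]
```

In this example the two modes disagree on the predicted class: voting discards each tree's confidence, while averaging lets the one confident tree outweigh the two weak ones.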