In [ ]
pip install ydf -U
In [ ]
import ydf
import numpy as np
What are multi-dimensional features?¶
A multi-dimensional feature is a model input with more than one dimension. For example, the multiple timestamps of an input time series, or the values of the different pixels of an image, are multi-dimensional features. They differ from single-dimensional features, which have only one dimension. Each dimension of a multi-dimensional feature is treated as an independent single-dimensional feature.
Multi-dimensional features are fed as multi-dimensional arrays, for example NumPy arrays or TensorFlow vectors. The next example is a toy example of feeding a multi-dimensional feature to a model.
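For intuition, here is a small illustrative sketch, not part of the original tutorial, of what this dictionary-of-arrays format looks like. The per-dimension names such as f1.0_of_4 are the ones that appear later in model.describe().

import numpy as np

# "f1" is multi-dimensional: 3 examples, each with 4 dimensions.
# "f2" is single-dimensional: 3 examples, one value each.
toy_data = {
    "f1": np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.6, 0.7, 0.8],
                    [0.9, 1.0, 1.1, 1.2]]),
    "f2": np.array([1.0, 2.0, 3.0]),
}

# Each column of "f1" is handled as an independent single-dimensional
# feature (named "f1.0_of_4" ... "f1.3_of_4" in the trained model).
print(toy_data["f1"].shape)  # (3, 4)
print(toy_data["f2"].shape)  # (3,)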
Creating a multi-dimensional dataset¶
The simplest way to create a multi-dimensional dataset is with a dictionary of multi-dimensional NumPy arrays.
In [ ]
def create_dataset(num_examples):
    # Generates random feature values.
    dataset = {
        # f1 is a multi-dimensional feature with 4 dimensions.
        "f1": np.random.uniform(size=(num_examples, 4)),
        # f2 is a single-dimensional feature.
        "f2": np.random.uniform(size=num_examples),
    }
    # Add a synthetic label.
    noise = np.random.uniform(size=num_examples)
    dataset["label"] = (
        np.sum(dataset["f1"], axis=1) + dataset["f2"] * 0.2 + noise
    ) >= 2.0
    return dataset

print("A dataset with 5 examples:")
create_dataset(num_examples=5)
A dataset with 5 examples:
Out[ ]
{'f1': array([[0.5373759 , 0.18098291, 0.74489824, 0.27706572],
       [0.4517745 , 0.37578001, 0.45156836, 0.05413219],
       [0.77036813, 0.1640734 , 0.47994649, 0.06315383],
       [0.44115416, 0.95749836, 0.80662146, 0.78114808],
       [0.40393628, 0.22786682, 0.32477702, 0.18309577]]),
 'f2': array([0.02058218, 0.94332705, 0.25678716, 0.02122367, 0.04498769]),
 'label': array([False,  True, False,  True, False])}
Training a model¶
Training a model on multi-dimensional features is similar to training a model on single-dimensional features.
In [ ]
train_ds = create_dataset(num_examples=10000)
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
Train model on 10000 examples
Model trained in 0:00:02.789326
Model understanding¶
Each dimension of a multi-dimensional feature is treated individually when interpreting a model. For example, each dimension is shown separately when describing the model.
In [ ]
model.describe()
Out[ ]
Name : GRADIENT_BOOSTED_TREES
Task : CLASSIFICATION
Label : label
Features (5) : f1.0_of_4 f1.1_of_4 f1.2_of_4 f1.3_of_4 f2
Weights : None
Trained with tuner : No
Model size : 767 kB
Number of records: 10000
Number of columns: 6

Number of columns by type:
	NUMERICAL: 5 (83.3333%)
	CATEGORICAL: 1 (16.6667%)

Columns:

NUMERICAL: 5 (83.3333%)
	1: "f1.0_of_4" NUMERICAL mean:0.49459 min:4.63251e-05 max:0.999917 sd:0.289597
	2: "f1.1_of_4" NUMERICAL mean:0.498703 min:5.8423e-06 max:0.999997 sd:0.289197
	3: "f1.2_of_4" NUMERICAL mean:0.498227 min:7.85791e-05 max:0.999943 sd:0.288629
	4: "f1.3_of_4" NUMERICAL mean:0.496773 min:9.6696e-05 max:0.99987 sd:0.28987
	5: "f2" NUMERICAL mean:0.504066 min:3.89178e-05 max:0.999976 sd:0.289052

CATEGORICAL: 1 (16.6667%)
	0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"true" 8140 (81.4%)

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION
Label: label

Loss (BINOMIAL_LOG_LIKELIHOOD): 0.48424
Accuracy: 0.88592  CI95[W][0 1]
ErrorRate: : 0.11408

Confusion Table:
truth\prediction  false  true
           false    114    65
            true     46   748
Total: 973
Variable importances measure the importance of an input feature for the model.
1. "f1.3_of_4" 0.381709 ################ 2. "f1.0_of_4" 0.364288 ############## 3. "f1.1_of_4" 0.347260 ############# 4. "f1.2_of_4" 0.310757 ########## 5. "f2" 0.187976
1. "f1.3_of_4" 20.000000 ################ 2. "f1.0_of_4" 16.000000 ######### 3. "f1.1_of_4" 10.000000 4. "f1.2_of_4" 10.000000
1. "f1.2_of_4" 368.000000 ################ 2. "f1.3_of_4" 367.000000 ############### 3. "f1.0_of_4" 340.000000 ############# 4. "f1.1_of_4" 318.000000 ############ 5. "f2" 164.000000
1. "f1.2_of_4" 1184.639552 ################ 2. "f1.1_of_4" 1180.490537 ############### 3. "f1.0_of_4" 1087.300719 ############## 4. "f1.3_of_4" 1061.770106 ############## 5. "f2" 124.009355
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing the model on a test dataset.
Number of trees : 56
Only printing the first tree.
Tree #0: "f1.3_of_4">=0.340523 [s:0.0125718 n:9027 np:5931 miss:1] ; pred:-1.59908e-08 ├─(pos)─ "f1.0_of_4">=0.355412 [s:0.00663164 n:5931 np:3784 miss:1] ; pred:0.0534567 | ├─(pos)─ "f1.1_of_4">=0.285883 [s:0.00223373 n:3784 np:2645 miss:1] ; pred:0.0939348 | | ├─(pos)─ "f1.2_of_4">=0.211332 [s:0.000410131 n:2645 np:2088 miss:1] ; pred:0.114401 | | | ├─(pos)─ "f1.3_of_4">=0.343045 [s:7.77674e-05 n:2088 np:2082 miss:1] ; pred:0.121303 | | | | ├─(pos)─ pred:0.121615 | | | | └─(neg)─ pred:0.0129024 | | | └─(neg)─ "f1.1_of_4">=0.404073 [s:0.00320274 n:557 np:479 miss:1] ; pred:0.0885265 | | | ├─(pos)─ pred:0.103596 | | | └─(neg)─ pred:-0.00401777 | | └─(neg)─ "f1.2_of_4">=0.186632 [s:0.0112601 n:1139 np:915 miss:1] ; pred:0.0464084 | | ├─(pos)─ "f1.2_of_4">=0.528798 [s:0.00319255 n:915 np:556 miss:0] ; pred:0.0810544 | | | ├─(pos)─ pred:0.111015 | | | └─(neg)─ pred:0.0346534 | | └─(neg)─ "f1.0_of_4">=0.702914 [s:0.0383716 n:224 np:96 miss:0] ; pred:-0.0951145 | | ├─(pos)─ pred:0.0541452 | | └─(neg)─ pred:-0.207059 | └─(neg)─ "f1.1_of_4">=0.412053 [s:0.0223679 n:2147 np:1253 miss:1] ; pred:-0.0178841 | ├─(pos)─ "f1.2_of_4">=0.204555 [s:0.0101297 n:1253 np:1010 miss:1] ; pred:0.065479 | | ├─(pos)─ "f1.2_of_4">=0.404707 [s:0.00241061 n:1010 np:772 miss:1] ; pred:0.0980558 | | | ├─(pos)─ pred:0.116045 | | | └─(neg)─ pred:0.0397044 | | └─(neg)─ "f1.3_of_4">=0.667282 [s:0.0338417 n:243 np:114 miss:0] ; pred:-0.0699227 | | ├─(pos)─ pred:0.0592101 | | └─(neg)─ pred:-0.18404 | └─(neg)─ "f1.2_of_4">=0.494196 [s:0.0422598 n:894 np:448 miss:1] ; pred:-0.134723 | ├─(pos)─ "f1.3_of_4">=0.561545 [s:0.0132409 n:448 np:285 miss:0] ; pred:0.000627708 | | ├─(pos)─ pred:0.0580524 | | └─(neg)─ pred:-0.0997774 | └─(neg)─ "f1.3_of_4">=0.702899 [s:0.0247338 n:446 np:213 miss:0] ; pred:-0.270681 | ├─(pos)─ pred:-0.162138 | └─(neg)─ pred:-0.369906 └─(neg)─ "f1.1_of_4">=0.456287 [s:0.0326619 n:3096 np:1725 miss:1] ; pred:-0.102407 ├─(pos)─ "f1.0_of_4">=0.465293 [s:0.0172008 n:1725 np:920 miss:1] ; pred:0.00391262 | ├─(pos)─ "f1.2_of_4">=0.150376 [s:0.00671146 n:920 np:781 miss:1] ; pred:0.0848681 | | ├─(pos)─ "f1.1_of_4">=0.675697 [s:0.000847067 n:781 np:480 miss:0] ; pred:0.107675 | | | ├─(pos)─ pred:0.122883 | | | └─(neg)─ pred:0.0834216 | | └─(neg)─ "f1.3_of_4">=0.0952032 [s:0.0266206 n:139 np:103 miss:1] ; pred:-0.0432749 | | ├─(pos)─ pred:0.0203768 | | └─(neg)─ pred:-0.225389 | └─(neg)─ "f1.2_of_4">=0.588246 [s:0.0368965 n:805 np:331 miss:0] ; pred:-0.0886079 | ├─(pos)─ "f1.2_of_4">=0.705104 [s:0.00581962 n:331 np:244 miss:0] ; pred:0.0630749 | | ├─(pos)─ pred:0.0931343 | | └─(neg)─ pred:-0.0212296 | └─(neg)─ "f1.1_of_4">=0.640417 [s:0.0161828 n:474 np:313 miss:0] ; pred:-0.19453 | ├─(pos)─ pred:-0.134324 | └─(neg)─ pred:-0.311575 └─(neg)─ "f1.2_of_4">=0.519391 [s:0.0405007 n:1371 np:637 miss:0] ; pred:-0.236179 ├─(pos)─ "f1.0_of_4">=0.316183 [s:0.0395178 n:637 np:418 miss:1] ; pred:-0.0936254 | ├─(pos)─ "f1.0_of_4">=0.686852 [s:0.0172247 n:418 np:186 miss:0] ; pred:0.00132543 | | ├─(pos)─ pred:0.0980488 | | └─(neg)─ pred:-0.0762201 | └─(neg)─ "f1.2_of_4">=0.893097 [s:0.03348 n:219 np:43 miss:0] ; pred:-0.274856 | ├─(pos)─ pred:-0.0305784 | └─(neg)─ pred:-0.334537 └─(neg)─ "f1.0_of_4">=0.667598 [s:0.0436245 n:734 np:222 miss:0] ; pred:-0.359894 ├─(pos)─ "f1.1_of_4">=0.19785 [s:0.0336715 n:222 np:119 miss:1] ; pred:-0.150583 | ├─(pos)─ pred:-0.0379291 | └─(neg)─ pred:-0.280736 └─(neg)─ "f1.0_of_4">=0.402493 [s:0.00914017 n:512 np:213 miss:1] ; pred:-0.45065 ├─(pos)─ pred:-0.375903 └─(neg)─ pred:-0.503897
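The variable importances and trees shown above can also be accessed programmatically. The following is a minimal sketch assuming the ydf model API exposes variable_importances() (a mapping from importance name to a ranked list of (value, feature name) pairs) and print_tree(); check the API reference if these names differ in your version.

# Sketch: access the training-time variable importances as a dictionary.
# Each per-dimension column (e.g. "f1.0_of_4") appears as its own entry.
for importance_name, ranking in model.variable_importances().items():
    print(importance_name)
    for value, feature_name in ranking:
        print(f"  {feature_name}: {value:.4f}")

# Sketch: print only the first tree of the model.
model.print_tree(tree_idx=0)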
Each dimension is also shown separately when analyzing the model and its predictions.
In [ ]
test_ds = create_dataset(num_examples=10000)
model.analyze(test_ds)
Out[ ]
Variable importances measure the importance of an input feature for the model.
1. "f1.1_of_4" 0.064800 ################ 2. "f1.3_of_4" 0.064300 ############### 3. "f1.2_of_4" 0.062700 ############### 4. "f1.0_of_4" 0.058700 ############## 5. "f2" 0.004000
1. "f1.0_of_4" 0.032397 ################ 2. "f1.3_of_4" 0.032241 ############### 3. "f1.1_of_4" 0.031047 ############### 4. "f1.2_of_4" 0.030587 ############### 5. "f2" 0.001307
1. "f1.3_of_4" 0.113808 ################ 2. "f1.0_of_4" 0.113546 ############### 3. "f1.1_of_4" 0.112715 ############### 4. "f1.2_of_4" 0.110428 ############### 5. "f2" 0.005334
1. "f1.0_of_4" 0.032394 ################ 2. "f1.3_of_4" 0.032237 ############### 3. "f1.1_of_4" 0.031045 ############### 4. "f1.2_of_4" 0.030584 ############### 5. "f2" 0.001307
1. "f1.3_of_4" 0.381709 ################ 2. "f1.0_of_4" 0.364288 ############## 3. "f1.1_of_4" 0.347260 ############# 4. "f1.2_of_4" 0.310757 ########## 5. "f2" 0.187976
1. "f1.3_of_4" 20.000000 ################ 2. "f1.0_of_4" 16.000000 ######### 3. "f1.1_of_4" 10.000000 4. "f1.2_of_4" 10.000000
1. "f1.2_of_4" 368.000000 ################ 2. "f1.3_of_4" 367.000000 ############### 3. "f1.0_of_4" 340.000000 ############# 4. "f1.1_of_4" 318.000000 ############ 5. "f2" 164.000000
1. "f1.2_of_4" 1184.639552 ################ 2. "f1.1_of_4" 1180.490537 ############### 3. "f1.0_of_4" 1087.300719 ############## 4. "f1.3_of_4" 1061.770106 ############## 5. "f2" 124.009355