In [ ]
pip install ydf -U
In [ ]
import ydf
import numpy as np
What are multi-dimensional features?¶
A multi-dimensional feature is a model input with more than one dimension. For example, the multiple timestamps of an input time series, or the values of the different pixels of an image, are multi-dimensional features. They differ from single-dimensional features, which have only one dimension. Each dimension of a multi-dimensional feature is treated as an independent single-dimensional feature.
Multi-dimensional features are fed as multi-dimensional arrays, for example NumPy arrays or TensorFlow vectors. The next example is a toy example of feeding a multi-dimensional feature to a model.
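For intuition, here is a small illustrative sketch, not part of the original tutorial, of what this dictionary-of-arrays format looks like. The per-dimension names such as f1.0_of_4 are the ones that appear later in model.describe().

import numpy as np

# "f1" is multi-dimensional: 3 examples, each with 4 dimensions.
# "f2" is single-dimensional: 3 examples, one value each.
toy_data = {
    "f1": np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.6, 0.7, 0.8],
                    [0.9, 1.0, 1.1, 1.2]]),
    "f2": np.array([1.0, 2.0, 3.0]),
}

# Each column of "f1" is handled as an independent single-dimensional
# feature (named "f1.0_of_4" ... "f1.3_of_4" in the trained model).
print(toy_data["f1"].shape)  # (3, 4)
print(toy_data["f2"].shape)  # (3,)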
Creating a multi-dimensional dataset¶
The simplest way to create a multi-dimensional dataset is with a dictionary of multi-dimensional NumPy arrays.
In [ ]
def create_dataset(num_examples):
    # Generates random feature values.
    dataset = {
        # f1 is a multi-dimensional feature with 4 dimensions.
        "f1": np.random.uniform(size=(num_examples, 4)),
        # f2 is a single-dimensional feature.
        "f2": np.random.uniform(size=num_examples),
    }
    # Add a synthetic label.
    noise = np.random.uniform(size=num_examples)
    dataset["label"] = (
        np.sum(dataset["f1"], axis=1) + dataset["f2"] * 0.2 + noise
    ) >= 2.0
    return dataset

print("A dataset with 5 examples:")
create_dataset(num_examples=5)
A dataset with 5 examples:
Out[ ]
{'f1': array([[0.5373759 , 0.18098291, 0.74489824, 0.27706572],
       [0.4517745 , 0.37578001, 0.45156836, 0.05413219],
       [0.77036813, 0.1640734 , 0.47994649, 0.06315383],
       [0.44115416, 0.95749836, 0.80662146, 0.78114808],
       [0.40393628, 0.22786682, 0.32477702, 0.18309577]]),
 'f2': array([0.02058218, 0.94332705, 0.25678716, 0.02122367, 0.04498769]),
 'label': array([False,  True, False,  True, False])}
Training a model¶
Training a model on multi-dimensional features is similar to training a model on single-dimensional features.
In [ ]
train_ds = create_dataset(num_examples=10000)
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
Train model on 10000 examples
Model trained in 0:00:02.789326
Model understanding¶
Each dimension of a multi-dimensional feature is treated individually when interpreting a model. For example, each dimension is shown separately when describing the model.
In [ ]
model.describe()
Out[ ]
Name : GRADIENT_BOOSTED_TREES
Task : CLASSIFICATION
Label : label
Features (5) : f1.0_of_4 f1.1_of_4 f1.2_of_4 f1.3_of_4 f2
Weights : None
Trained with tuner : No
Model size : 767 kB
Number of records: 10000
Number of columns: 6

Number of columns by type:
	NUMERICAL: 5 (83.3333%)
	CATEGORICAL: 1 (16.6667%)

Columns:

NUMERICAL: 5 (83.3333%)
	1: "f1.0_of_4" NUMERICAL mean:0.49459 min:4.63251e-05 max:0.999917 sd:0.289597
	2: "f1.1_of_4" NUMERICAL mean:0.498703 min:5.8423e-06 max:0.999997 sd:0.289197
	3: "f1.2_of_4" NUMERICAL mean:0.498227 min:7.85791e-05 max:0.999943 sd:0.288629
	4: "f1.3_of_4" NUMERICAL mean:0.496773 min:9.6696e-05 max:0.99987 sd:0.28987
	5: "f2" NUMERICAL mean:0.504066 min:3.89178e-05 max:0.999976 sd:0.289052

CATEGORICAL: 1 (16.6667%)
	0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"true" 8140 (81.4%)

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION
Label: label

Loss (BINOMIAL_LOG_LIKELIHOOD): 0.48424
Accuracy: 0.88592  CI95[W][0 1]
ErrorRate: : 0.11408

Confusion Table:
truth\prediction  false  true
           false    114    65
            true     46   748
Total: 973
Variable importances measure the importance of an input feature for the model.
1. "f1.3_of_4" 0.381709 ################ 2. "f1.0_of_4" 0.364288 ############## 3. "f1.1_of_4" 0.347260 ############# 4. "f1.2_of_4" 0.310757 ########## 5. "f2" 0.187976
1. "f1.3_of_4" 20.000000 ################ 2. "f1.0_of_4" 16.000000 ######### 3. "f1.1_of_4" 10.000000 4. "f1.2_of_4" 10.000000
1. "f1.2_of_4" 368.000000 ################ 2. "f1.3_of_4" 367.000000 ############### 3. "f1.0_of_4" 340.000000 ############# 4. "f1.1_of_4" 318.000000 ############ 5. "f2" 164.000000
1. "f1.2_of_4" 1184.639552 ################ 2. "f1.1_of_4" 1180.490537 ############### 3. "f1.0_of_4" 1087.300719 ############## 4. "f1.3_of_4" 1061.770106 ############## 5. "f2" 124.009355
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing the model on a test dataset.
Number of trees : 56
Only printing the first tree.
Tree #0: "f1.3_of_4">=0.340523 [s:0.0125718 n:9027 np:5931 miss:1] ; pred:-1.59908e-08 ├─(pos)─ "f1.0_of_4">=0.355412 [s:0.00663164 n:5931 np:3784 miss:1] ; pred:0.0534567 | ├─(pos)─ "f1.1_of_4">=0.285883 [s:0.00223373 n:3784 np:2645 miss:1] ; pred:0.0939348 | | ├─(pos)─ "f1.2_of_4">=0.211332 [s:0.000410131 n:2645 np:2088 miss:1] ; pred:0.114401 | | | ├─(pos)─ "f1.3_of_4">=0.343045 [s:7.77674e-05 n:2088 np:2082 miss:1] ; pred:0.121303 | | | | ├─(pos)─ pred:0.121615 | | | | └─(neg)─ pred:0.0129024 | | | └─(neg)─ "f1.1_of_4">=0.404073 [s:0.00320274 n:557 np:479 miss:1] ; pred:0.0885265 | | | ├─(pos)─ pred:0.103596 | | | └─(neg)─ pred:-0.00401777 | | └─(neg)─ "f1.2_of_4">=0.186632 [s:0.0112601 n:1139 np:915 miss:1] ; pred:0.0464084 | | ├─(pos)─ "f1.2_of_4">=0.528798 [s:0.00319255 n:915 np:556 miss:0] ; pred:0.0810544 | | | ├─(pos)─ pred:0.111015 | | | └─(neg)─ pred:0.0346534 | | └─(neg)─ "f1.0_of_4">=0.702914 [s:0.0383716 n:224 np:96 miss:0] ; pred:-0.0951145 | | ├─(pos)─ pred:0.0541452 | | └─(neg)─ pred:-0.207059 | └─(neg)─ "f1.1_of_4">=0.412053 [s:0.0223679 n:2147 np:1253 miss:1] ; pred:-0.0178841 | ├─(pos)─ "f1.2_of_4">=0.204555 [s:0.0101297 n:1253 np:1010 miss:1] ; pred:0.065479 | | ├─(pos)─ "f1.2_of_4">=0.404707 [s:0.00241061 n:1010 np:772 miss:1] ; pred:0.0980558 | | | ├─(pos)─ pred:0.116045 | | | └─(neg)─ pred:0.0397044 | | └─(neg)─ "f1.3_of_4">=0.667282 [s:0.0338417 n:243 np:114 miss:0] ; pred:-0.0699227 | | ├─(pos)─ pred:0.0592101 | | └─(neg)─ pred:-0.18404 | └─(neg)─ "f1.2_of_4">=0.494196 [s:0.0422598 n:894 np:448 miss:1] ; pred:-0.134723 | ├─(pos)─ "f1.3_of_4">=0.561545 [s:0.0132409 n:448 np:285 miss:0] ; pred:0.000627708 | | ├─(pos)─ pred:0.0580524 | | └─(neg)─ pred:-0.0997774 | └─(neg)─ "f1.3_of_4">=0.702899 [s:0.0247338 n:446 np:213 miss:0] ; pred:-0.270681 | ├─(pos)─ pred:-0.162138 | └─(neg)─ pred:-0.369906 └─(neg)─ "f1.1_of_4">=0.456287 [s:0.0326619 n:3096 np:1725 miss:1] ; pred:-0.102407 ├─(pos)─ "f1.0_of_4">=0.465293 [s:0.0172008 n:1725 np:920 miss:1] ; pred:0.00391262 | ├─(pos)─ "f1.2_of_4">=0.150376 [s:0.00671146 n:920 np:781 miss:1] ; pred:0.0848681 | | ├─(pos)─ "f1.1_of_4">=0.675697 [s:0.000847067 n:781 np:480 miss:0] ; pred:0.107675 | | | ├─(pos)─ pred:0.122883 | | | └─(neg)─ pred:0.0834216 | | └─(neg)─ "f1.3_of_4">=0.0952032 [s:0.0266206 n:139 np:103 miss:1] ; pred:-0.0432749 | | ├─(pos)─ pred:0.0203768 | | └─(neg)─ pred:-0.225389 | └─(neg)─ "f1.2_of_4">=0.588246 [s:0.0368965 n:805 np:331 miss:0] ; pred:-0.0886079 | ├─(pos)─ "f1.2_of_4">=0.705104 [s:0.00581962 n:331 np:244 miss:0] ; pred:0.0630749 | | ├─(pos)─ pred:0.0931343 | | └─(neg)─ pred:-0.0212296 | └─(neg)─ "f1.1_of_4">=0.640417 [s:0.0161828 n:474 np:313 miss:0] ; pred:-0.19453 | ├─(pos)─ pred:-0.134324 | └─(neg)─ pred:-0.311575 └─(neg)─ "f1.2_of_4">=0.519391 [s:0.0405007 n:1371 np:637 miss:0] ; pred:-0.236179 ├─(pos)─ "f1.0_of_4">=0.316183 [s:0.0395178 n:637 np:418 miss:1] ; pred:-0.0936254 | ├─(pos)─ "f1.0_of_4">=0.686852 [s:0.0172247 n:418 np:186 miss:0] ; pred:0.00132543 | | ├─(pos)─ pred:0.0980488 | | └─(neg)─ pred:-0.0762201 | └─(neg)─ "f1.2_of_4">=0.893097 [s:0.03348 n:219 np:43 miss:0] ; pred:-0.274856 | ├─(pos)─ pred:-0.0305784 | └─(neg)─ pred:-0.334537 └─(neg)─ "f1.0_of_4">=0.667598 [s:0.0436245 n:734 np:222 miss:0] ; pred:-0.359894 ├─(pos)─ "f1.1_of_4">=0.19785 [s:0.0336715 n:222 np:119 miss:1] ; pred:-0.150583 | ├─(pos)─ pred:-0.0379291 | └─(neg)─ pred:-0.280736 └─(neg)─ "f1.0_of_4">=0.402493 [s:0.00914017 n:512 np:213 miss:1] ; pred:-0.45065 ├─(pos)─ pred:-0.375903 └─(neg)─ pred:-0.503897
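The variable importances and trees shown above can also be accessed programmatically. The following is a minimal sketch assuming the ydf model API exposes variable_importances() (a mapping from importance name to a ranked list of (value, feature name) pairs) and print_tree(); check the API reference if these names differ in your version.

# Sketch: access the training-time variable importances as a dictionary.
# Each per-dimension column (e.g. "f1.0_of_4") appears as its own entry.
for importance_name, ranking in model.variable_importances().items():
    print(importance_name)
    for value, feature_name in ranking:
        print(f"  {feature_name}: {value:.4f}")

# Sketch: print only the first tree of the model.
model.print_tree(tree_idx=0)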
Each dimension is also shown separately when analyzing the model and its predictions.
In [ ]
test_ds = create_dataset(num_examples=10000)
model.analyze(test_ds)
Out[ ]
Variable importances measure the importance of an input feature for the model.
1. "f1.1_of_4" 0.064800 ################ 2. "f1.3_of_4" 0.064300 ############### 3. "f1.2_of_4" 0.062700 ############### 4. "f1.0_of_4" 0.058700 ############## 5. "f2" 0.004000
1. "f1.0_of_4" 0.032397 ################ 2. "f1.3_of_4" 0.032241 ############### 3. "f1.1_of_4" 0.031047 ############### 4. "f1.2_of_4" 0.030587 ############### 5. "f2" 0.001307
1. "f1.3_of_4" 0.113808 ################ 2. "f1.0_of_4" 0.113546 ############### 3. "f1.1_of_4" 0.112715 ############### 4. "f1.2_of_4" 0.110428 ############### 5. "f2" 0.005334
1. "f1.0_of_4" 0.032394 ################ 2. "f1.3_of_4" 0.032237 ############### 3. "f1.1_of_4" 0.031045 ############### 4. "f1.2_of_4" 0.030584 ############### 5. "f2" 0.001307
1. "f1.3_of_4" 0.381709 ################ 2. "f1.0_of_4" 0.364288 ############## 3. "f1.1_of_4" 0.347260 ############# 4. "f1.2_of_4" 0.310757 ########## 5. "f2" 0.187976
1. "f1.3_of_4" 20.000000 ################ 2. "f1.0_of_4" 16.000000 ######### 3. "f1.1_of_4" 10.000000 4. "f1.2_of_4" 10.000000
1. "f1.2_of_4" 368.000000 ################ 2. "f1.3_of_4" 367.000000 ############### 3. "f1.0_of_4" 340.000000 ############# 4. "f1.1_of_4" 318.000000 ############ 5. "f2" 164.000000
1. "f1.2_of_4" 1184.639552 ################ 2. "f1.1_of_4" 1180.490537 ############### 3. "f1.0_of_4" 1087.300719 ############## 4. "f1.3_of_4" 1061.770106 ############## 5. "f2" 124.009355