Anomaly detection

Setup

In [ ]

pip install ydf ucimlrepo scikit-learn umap-learn plotly -U -q

What is anomaly detection?

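Anomaly detection techniques are unsupervised learning algorithms that identify rare and unusual patterns in data that deviate significantly from the norm. For example, anomaly detection can be used for fraud detection, network intrusion detection, and fault diagnosis, without the need to define abnormal instances.

Anomaly detection with decision forests is a straightforward but effective technique for tabular data. The model assigns an anomaly score to each data point, ranging from 0 (normal) to 1 (abnormal). Decision forests also offer interpretability tools and properties that make it easier to understand and characterize the detected anomalies.

In anomaly detection, labeled examples are not used for training; they are used to evaluate the model. These labels make it possible to check that the model detects known anomalies.

We train and evaluate two anomaly detection models on the UCI Covertype dataset, which describes forest cover types and other geographic attributes of land cells. The first model is trained on spruce/fir and cottonwood/willow data. Because cottonwood/willow is much rarer than spruce/fir, the model can tell them apart without labels. We then interpret this first model and characterize what constitutes a spruce/fir cover type.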

In [1]

# Load libraries
import ydf  # For learning the anomaly detection model
import pandas as pd  # We use Pandas to load small datasets
from sklearn import metrics  # Use sklearn to compute AUC
from ucimlrepo import fetch_ucirepo  # To download the dataset
import matplotlib.pyplot as plt  # For plotting
import seaborn as sns  # For plotting
import umap  # For projecting distances in 2d

# For interactive plots
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.io as pio
pio.renderers.default="colab"

# Disable Pandas warnings
pd.options.mode.chained_assignment = None

Prepare the dataset

We download the Covertype dataset from UCI.

In [2]

# https://archive.ics.uci.edu/dataset/31/covertype
covertype_repo = fetch_ucirepo(id=31)
raw_dataset = pd.concat([covertype_repo.data.features, covertype_repo.data.targets], axis=1)

We select the columns of interest and clean the labels.

In [3]

dataset = raw_dataset.copy()

# Features of interest
features = ["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology",
            "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways",
            "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
            "Horizontal_Distance_To_Fire_Points"]
dataset = dataset[features + ["Cover_Type"]]

# Cover type as text
dataset["Cover_Type"] = dataset["Cover_Type"].map({
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz"
})

dataset.head()

Out[3]

	Elevation	Aspect	Slope	Horizontal_Distance_To_Hydrology	Vertical_Distance_To_Hydrology	Horizontal_Distance_To_Roadways	Hillshade_9am	Hillshade_Noon	Hillshade_3pm	Horizontal_Distance_To_Fire_Points	Cover_Type
0	2596	51	3	258	0	510	221	232	148	6279	Aspen
1	2590	56	2	212	-6	390	220	235	151	6225	Aspen
2	2804	139	9	268	65	3180	234	238	135	6121	Lodgepole Pine
3	2785	155	18	242	118	3090	238	238	122	6211	Lodgepole Pine
4	2595	45	2	153	-1	391	220	234	150	6172	Aspen

The first model is trained on a "filtered dataset" that contains only spruce/fir and cottonwood/willow examples.

In [4]

filtered_dataset = dataset[dataset["Cover_Type"].isin(["Spruce/Fir", "Cottonwood/Willow"])]

As you can see, the spruce/fir cover is much more common than the cottonwood/willow cover.

In [5]

filtered_dataset["Cover_Type"].value_counts()

Out[5]

Cover_Type
Spruce/Fir           211840
Cottonwood/Willow      2747
Name: count, dtype: int64
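
To put the imbalance in numbers, the short sketch below recomputes the class proportions from the counts printed above. The counts are copied by hand from that output; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {"Spruce/Fir": 211840, "Cottonwood/Willow": 2747}

total = sum(counts.values())
minority_fraction = counts["Cottonwood/Willow"] / total
imbalance_ratio = counts["Spruce/Fir"] / counts["Cottonwood/Willow"]

print(f"total examples: {total}")
print(f"cottonwood/willow fraction: {minority_fraction:.2%}")
print(f"spruce/fir examples per cottonwood/willow example: {imbalance_ratio:.1f}")
```

The cottonwood/willow class is a small minority, which is the setting in which treating it as the "anomaly" is reasonable.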

Anomaly detection model

We train a popular anomaly detection decision forest algorithm called isolation forest.

In [6]

model = ydf.IsolationForestLearner(features=features).train(filtered_dataset)

Train model on 214587 examples
Model trained in 0:00:00.074241
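
For intuition about why an isolation forest works without labels, here is a minimal one-dimensional sketch of the underlying idea: an unusual value tends to be separated from the rest of the data by fewer random splits than a typical value. This is a simplified illustration, not YDF's implementation. The helper names (`isolation_depth`, `mean_isolation_depth`) and the synthetic data are assumptions made for this example, and a real isolation forest also subsamples examples and picks among several features.

```python
import random
import statistics


def isolation_depth(values, target, rng):
    """Counts the random splits needed to separate `target` from all other values."""
    depth = 0
    remaining = list(values)
    while len(remaining) > 1 and min(remaining) < max(remaining):
        split = rng.uniform(min(remaining), max(remaining))
        # Keep only the values that fall on the same side of the split as the target.
        remaining = [v for v in remaining if (v < split) == (target < split)]
        depth += 1
    return depth


def mean_isolation_depth(values, target, num_trees=200, seed=0):
    """Averages the isolation depth of `target` over several random "trees"."""
    rng = random.Random(seed)
    return statistics.mean(
        isolation_depth(values, target, rng) for _ in range(num_trees)
    )


# Synthetic data: 50 typical values around 0 and one value far away.
data_rng = random.Random(42)
inliers = [data_rng.gauss(0.0, 1.0) for _ in range(50)]
outlier = 10.0
values = inliers + [outlier]
typical = statistics.median_low(inliers)  # An actual data point near the center.

outlier_depth = mean_isolation_depth(values, outlier)
typical_depth = mean_isolation_depth(values, typical)
print(f"mean depth of the outlier: {outlier_depth:.2f}")
print(f"mean depth of a typical value: {typical_depth:.2f}")
```

The outlier is isolated after fewer splits on average, and this shorter path is what an isolation forest turns into a higher anomaly score.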

We can then generate "predictions", that is, anomaly scores.

In [7]

predictions = model.predict(filtered_dataset)
predictions[:5]

Out[7]

array([0.57844853, 0.609949  , 0.5433627 , 0.6099571 , 0.48067462],
      dtype=float32)
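
As a rough guide to reading these values, the sketch below implements the score formula from the original isolation forest formulation, s = 2^(-E[h] / c(n)), where E[h] is the mean depth at which an example is isolated and c(n) is the expected depth for n training examples per tree. Whether YDF computes its scores in exactly this way is an assumption here, and the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected depth c(n) of an unsuccessful search in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # Approximation of H(n - 1).
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, num_examples):
    """Maps a mean isolation depth to a score in (0, 1]; shorter paths give higher scores."""
    return 2.0 ** (-mean_depth / average_path_length(num_examples))


n = 256  # Hypothetical number of examples used to grow each tree.
for depth in (1.0, average_path_length(n), 20.0):
    print(f"mean depth {depth:5.2f} -> score {anomaly_score(depth, n):.3f}")
```

Under this formula, an example isolated at the expected depth gets a score of 0.5, so scores noticeably above 0.5 point to examples that are easier to isolate than average.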

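Next, we plot the distribution of the model's anomaly scores for the spruce/fir and cottonwood/willow cover types. The two distributions are "separated", which indicates that the model can differentiate the two cover types.

Note: Because the cottonwood/willow cover is much less frequent, the two distributions are normalized separately. Otherwise, the cottonwood/willow distribution would look flat.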

In [8]

sns.kdeplot(predictions[filtered_dataset["Cover_Type"] == "Spruce/Fir"], label="Spruce/Fir")
sns.kdeplot(predictions[filtered_dataset["Cover_Type"] == "Cottonwood/Willow"], label="Cottonwood/Willow")
plt.xlabel("predicted anomaly score")
plt.ylabel("distribution")
plt.legend()
None

[Figure: kernel density estimates of the predicted anomaly score for the Spruce/Fir and Cottonwood/Willow cover types.]
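
The effect of normalizing each class separately can be checked numerically. The sketch below uses synthetic scores (not the model's predictions) and a plain histogram instead of a kernel density estimate; the `density` helper and the chosen distributions are assumptions made for illustration.

```python
import random


def density(scores, normalizer, num_bins=10):
    """Histogram of scores in [0, 1), scaled so bar areas sum to len(scores) / normalizer."""
    width = 1.0 / num_bins
    counts = [0] * num_bins
    for score in scores:
        counts[min(int(score / width), num_bins - 1)] += 1
    return [count / (normalizer * width) for count in counts], width


rng = random.Random(0)
# Synthetic anomaly scores: a large "normal" class and a small "abnormal" class.
majority = [min(max(rng.gauss(0.45, 0.05), 0.0), 0.999) for _ in range(10000)]
minority = [min(max(rng.gauss(0.60, 0.05), 0.0), 0.999) for _ in range(100)]
total = len(majority) + len(minority)

# Separate normalization: each class integrates to 1 on its own.
separate_minority, width = density(minority, len(minority))
# Joint normalization: both classes share one total area of 1.
joint_minority, _ = density(minority, total)
joint_majority, _ = density(majority, total)

print(f"peak of minority, separate normalization: {max(separate_minority):.2f}")
print(f"peak of minority, joint normalization:    {max(joint_minority):.2f}")
print(f"peak of majority, joint normalization:    {max(joint_majority):.2f}")
```

With joint normalization the minority curve is tiny compared with the majority curve, which is why the plot above draws each class with its own normalization.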

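AUC is a metric used to evaluate classification models. It can also quantify how well any signal separates two classes. In the context of anomaly detection, we can use the AUC to quantify how well our anomaly detection model isolates the minority class.

The cover type information was not used to train the model, and the dataset is assumed to be static (that is, cover types do not change over time). Therefore, we do not need to split the dataset into training and test sets; we use all the data both to train the model and to compute the AUC.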

In [9]

metrics.roc_auc_score(filtered_dataset["Cover_Type"] == "Cottonwood/Willow", predictions)

Out[9]

0.9427246186652949

This high AUC confirms that the model separates the two cover types well.
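
One way to read an AUC value: it equals the probability that a randomly chosen positive example (here, a cottonwood/willow cell) receives a higher score than a randomly chosen negative example, with ties counted as one half. The sketch below computes this directly on a handful of made-up scores; `pairwise_auc` is a hypothetical helper written for this illustration, and its quadratic cost makes it unsuitable for the full dataset.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # Ties count as half a win.
    return wins / (len(positives) * len(negatives))


# Made-up example: True marks the rare class.
labels = [False, False, False, True, True]
scores = [0.48, 0.54, 0.61, 0.58, 0.70]

auc = pairwise_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

Here five of the six positive/negative pairs are ordered correctly, so the AUC is 5/6. An AUC of 0.5 would correspond to scores that carry no information about the class.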

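We can also analyze the model to understand it. For example, the partial dependence plot for elevation shows that the "normal" cover type occurs at elevations between roughly 2900 and 3300 meters. Similar conclusions can be drawn from the other attributes.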

In [10]

model.analyze(filtered_dataset, sampling=0.001) # Use larger sampling for better results

Out[10]
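
The partial dependence plot mentioned above can be understood with a small sketch: for each candidate value of one feature, overwrite that feature in every example, score the modified examples, and average the scores. The code below is a generic illustration of that procedure, not YDF's implementation; `toy_score` is a made-up stand-in for the trained model, and only the three example rows are taken from the table shown earlier.

```python
def partial_dependence(predict, examples, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        # Override the feature for every example; keep the other features unchanged.
        modified = [{**example, feature: value} for example in examples]
        predictions = [predict(example) for example in modified]
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_score(example):
    """Hypothetical anomaly score: grows as elevation moves away from 3100 m."""
    return min(1.0, abs(example["Elevation"] - 3100) / 2000 + example["Slope"] / 200)


# Elevation and slope of rows 0, 2 and 3 of the dataset preview.
examples = [
    {"Elevation": 2596, "Slope": 3},
    {"Elevation": 2804, "Slope": 9},
    {"Elevation": 2785, "Slope": 18},
]
grid = [2000, 2500, 3100, 3500]
curve = partial_dependence(toy_score, examples, "Elevation", grid)

for value, score in zip(grid, curve):
    print(f"Elevation={value}: mean score {score:.3f}")
```

Elevations where the curve is lowest are the ones the scoring function treats as most "normal", which is how the elevation range quoted above is read from the real plot.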

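We can also explain individual model predictions. For example, we select the first cottonwood/willow example and generate a prediction for it.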

In [11]

first_willow_example = filtered_dataset[filtered_dataset["Cover_Type"] == "Cottonwood/Willow"][:1]
first_willow_example

Out[11]

	Elevation	Aspect	Slope	Horizontal_Distance_To_Hydrology	Vertical_Distance_To_Hydrology	Horizontal_Distance_To_Roadways	Hillshade_9am	Hillshade_Noon	Hillshade_3pm	Horizontal_Distance_To_Fire_Points	Cover_Type
1988	2000	318	7	30	4	108	201	234	172	268	Cottonwood/Willow

In [12]

model.predict(first_willow_example)

Out[12]

array([0.5474113], dtype=float32)

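Now, let's see how the model's prediction changes as the feature values of this example vary.

We see that the example's elevation of 2000 meters is uncommon, which partly explains the relatively high prediction. On the other hand, the example's "Aspect" and "Slope" values are relatively normal.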

In [13]

model.analyze_prediction(first_willow_example)

Out[13]

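Like all decision forest algorithms, our isolation forest model defines an implicit distance between examples. This distance can be used to cluster examples or to build interpretable maps.

Let's compute the distance between each pair of examples. To make the code run faster, we only select the first 10000 examples.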

In [14]

distances = model.distance(filtered_dataset[:10000]) # Use more examples for better results
distances[:4, :4]

Out[14]

array([[0.       , 0.86     , 0.6766667, 0.85     ],
       [0.86     , 0.       , 0.9066667, 0.31     ],
       [0.6766667, 0.9066667, 0.       , 0.8833333],
       [0.85     , 0.31     , 0.8833333, 0.       ]], dtype=float32)
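
A common way for a forest to define such a distance is by leaf co-occurrence: two examples are close if many trees route them to the same leaf. The sketch below assumes this definition (one minus the fraction of trees in which both examples share a leaf); the exact definition used by `model.distance` is not stated here and may differ. The leaf indices are made up.

```python
def forest_distance(leaves_a, leaves_b):
    """1 - fraction of trees in which the two examples reach the same leaf."""
    same_leaf = sum(a == b for a, b in zip(leaves_a, leaves_b))
    return 1.0 - same_leaf / len(leaves_a)


# Hypothetical leaf index reached by each example in each of 4 trees.
leaf_ids = [
    [0, 1, 2, 0],  # Example 0.
    [0, 1, 2, 0],  # Example 1: same leaves as example 0 in every tree.
    [0, 3, 1, 2],  # Example 2: shares a leaf with example 0 in one tree only.
]

distance_matrix = [[forest_distance(a, b) for b in leaf_ids] for a in leaf_ids]
for row in distance_matrix:
    print(row)
```

Under this definition the matrix is symmetric with zeros on the diagonal, as in the output above. The values in that output are all multiples of 1/300, which is consistent with an average taken over 300 trees.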

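We can then use UMAP (or any other manifold learning algorithm, such as t-SNE) to project the examples onto a 2D plot.

Note that the cover types are well separated even though the model never saw the cover type labels.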

In [15]

manifold = umap.UMAP(n_components=2, n_neighbors=10, metric="precomputed").fit_transform(distances)
sns.scatterplot(x=manifold[:, 0],
                y=manifold[:, 1],
                hue=filtered_dataset["Cover_Type"][:manifold.shape[0]])
plt.legend()

/usr/local/google/home/gbm/my_venv/lib/python3.11/site-packages/umap/umap_.py:1858: UserWarning:

using precomputed metric; inverse_transform will be unavailable

Out[15]

<matplotlib.legend.Legend at 0x7faa84413490>