KMeans

# Introduction

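KMeans is a clustering method that measures how far apart two points are with a distance metric such as Euclidean or Manhattan distance, and uses those distances to search for the center point (centroid) of each cluster. Note that scikit-learn's `KMeans`, used below, works with squared Euclidean distance.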
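To make the "search for cluster centers" step concrete, here is a minimal NumPy sketch of the usual alternating procedure: assign each point to its nearest center, then move each center to the mean of its points. This is an illustration under assumptions, not scikit-learn's implementation: the helper names `assign` and `kmeans_sketch`, the random initialization, and the toy data are all made up for this example.

```python
import numpy as np

def assign(points, centers):
    """Hypothetical helper: index of the nearest center for every point."""
    # Squared Euclidean distance from every point to every center, shape (n, k)
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Hypothetical helper: alternate assignment and center updates."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct samples picked at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centers)
        # Move each center to the mean of its points (keep it if its cluster is empty)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    # Final labels are computed against the final centers
    return centers, assign(points, centers)

# Toy data: two well-separated groups of three points each
demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans_sketch(demo, k=2)
print(centers)
print(labels)
```

On real data the result depends on the initial centers, which is one reason library implementations run several initializations and keep the best one.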

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Create a dataset of clustered points with make_blobs
x,y = make_blobs(n_samples=500,n_features=2,centers=4,random_state=1)
fig,ax1 = plt.subplots(1)
ax1.scatter(x[:,0],x[:,1],marker='o',s=8)
plt.show()

# Clustering
from sklearn.cluster import KMeans
n_clusters = 3
cluster = KMeans(n_clusters=n_clusters,random_state=0).fit(x)
# Key attribute labels_: the cluster index assigned to each sample after fitting
y_pred = cluster.labels_

# Plot each cluster in its own color
color = ['red','blue','green','gray']

fig, ax = plt.subplots(1)
for i in range(n_clusters):
    ax.scatter(x[y_pred==i,0],x[y_pred==i,1],marker='o',s=8,c=color[i])
plt.show()

# Model Evaluation

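KMeans is only suited to data that forms separate, blob-shaped clusters. A fitted KMeans object has an `inertia_` attribute that can be used to judge the model, and smaller is better. Inertia is the sum of squared distances from each sample to its nearest centroid. It shrinks as `n_clusters` grows, and when `n_clusters` equals the number of samples, inertia is 0. So inertia cannot be the only evaluation criterion.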
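A small check of that definition, as a sketch rather than a reference: it recomputes inertia by hand from `cluster_centers_` and `labels_`, then fits as many clusters as samples on a 10-point subset. The dataset is rebuilt with the same `make_blobs` call as above; `n_init=10` and the subset size are choices made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x)

# Inertia by hand: squared Euclidean distance from each sample to its own center, summed
manual_inertia = ((x - model.cluster_centers_[model.labels_]) ** 2).sum()
print(manual_inertia, model.inertia_)  # the two values should agree up to rounding

# Degenerate case: as many clusters as samples, so every sample can be its own center
tiny = x[:10]
tiny_model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(tiny)
print(tiny_model.inertia_)  # expected to be 0 up to floating-point error
```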

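We can evaluate a clustering model with the silhouette coefficient instead. It ranges from -1 to 1, and the closer it is to 1, the better.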
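For one sample, the silhouette coefficient is `(b - a) / max(a, b)`, where `a` is the mean distance to the other samples in its own cluster and `b` is the mean distance to the samples of the nearest other cluster. The sketch below computes this by hand for a single sample and compares it with `silhouette_samples`. The helper `silhouette_of` is hypothetical and assumes Euclidean distance and no single-sample clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x).labels_

def silhouette_of(i, x, labels):
    """Hypothetical helper: silhouette coefficient of sample i, computed by hand."""
    dist = np.linalg.norm(x - x[i], axis=1)  # Euclidean distance from sample i to all samples
    own = labels == labels[i]
    # a: mean distance to the other members of the same cluster (sample i itself excluded)
    a = dist[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(dist[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

per_sample = silhouette_samples(x, labels)
print(silhouette_of(0, x, labels), per_sample[0])     # hand-computed vs library value
print(per_sample.mean(), silhouette_score(x, labels))  # the score is the mean over samples
```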

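from sklearn.metrics import silhouette_samples,silhouette_score
# Refit `cluster` with the stated n_clusters before each call; the values are the scores reported for this dataset
silhouette_score(x,cluster.labels_) # n_clusters = 3: silhouette coefficient = 0.5882004012129721
silhouette_score(x,cluster.labels_) # n_clusters = 4: silhouette coefficient = 0.6505186632729437
silhouette_score(x,cluster.labels_) # n_clusters = 5: silhouette coefficient = 0.5746932321727457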

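As the example above shows, the silhouette coefficient does not keep "improving" as `n_clusters` goes up the way inertia does: with `n_clusters = 5` the score is lower than with `n_clusters = 3`, and well below the peak at `n_clusters = 4`.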
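One way to see the contrast side by side is to refit the model for each candidate `n_clusters` and record both metrics. This is a sketch under assumptions: it rebuilds the same `make_blobs` dataset, and the candidate values (3, 4, 5), `n_init=10`, and the variable names are choices made for this example. Exact scores may vary slightly between scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

inertias = {}
silhouettes = {}
for k in (3, 4, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    inertias[k] = model.inertia_                       # keeps shrinking as k grows
    silhouettes[k] = silhouette_score(x, model.labels_)  # peaks, then drops
    print(k, inertias[k], silhouettes[k])

# Among these candidates, pick the k with the highest silhouette coefficient
best_k = max(silhouettes, key=silhouettes.get)
print("best n_clusters by silhouette:", best_k)
```

Selecting by silhouette only compares the candidates you actually try, so the candidate range is itself a modeling decision.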
