Analyzing the Breast Cancer Dataset with Support Vector Machines

This content is licensed under CC 4.0 BY-SA.

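This post shows how to classify a breast cancer dataset with a support vector machine (SVM) and a perceptron. The experiment computes accuracy, precision, recall, and F1-score, plots the ROC curve, and evaluates the model with 5-fold cross-validation. Comparing the two algorithms shows how their results differ on a practical problem.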

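Experiment requirements

The data is already split into a training set and a test set. It is a binary classification problem (negative vs. positive), to be modeled with a support vector machine.
Specific requirements:
(1) Report the classification metrics accuracy, precision, recall, and F1-score, plot the final ROC curve, and report the AUC.
(2) Train and test a perceptron as well, and compare the results of the two algorithms.
(3) Validate with 5-fold cross-validation.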
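
As a rough illustration of what the four metrics in requirement (1) measure, the sketch below derives them from confusion-matrix counts. The helper name `classification_metrics` and the toy labels are made up for this example and are not part of the experiment's data; it also assumes the positive class is labeled 1.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On these toy labels there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out to 0.75.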

Data preview

[Figure: preview of the dataset]

Code

Imports

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.linear_model import Perceptron
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, cross_val_predict

Loading the data

# Load the data; .iloc[:, 1:] drops the first column of each file
train = pd.read_csv(r"C:\Users\guo\Desktop\课程\医学数据挖掘\实验3-支持向量机分析乳腺癌数据实验\breast-cancer-train.csv").iloc[:, 1:]
test = pd.read_csv(r"C:\Users\guo\Desktop\课程\医学数据挖掘\实验3-支持向量机分析乳腺癌数据实验\breast-cancer-test.csv").iloc[:, 1:]

# The last column is the label; every other column is a feature
x_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
x_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]
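
To make the slicing in the loading step concrete, here is a minimal sketch on a tiny in-memory CSV. The column names (`id`, `feature_a`, `feature_b`, `label`) are hypothetical; the actual column layout of the breast cancer files is not shown in this post, so this only assumes a leading column to discard and the label in the last column.

```python
import io

import pandas as pd

# Hypothetical layout: a leading column to discard, feature columns, label last
csv_text = """id,feature_a,feature_b,label
0,5,1,0
1,8,7,1
2,3,2,0
"""

toy = pd.read_csv(io.StringIO(csv_text)).iloc[:, 1:]  # drop the first column
x_toy, y_toy = toy.iloc[:, :-1], toy.iloc[:, -1]      # features, then label

print(list(x_toy.columns))  # ['feature_a', 'feature_b']
print(y_toy.tolist())       # [0, 1, 0]
```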

SVM

# Build a support vector machine with a linear kernel.
# probability=True enables predict_proba, which the ROC curve needs.
model = svm.SVC(kernel='linear', probability=True)
model.fit(x_train, y_train)

# Predict once and reuse the results for every metric
y_pred = model.predict(x_test)
y_prob = model.predict_proba(x_test)[:, 1]  # probability of the positive class

print("Accuracy:", model.score(x_test, y_test))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))

# Plot the ROC curve and show the AUC in the legend
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, linewidth=2, label="ROC(AUC=%0.3f)" % roc_auc_score(y_test, y_prob),
         color="green")
plt.xlabel('FPR')  # False Positive Rate
plt.ylabel('TPR')  # True Positive Rate
plt.ylim(0, 1.05)
plt.xlim(0, 1.05)
plt.legend(loc=4)
plt.show()

[Figure: ROC curve of the SVM on the test set]
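
For readers unfamiliar with how the AUC relates to the ROC curve, the following sketch compares three ways of getting the same number on four toy scores: the area under the points returned by `roc_curve`, the direct `roc_auc_score` call, and the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The labels and scores are invented for illustration and have nothing to do with the breast cancer data.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and classifier scores (illustrative only)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# 1) Area under the ROC curve points
fpr, tpr, thresholds = roc_curve(y_true, scores)
area_from_curve = auc(fpr, tpr)

# 2) Direct computation
area_direct = roc_auc_score(y_true, scores)

# 3) Pairwise interpretation: how often a positive outranks a negative
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
area_pairwise = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(area_from_curve, area_direct, area_pairwise)
```

Three of the four positive/negative pairs are ranked correctly here, so all three values are 0.75.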

Perceptron

# Perceptron
model = Perceptron()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
# Perceptron has no predict_proba; decision_function returns the signed
# distance to the separating hyperplane, which roc_curve accepts as a score.
# Passing hard 0/1 predictions instead would reduce the curve to a single
# operating point.
y_score = model.decision_function(x_test)

print("Accuracy:", model.score(x_test, y_test))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))

# Plot the ROC curve and show the AUC in the legend
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, linewidth=2, label="ROC(AUC=%0.3f)" % roc_auc_score(y_test, y_score), color="green")
plt.xlabel('FPR')  # False Positive Rate
plt.ylabel('TPR')  # True Positive Rate
plt.ylim(0, 1.05)
plt.xlim(0, 1.05)
plt.legend(loc=4)
plt.show()

[Figure: ROC curve of the perceptron on the test set]
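
One detail worth checking when drawing a ROC curve for a perceptron is what gets passed to `roc_curve`. The sketch below, which runs on synthetic data from `make_classification` rather than the breast cancer files (an assumption made so that it is self-contained), contrasts hard 0/1 predictions with the continuous output of `decision_function`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the breast cancer CSV files)
X, y = make_classification(n_samples=300, n_features=8, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = Perceptron(random_state=0).fit(X_train, y_train)

hard_labels = clf.predict(X_test)        # 0/1 predictions
margins = clf.decision_function(X_test)  # signed distances to the hyperplane

fpr_hard, tpr_hard, _ = roc_curve(y_test, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_test, margins)

print("ROC points from hard labels:", len(fpr_hard))
print("ROC points from margins:", len(fpr_soft))
print("AUC from hard labels:", roc_auc_score(y_test, hard_labels))
print("AUC from margins:", roc_auc_score(y_test, margins))
```

With hard labels there are at most two distinct score values, so the curve has at most three points and describes a single operating point. The margins let the threshold sweep across the samples and produce a proper curve.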

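5-fold cross-validation

# Concatenate the training and test sets for 5-fold cross-validation.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
x = pd.concat([x_train, x_test])
y = pd.concat([y_train, y_test])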

# 5-fold cross-validation
model = svm.SVC(kernel='linear', probability=True)
scores = cross_val_score(model, x, y, cv=5)
print('scores:', scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
# Precision
print("Precision:", cross_val_score(model, x, y, cv=5, scoring='precision').mean())
# Recall
print("Recall:", cross_val_score(model, x, y, cv=5, scoring='recall').mean())
# F1 score
print("F1-Score:", cross_val_score(model, x, y, cv=5, scoring='f1').mean())

# Plot the ROC curve from the out-of-fold decision scores.
# Note: the curve pools the scores from all five folds, while the AUC in the
# legend is the mean of the five per-fold AUC values.
y_scores = cross_val_predict(model, x, y, cv=5, method='decision_function')
fpr, tpr, thresholds = roc_curve(y, y_scores)
plt.plot(fpr, tpr, linewidth=2, label='ROC(AUC=%0.3f)' % cross_val_score(model, x, y, cv=5, scoring='roc_auc').mean(),
         color='green')
plt.xlabel('FPR')  # False Positive Rate
plt.ylabel('TPR')  # True Positive Rate
plt.ylim(0, 1.05)
plt.xlim(0, 1.05)
plt.legend(loc=4)
plt.show()
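
To show what 5-fold cross-validation does under the hood, the sketch below writes the fold loop by hand with `StratifiedKFold` and compares it with `cross_validate`, which can also collect several metrics in a single pass instead of one `cross_val_score` call per metric. It runs on synthetic data from `make_classification`, not the breast cancer files, and is an illustration rather than a replacement for the code above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in data (not the breast cancer CSV files)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = SVC(kernel="linear")

# Manual 5-fold loop: train on four folds, score on the held-out fold
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# Same splits via cross_validate, collecting all metrics in one pass
results = cross_validate(
    model, X, y, cv=5,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)

print("manual accuracy per fold:   ", manual_scores)
print("cross_validate accuracy:    ", results["test_accuracy"])
for name in ["precision", "recall", "f1", "roc_auc"]:
    print("mean %s: %.3f" % (name, results["test_" + name].mean()))
```

For a classifier, an integer `cv=5` uses stratified folds without shuffling, so the per-fold accuracies from the manual loop should match those from `cross_validate`.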