A Machine Learning Model for Loan Status Prediction

Latest recommended article published on 2026-06-17 10:15:29

This content is licensed under CC 4.0 BY-SA.

Tags

#machine-learning #artificial-intelligence #decision-tree #random-forest

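Problem

We need to decide whether a person can be granted a loan, based on his or her income, education, work experience, previous loans, and other factors.

Analysis

This loan status prediction dataset contains records of applicants who previously applied for a property loan.
The bank decides whether to lend to an applicant based on factors such as the applicant's income, the loan amount, previous credit history, and the co-applicant's income.
Our goal is to build a machine learning model that predicts whether an applicant's loan will be approved or rejected.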

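Column descriptions

Loan_ID: unique loan ID.
Gender: Male or Female.
Married: marital status.
Dependents: number of people who depend on the applicant.
Education: applicant's education (Graduate or Not Graduate).
Self_Employed: whether the applicant is self-employed (Yes/No).
ApplicantIncome: applicant's income.
CoapplicantIncome: co-applicant's income.
LoanAmount: loan amount, in thousands.
Loan_Amount_Term: loan term, in months.
Credit_History: whether the credit history meets the guidelines.
Property_Area: whether the applicant lives in an urban, semi-urban, or rural area.
Loan_Status: loan approved (Y/N).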
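The column list above can double as a quick sanity check when a file is loaded. The following is a minimal sketch: the helper name `check_columns` is hypothetical, and the toy row holds illustrative values rather than real records.

```python
import pandas as pd

# Column names taken from the list above.
EXPECTED_COLUMNS = [
    "Loan_ID", "Gender", "Married", "Dependents", "Education",
    "Self_Employed", "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
    "Loan_Amount_Term", "Credit_History", "Property_Area", "Loan_Status",
]

def check_columns(frame):
    """Return the expected columns that are missing from `frame` (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]

# Toy one-row frame with illustrative values (an assumption, not real data).
toy = pd.DataFrame([{
    "Loan_ID": "LP000000", "Gender": "Male", "Married": "Yes",
    "Dependents": "0", "Education": "Graduate", "Self_Employed": "No",
    "ApplicantIncome": 3000, "CoapplicantIncome": 0.0, "LoanAmount": 100.0,
    "Loan_Amount_Term": 360.0, "Credit_History": 1.0,
    "Property_Area": "Urban", "Loan_Status": "Y",
}])

missing_in_toy = check_columns(toy)
missing_after_drop = check_columns(toy.drop(columns=["Gender"]))
print(missing_in_toy, missing_after_drop)
```

An empty list means every documented column is present; any name in the list points at a column that was renamed or dropped upstream.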

1. Importing packages

This model uses the numpy, pandas, matplotlib, and seaborn packages.

import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

2. Loading the data

df = pd.read_csv("/kaggle/input/loan-status-prediction/loan_data.csv")

df.head()

	Loan_ID	Gender	Married	Dependents	Education	Self_Employed	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Property_Area	Loan_Status
0	LP001003	Male	Yes	1	Graduate	No	4583	1508.0	128.0	360.0	1.0	Rural	N
1	LP001005	Male	Yes	0	Graduate	Yes	3000	0.0	66.0	360.0	1.0	Urban	Y
2	LP001006	Male	Yes	0	Not Graduate	No	2583	2358.0	120.0	360.0	1.0	Urban	Y
3	LP001008	Male	No	0	Graduate	No	6000	0.0	141.0	360.0	1.0	Urban	Y
4	LP001013	Male	Yes	0	Not Graduate	No	2333	1516.0	95.0	360.0	1.0	Urban	Y

df = df.drop(['Loan_ID'], axis=1)

# Number of rows and columns in the dataset
df.shape

(381, 12)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381 entries, 0 to 380
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             376 non-null    object 
 1   Married            381 non-null    object 
 2   Dependents         373 non-null    object 
 3   Education          381 non-null    object 
 4   Self_Employed      360 non-null    object 
 5   ApplicantIncome    381 non-null    int64  
 6   CoapplicantIncome  381 non-null    float64
 7   LoanAmount         381 non-null    float64
 8   Loan_Amount_Term   370 non-null    float64
 9   Credit_History     351 non-null    float64
 10  Property_Area      381 non-null    object 
 11  Loan_Status        381 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 35.8+ KB

3. Handling missing values

df.isnull().sum()

Gender                5
Married               0
Dependents            8
Education             0
Self_Employed        21
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     11
Credit_History       30
Property_Area         0
Loan_Status           0
dtype: int64
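To judge how much of each column is affected, the counts above can be expressed as shares of the 381 rows. This is a minimal sketch; the dictionary copies the non-zero counts from the output above.

```python
# Non-zero missing-value counts copied from the df.isnull().sum() output above.
missing = {
    "Gender": 5,
    "Dependents": 8,
    "Self_Employed": 21,
    "Loan_Amount_Term": 11,
    "Credit_History": 30,
}
n_rows = 381  # from df.shape

# Share of missing values per column, in percent.
missing_pct = {col: round(100 * n / n_rows, 2) for col, n in missing.items()}

# Print the columns from most to least affected.
for col, pct in sorted(missing_pct.items(), key=lambda kv: -kv[1]):
    print(f"{col:<18}{pct:>6.2f}%")
```

On the real frame, `df.isnull().mean() * 100` should give the same percentages directly. No column is missing more than about 8% of its values, which is one reason simple imputation is a plausible choice here.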

# Fill categorical and discrete columns with their most frequent value (the mode)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode().iloc[0])
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode().iloc[0])
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode().iloc[0]).astype(int)
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode().iloc[0]).astype(int)

# Map the Dependents labels to numbers ('3+' becomes 3), then fill with the mode
df['Dependents'] = df['Dependents'].replace(['0', '1', '2', '3+'], [0, 1, 2, 3])
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode().iloc[0])

# These columns have no missing values; casting to int drops any fractional part
df['CoapplicantIncome'] = df['CoapplicantIncome'].astype(int)
df['LoanAmount'] = df['LoanAmount'].astype(int)

df.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
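The per-column mode fills above all follow one pattern, which can be sketched as a small helper. The function name `fill_with_mode` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_mode(frame, columns):
    """Return a copy of `frame` with NaNs in `columns` replaced by each column's mode."""
    out = frame.copy()
    for col in columns:
        # Series.mode() ignores NaN by default; iloc[0] picks the first mode on ties.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy frame with one gap per column (illustrative values only).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
})

filled = fill_with_mode(toy, ["Gender", "Credit_History"])
print(filled)
```

Mode imputation keeps the column's existing set of values, but it also pushes the distribution toward the majority category, which matters most for `Credit_History`, the column with the most gaps.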

4. Converting categorical data to numeric form

def cat_to_num(df, c_var):
    # Binary columns: code the two categories 0/1 in order of first appearance
    for i in c_var:
        uniques_value = df[i].unique()
        df[i] = df[i].map(dict(zip(uniques_value, [0, 1])))

    # Property_Area has three categories: code them 0/1/2
    for i in ['Property_Area']:
        uniques_value = df[i].unique()
        df[i] = df[i].map(dict(zip(uniques_value, [0, 1, 2])))

c_variables = ['Gender', 'Married', 'Education', 'Self_Employed', 'Loan_Status']

cat_to_num(df, c_variables)

df.head()

	Gender	Married	Dependents	Education	Self_Employed	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Property_Area	Loan_Status
0	0	0	1.0	0	0	4583	1508	128	360	1	0	0
1	0	0	0.0	0	1	3000	0	66	360	1	1	1
2	0	0	0.0	1	0	2583	2358	120	360	1	1	1
3	0	1	0.0	0	0	6000	0	141	360	1	1	1
4	0	0	0.0	1	0	2333	1516	95	360	1	1	1
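Because `unique()` returns values in order of first appearance, the codes above depend on how the rows happen to be ordered. A minimal sketch of an order-independent alternative uses explicit lookup tables. The 0/1 assignments mirror the `df.head()` output above; the label `Semiurban` and its code 2 are assumptions, since that category does not appear in the rows shown.

```python
import pandas as pd

# Explicit code tables. Values match the head() output where visible;
# "Semiurban": 2 is an assumption (the category is not shown in the sample rows).
ENCODINGS = {
    "Gender": {"Male": 0, "Female": 1},
    "Married": {"Yes": 0, "No": 1},
    "Education": {"Graduate": 0, "Not Graduate": 1},
    "Self_Employed": {"No": 0, "Yes": 1},
    "Property_Area": {"Rural": 0, "Urban": 1, "Semiurban": 2},
    "Loan_Status": {"N": 0, "Y": 1},
}

def encode(frame, encodings=ENCODINGS):
    """Return a copy of `frame` with each listed column mapped through its table."""
    out = frame.copy()
    for col, table in encodings.items():
        if col in out.columns:
            out[col] = out[col].map(table)
    return out

# Toy frame with a few of the columns (illustrative values only).
toy = pd.DataFrame({
    "Married": ["Yes", "No"],
    "Property_Area": ["Rural", "Urban"],
    "Loan_Status": ["N", "Y"],
})
encoded = encode(toy)
print(encoded)
```

With fixed tables, the same label always gets the same code, even if the file is re-sorted or a new batch starts with a different first row. Any label missing from a table becomes NaN, which makes unexpected categories easy to spot.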

5. Data visualization

Analyzing the categorical columns

fig, ax = plt.subplots(3, 2, figsize=(12,15))

for index, cat_col in enumerate(c_variables):
    row, col = index//2, index%2
    sns.countplot(x=cat_col, data=df, hue='Loan_Status', ax=ax[row, col])

plt.subplots_adjust(hspace=1)

Analyzing the numerical columns

numerical_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']

fig,axes = plt.subplots(1,3,figsize=(17,5))
for idx,cat_col in enumerate(numerical_columns):
    sns.boxplot(y=cat_col,data=df,x='Loan_Status',ax=axes[idx])

print(df[numerical_columns].describe())
plt.subplots_adjust(hspace=1)
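A boxplot with seaborn's default whiskers marks points beyond 1.5 times the interquartile range as outliers. The same rule can be applied numerically to count them. This is a minimal sketch: the helper `iqr_outlier_count` is hypothetical, and the toy series combines the five `ApplicantIncome` values shown earlier with one made-up extreme value.

```python
import pandas as pd

def iqr_outlier_count(series, k=1.5):
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# First five ApplicantIncome values from df.head(), plus one invented extreme (20000).
incomes = pd.Series([4583, 3000, 2583, 6000, 2333, 20000], name="ApplicantIncome")

n_outliers = iqr_outlier_count(incomes)
print(f"{incomes.name}: {n_outliers} outlier(s)")
```

Applied to the real frame, `df[numerical_columns].apply(iqr_outlier_count)` would give one count per numerical column.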

6. Data preprocessing

X = df.drop(['Loan_Status'], axis=1)
y = df['Loan_Status']

X.shape, y.shape

((381, 11), (381,))

Splitting into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((304, 11), (304,), (77, 11), (77,))
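With a modest dataset and unbalanced classes, a purely random split can leave the test set with a different class ratio than the training set. A minimal sketch of a stratified split follows, using synthetic data; the 30/70 class ratio is an assumption chosen for illustration, not the ratio of the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% class 0 and 70% class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 30 + [1] * 70)

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train share of class 1:", y_tr.mean())
print("test share of class 1: ", y_te.mean())
```

For the loan data, the equivalent change would be to pass `stratify=y` to the `train_test_split` call shown above.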

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training set only, then reuse its statistics on the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
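One way to make sure the scaling statistics come from the training split alone is to wrap the scaler and the classifier in a scikit-learn `Pipeline`. The sketch below uses synthetic data and illustrative hyperparameters, not the article's own.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 3))
y_train_demo = (X_train_demo[:, 0] > 5.0).astype(int)

# fit() learns the scaling statistics from the training data only;
# predict() reuses them on whatever data it is given.
pipe = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(max_depth=3, random_state=0),
)
pipe.fit(X_train_demo, y_train_demo)
pred_demo = pipe.predict(X_test_demo)

# make_pipeline names each step after its lowercased class name.
fitted_scaler = pipe.named_steps["standardscaler"]
print(fitted_scaler.mean_)
```

Decision trees split on thresholds, so their predictions do not change under per-feature rescaling. The scaling step matters more if the tree is later swapped for a distance-based or linear model.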

7. Model: decision tree classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,roc_auc_score

model = DecisionTreeClassifier(max_depth=3,min_samples_leaf = 35)

model.fit(X_train,y_train)
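A shallow tree like this one is small enough to read as a set of rules. A minimal sketch with scikit-learn's `export_text` follows. The data is synthetic and the feature names are borrowed from the dataset only as labels, so the printed rules say nothing about the real model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: two binary features, label equal to the first feature.
rng = np.random.default_rng(0)
X_rules = rng.integers(0, 2, size=(200, 2)).astype(float)
y_rules = X_rules[:, 0].astype(int)

# Same hyperparameters as the article's model, plus a fixed seed for repeatability.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=35, random_state=0)
tree.fit(X_rules, y_rules)

# Render the fitted tree as indented if/else rules.
rules = export_text(tree, feature_names=["Credit_History", "Married"])
print(rules)
```

For the article's model, `export_text(model, feature_names=list(X.columns))` should work as long as `X` still refers to the unscaled DataFrame. The thresholds would then be in standardized units, because the model was trained on scaled features.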

8. Testing

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
roc_score = roc_auc_score(y_test, y_pred)

print(f'Accuracy Score: {accuracy*100:0.2f}%')
print(f'Roc Score: {roc_score*100:0.2f}%')

Accuracy Score: 81.82%
Roc Score: 66.67%
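`roc_auc_score` is designed for continuous scores. When it receives hard 0/1 predictions, as above, the ROC curve collapses to a single operating point and the result equals the average of the two class recalls. The sketch below contrasts the two uses on a tiny hand-made example; the labels and scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.6, 0.55, 0.7, 0.9])

# Hard predictions obtained by thresholding at 0.5.
hard = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, hard)    # one threshold only

print(f"AUC from scores: {auc_from_scores:.4f}")
print(f"AUC from labels: {auc_from_labels:.4f}")
```

For the article's model, the score-based variant would be `roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])`. Its value is not reported here because it depends on the fitted tree.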

pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted	0	1	All
True
0	7	14	21
1	0	56	56
All	7	70	77
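The table above contains everything needed to recompute the reported scores and a few more. The sketch below uses the four counts from the table and assumes class 1 means "approved", which matches the `Y` to 1 coding visible in the encoded `df.head()` output.

```python
# Counts read from the crosstab above (rows = true class, columns = predicted class).
tn, fp = 7, 14   # true class 0: predicted 0, predicted 1
fn, tp = 0, 56   # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of correct predictions
precision = tp / (tp + fp)            # predicted approvals that were truly approved
recall = tp / (tp + fn)               # true approvals that were found
specificity = tn / (tn + fp)          # true rejections that were found
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:          {accuracy * 100:0.2f}%")
print(f"Precision (1):     {precision * 100:0.2f}%")
print(f"Recall (1):        {recall * 100:0.2f}%")
print(f"Specificity (0):   {specificity * 100:0.2f}%")
print(f"Balanced accuracy: {balanced_accuracy * 100:0.2f}%")
print(f"F1 (1):            {f1 * 100:0.2f}%")
```

The accuracy reproduces the 81.82% above, and the balanced accuracy reproduces the 66.67% reported as the ROC score. The breakdown also shows where the errors are: every true approval is found, but only 7 of the 21 true rejections are predicted as rejections.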