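Problem
Based on a person's income, education, work experience, previous loans, and other factors, we need to decide whether he or she can be granted the requested loan.
Analysis
- This loan status prediction dataset contains records of past applicants who applied for a property loan, described by the attributes listed below.
- The bank decides whether to lend based on factors such as the applicant's income, the loan amount, prior credit history, and the co-applicant's income.
- Our goal is to build a machine learning model that predicts whether an applicant's loan is approved or rejected.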
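Column descriptions
Loan_ID: unique loan ID.
Gender: male or female.
Married: marital status.
Dependents: number of people who depend on the applicant.
Education: applicant's education (Graduate or Not Graduate).
Self_Employed: self-employed (yes/no).
ApplicantIncome: applicant's income.
CoapplicantIncome: co-applicant's income.
LoanAmount: loan amount, in thousands.
Loan_Amount_Term: loan term, in months.
Credit_History: whether the credit history meets the guidelines.
Property_Area: whether the applicant lives in an urban, semi-urban, or rural area.
Loan_Status: loan approved (Y/N).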
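Before any processing, it can help to confirm that a loaded file really has these columns. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the toy frame stands in for the real CSV.

```python
import pandas as pd

# Column names as listed in the description above.
EXPECTED_COLUMNS = [
    "Loan_ID", "Gender", "Married", "Dependents", "Education",
    "Self_Employed", "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
    "Loan_Amount_Term", "Credit_History", "Property_Area", "Loan_Status",
]


def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from `frame`."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]


# Toy frame (illustration only): it deliberately lacks the target column.
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1])
print(missing_columns(toy))
```

An empty list means every expected column is present; anything else points to a renamed or missing field before the later steps fail in less obvious ways.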
1. Importing packages
This model uses numpy, pandas, matplotlib, and seaborn, imported as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
2. Loading the data
df = pd.read_csv("/kaggle/input/loan-status-prediction/loan_data.csv")
df.head()
| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 1 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 2 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | Urban | Y |
df = df.drop(['Loan_ID'], axis=1)
# Number of rows and columns in the dataset
df.shape
(381, 12)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381 entries, 0 to 380
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 376 non-null object
1 Married 381 non-null object
2 Dependents 373 non-null object
3 Education 381 non-null object
4 Self_Employed 360 non-null object
5 ApplicantIncome 381 non-null int64
6 CoapplicantIncome 381 non-null float64
7 LoanAmount 381 non-null float64
8 Loan_Amount_Term 370 non-null float64
9 Credit_History 351 non-null float64
10 Property_Area 381 non-null object
11 Loan_Status 381 non-null object
dtypes: float64(4), int64(1), object(7)
memory usage: 35.8+ KB
3. Handling missing values in the dataset
df.isnull().sum()
Gender 5
Married 0
Dependents 8
Education 0
Self_Employed 21
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 11
Credit_History 30
Property_Area 0
Loan_Status 0
dtype: int64
df['Gender'] = df['Gender'].fillna(df['Gender'].mode().iloc[0])
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode().iloc[0])
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode().iloc[0]).astype(int)
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode().iloc[0]).astype(int)
df['Dependents'] = df['Dependents'].replace(['0', '1', '2', '3+'], [0,1,2,3,])
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode().iloc[0])
df['CoapplicantIncome'] = df['CoapplicantIncome'].astype(int)
df['LoanAmount'] = df['LoanAmount'].astype(int)
df.isnull().sum()
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
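The fills above all follow the same pattern: replace each missing entry with the column's most frequent value. A minimal sketch of that pattern as a reusable helper is shown below. The function name `fill_with_mode` and the toy data are assumptions made for illustration and are not taken from loan_data.csv.

```python
import numpy as np
import pandas as pd


def fill_with_mode(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of `frame` where NaNs in `columns` are replaced by each column's mode."""
    out = frame.copy()
    for col in columns:
        # mode() ignores NaN by default; iloc[0] picks the first of any tied modes.
        most_common = out[col].mode().iloc[0]
        out[col] = out[col].fillna(most_common)
    return out


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
})
filled = fill_with_mode(toy, ["Gender", "Credit_History"])
print(filled)
```

Mode imputation is simple and keeps categorical columns valid, but it pulls every missing row toward the majority value, so it is worth checking how many rows are affected in columns such as Credit_History.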
4. Converting categorical data to numeric form
def cat_to_num(df, c_var):
    # Binary columns: codes follow the order in which the values first appear
    for i in c_var:
        uniques_value = df[i].unique()
        df[i] = df[i].map(dict(zip(uniques_value, [0, 1])))
    # Property_Area has three categories
    for i in ['Property_Area']:
        uniques_value = df[i].unique()
        df[i] = df[i].map(dict(zip(uniques_value, [0, 1, 2])))
c_variables = ['Gender', 'Married', 'Education', 'Self_Employed', 'Loan_Status']
cat_to_num(df, c_variables)
df.head()
| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1.0 | 0 | 0 | 4583 | 1508 | 128 | 360 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0.0 | 0 | 1 | 3000 | 0 | 66 | 360 | 1 | 1 | 1 |
| 2 | 0 | 0 | 0.0 | 1 | 0 | 2583 | 2358 | 120 | 360 | 1 | 1 | 1 |
| 3 | 0 | 1 | 0.0 | 0 | 0 | 6000 | 0 | 141 | 360 | 1 | 1 | 1 |
| 4 | 0 | 0 | 0.0 | 1 | 0 | 2333 | 1516 | 95 | 360 | 1 | 1 | 1 |
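Because the codes above depend on the order in which values first appear in the file, a different row order would produce a different encoding. A hedged alternative is to spell the mappings out. In the sketch below, the codes mirror what the first rows of the table show (for example Male is 0 and Y is 1); the code chosen for Semiurban and the helper name `encode` are assumptions introduced here.

```python
import pandas as pd

# Explicit value-to-code mappings (Semiurban's code is an assumption).
MAPPINGS = {
    "Gender": {"Male": 0, "Female": 1},
    "Married": {"Yes": 0, "No": 1},
    "Education": {"Graduate": 0, "Not Graduate": 1},
    "Self_Employed": {"No": 0, "Yes": 1},
    "Property_Area": {"Rural": 0, "Urban": 1, "Semiurban": 2},
    "Loan_Status": {"N": 0, "Y": 1},
}


def encode(frame: pd.DataFrame, mappings: dict) -> pd.DataFrame:
    """Return a copy of `frame` with the mapped columns converted to integer codes."""
    out = frame.copy()
    for col, mapping in mappings.items():
        # Fail loudly on values the mapping does not cover.
        unknown = set(out[col].dropna().unique()) - set(mapping)
        if unknown:
            raise ValueError(f"Unmapped values in {col}: {sorted(unknown)}")
        # Assumes missing values were filled beforehand.
        out[col] = out[col].map(mapping).astype("int64")
    return out


# Toy rows (illustration only).
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Married": ["Yes", "No"],
    "Education": ["Graduate", "Not Graduate"],
    "Self_Employed": ["No", "Yes"],
    "Property_Area": ["Rural", "Semiurban"],
    "Loan_Status": ["N", "Y"],
})
encoded = encode(toy, MAPPINGS)
print(encoded)
```

With fixed mappings, the meaning of 0 and 1 in later plots and in the confusion matrix no longer depends on the data file's row order.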
5. Data visualization
Analyzing the categorical columns by loan status
fig, ax = plt.subplots(3, 2, figsize=(12,15))
for index, cat_col in enumerate(c_variables):
    row, col = index//2, index%2
    sns.countplot(x=cat_col, data=df, hue='Loan_Status', ax=ax[row, col])
plt.subplots_adjust(hspace=1)
Analyzing the numerical columns
numerical_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']
fig,axes = plt.subplots(1,3,figsize=(17,5))
for idx,cat_col in enumerate(numerical_columns):
    sns.boxplot(y=cat_col,data=df,x='Loan_Status',ax=axes[idx])
print(df[numerical_columns].describe())
plt.subplots_adjust(hspace=1)
6. Data preprocessing
X = df.drop(['Loan_Status'], axis=1)
y = df['Loan_Status']
X.shape, y.shape
((381, 11), (381,))
Splitting into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((304, 11), (304,), (77, 11), (77,))
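The split above is purely random. When one class is noticeably more common than the other, passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. The sketch below illustrates this on synthetic data; the arrays `X_demo` and `y_demo` are made up for the example and are not the loan dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70 positives and 30 negatives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 70 + [0] * 30)

# stratify=y_demo preserves the 70/30 ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive share:", y_tr.mean())
print("test positive share: ", y_te.mean())
```

On a dataset of only a few hundred rows, this makes the test metrics less sensitive to an unlucky draw of the minority class.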
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics for the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
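One way to keep the scaling statistics tied to the training data is to wrap the scaler and the classifier in a scikit-learn pipeline, so that `fit` learns the mean and variance from the training rows only and `predict` or `score` reuses them. This is a sketch on synthetic data, not the notebook's own code; `X_demo`, `y_demo`, and the 160/40 split are assumptions made for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# The pipeline fits the scaler on X_tr only and applies the same transform to X_te.
pipe = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(max_depth=3, min_samples_leaf=35, random_state=0),
)
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Tree-based models do not strictly need standardized inputs, since their splits depend only on the ordering of values within a feature, but the pipeline form stays correct if the classifier is later swapped for one that does.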
7. Model: decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
model = DecisionTreeClassifier(max_depth=3,min_samples_leaf = 35)
model.fit(X_train,y_train)
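A shallow tree like this one is small enough to read directly. As a sketch, `export_text` from `sklearn.tree` prints the learned rules; the example below uses synthetic data from `make_classification` and placeholder feature names (`f0` to `f3`), which are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the loan features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Same hyperparameters as in the text: at most 3 levels, at least 35 samples per leaf.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=35, random_state=0)
tree.fit(X_demo, y_demo)

# Print the decision rules as indented text.
rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

Both `max_depth` and `min_samples_leaf` limit how finely the tree can partition the data, which matters on a dataset with only a few hundred rows.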
8. Testing
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
roc_score = roc_auc_score(y_test, y_pred)
print(f'Accuracy Score: {accuracy*100:0.2f}%')
print(f'Roc Score: {roc_score*100:0.2f}%')
Accuracy Score: 81.82%
Roc Score: 66.67%
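Note that `roc_auc_score` is called here with hard 0/1 predictions. In that case the ROC curve has a single operating point, and for a binary problem the score reduces to balanced accuracy. Passing predicted probabilities from `predict_proba` uses the full ranking instead. The sketch below compares the two on synthetic data; the dataset and variable names are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=35, random_state=0)
clf.fit(X_tr, y_tr)

labels = clf.predict(X_te)              # hard 0/1 predictions
scores = clf.predict_proba(X_te)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)
balanced_acc = balanced_accuracy_score(y_te, labels)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {balanced_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

The first two printed numbers coincide, which is why a probability-based AUC is usually the more informative figure to report alongside accuracy.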
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
| Predicted | 0 | 1 | All |
|---|---|---|---|
| True | | | |
| 0 | 7 | 14 | 21 |
| 1 | 0 | 56 | 56 |
| All | 7 | 70 | 77 |
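The reported scores can be recomputed by hand from this table. The short sketch below copies the four cell counts and derives the usual metrics; the variable names are introduced here for illustration.

```python
# Cell counts copied from the confusion matrix above
# (rows are true labels, columns are predictions; 1 = approved, 0 = rejected).
tn, fp = 7, 14
fn, tp = 0, 56

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # share of predicted approvals that were correct
recall = tp / (tp + fn)         # share of true approvals that were found
specificity = tn / (tn + fp)    # share of true rejections that were found
balanced_accuracy = (recall + specificity) / 2

print(f"Accuracy:          {accuracy:.2%}")
print(f"Precision:         {precision:.2%}")
print(f"Recall:            {recall:.2%}")
print(f"Specificity:       {specificity:.2%}")
print(f"Balanced accuracy: {balanced_accuracy:.2%}")
```

The accuracy comes out at 81.82% and the balanced accuracy at 66.67%, matching the two printed scores. The breakdown also shows where the errors are: every true approval is found, but only 7 of the 21 rejected applications are identified as rejections.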