I like your work. I’ll need this back NLT Friday. If you can’t do this, please let me know.
Home Equity Loan Classification Analysis
We begin by loading relevant R libraries needed for analysis. Recall that the hashtag symbol (#) is used to denote comments in the R code.
In [1]:
Data Import Next we read in the data. Be sure that the dataset file has been uploaded to your notebook before attempting to read the data in.
In [2]:
Examine the structure and summary of the dataset.
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ─ ─ ✔ ggplot2 3.2.0 ✔ purrr 0.3.3 ✔ tibble 2.1.3 ✔ dplyr 0.8.1 ✔ tidyr 0.8.3 ✔ stringr 1.4.0 ✔ readr 1.3.1 ✔ forcats 0.4.0 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ─ ─ ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
library(tidyverse) #for data cleaning and visualization library(caTools) #for splitting library(rpart) #for creating classification trees library(rpart.plot) #for plotting classification trees
home = read.csv(“Home Equity Loan Dataset.csv”) #reads-in the dataset that is in
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [3]:
There are 5,960 observations (rows) and 13 variables (columns) in this dataset. The variables are:
BAD Customer default on loan, “Yes” or “No” LOAN_AMT Amount of home equity loan MORTGAGE_REMAIN How much is owed on home mortgage PROPERTY_VALUE Value of property from which equity is being borrowed REASON Customer’s stated reason for home equity loan JOB Customer’s job YRS_JOB How long at current job DEROG Number of derogatory “marks” on customer’s credit history
‘data.frame’: 5960 obs. of 13 variables: $ BAD : Factor w/ 2 levels “No”,”Yes”: 2 2 2 2 1 2 2 2 2 2 … $ LOAN_AMT : int 1100 1300 1500 1500 1700 1700 1800 1800 2000 200 0 … $ MORTGAGE_REMAIN : num 25860 70053 13500 NA 97800 … $ PROPERTY_VALUE : num 39025 68400 16700 NA 112000 … $ REASON : Factor w/ 2 levels “DebtCon”,”HomeImp”: 2 2 2 NA 2 2 2 2 2 2 … $ JOB : Factor w/ 6 levels “Mgr”,”Office”,..: 3 3 3 NA 2 3 3 3 3 5 … $ YRS_JOB : num 10.5 7 4 NA 3 9 5 11 3 16 … $ DEROG : int 0 0 0 NA 0 0 3 0 0 0 … $ DELINQ : int 0 2 0 NA 0 0 2 0 2 0 … $ OLDEST_CRED_LINE_MTHS: num 94.4 121.8 149.5 NA 93.3 … $ NUM_RECENT_INQ : int 1 0 1 NA 0 1 1 0 1 0 … $ NUM_CRED_LINES : int 9 14 10 NA 14 8 17 8 12 13 … $ DEBT_INC_RATIO : num NA NA NA NA NA 37.1 NA 36.9 NA NA …
BAD LOAN_AMT MORTGAGE_REMAIN PROPERTY_VALUE REASON No :4771 Min. : 1100 Min. : 2063 Min. : 8000 DebtCon:3928 Yes:1189 1st Qu.:11100 1st Qu.: 46276 1st Qu.: 66076 HomeImp:1780 Median :16300 Median : 65019 Median : 89236 NA’s : 252 Mean :18608 Mean : 73761 Mean :101776 3rd Qu.:23300 3rd Qu.: 91488 3rd Qu.:119824 Max. :89900 Max. :399550 Max. :855909 NA’s :518 NA’s :112 JOB YRS_JOB DEROG DELINQ Mgr : 767 Min. : 0.000 Min. : 0.0000 Min. : 0.0000 Office : 948 1st Qu.: 3.000 1st Qu.: 0.0000 1st Qu.: 0.0000 Other :2388 Median : 7.000 Median : 0.0000 Median : 0.0000 ProfExe:1276 Mean : 8.922 Mean : 0.2546 Mean : 0.4494 Sales : 109 3rd Qu.:13.000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 Self : 193 Max. :41.000 Max. :10.0000 Max. :15.0000 NA’s : 279 NA’s :515 NA’s :708 NA’s :580 OLDEST_CRED_LINE_MTHS NUM_RECENT_INQ NUM_CRED_LINES DEBT_INC_RATIO Min. : 0.0 Min. : 0.000 Min. : 0.0 Min. : 0.50 1st Qu.: 115.1 1st Qu.: 0.000 1st Qu.:15.0 1st Qu.: 29.10 Median : 173.5 Median : 1.000 Median :20.0 Median : 34.80 Mean : 179.8 Mean : 1.186 Mean :21.3 Mean : 33.78 3rd Qu.: 231.6 3rd Qu.: 2.000 3rd Qu.:26.0 3rd Qu.: 39.00 Max. :1168.2 Max. :17.000 Max. :71.0 Max. :203.30 NA’s :308 NA’s :510 NA’s :222 NA’s :1267
str(home) #structure of data summary(home) #summary of data
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
DELINQ Number of delinquent “marks” on customers’s credit history OLDEST_CRED_LINE_MTHS Age (in months) of customer’s oldest line of credit NUM_RECENT_INQ Number of recent credit inquiries on customer’s credit history DEBT_INC_RATIO The ratio of the customer’s debt to their income
The response variable is BAD and indicates whether or not the home equity loan customer defaults on the loan or not. The variable is binary and R has already correctly identified that the variable is categorical (R encodes categorical variables as “factors”).
There is missing data in almost all of the variables. We need to carefully consider how this missingness may impact our analysis. Is the missingness itself a predictor? For example, does a missing value for debt to income ratio (DEBT_INC_RATIO) mean that the customer has no debt? If true, this is likely to be an indicator that the customer is unlikely to default. Is it reasonable to replace the missing values for DEBT_INC_RATIO with zeroes? Is this a reasonable approach for other variables?
Deciding what to do about missing data should take place before we split the dataset. Note that there is no “right” answer as to what to do with the missingness in this problem. For the sake of time, I will choose to “ignore” the missing data problem and move straight into data splitting and model development. This is probably not a wise idea 🙂
Split the Data
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [4]:
Visualize Variables As noted above, the response variable in this problem is the binary variable: BAD. This variable indicates whether the loan eventually went into default or not.
We begin by visualizing the relationships between each of the potential predictor variables and the variable BAD.
For each variable, we look at one or more different plots or tables. The purpose of this is to visually determine if/how each variable is related to BAD.
‘data.frame’: 4172 obs. of 13 variables: $ BAD : Factor w/ 2 levels “No”,”Yes”: 2 2 1 2 2 2 2 2 1 2 … $ LOAN_AMT : int 1100 1500 1700 1800 1800 2000 2000 2000 2000 210 0 … $ MORTGAGE_REMAIN : num 25860 13500 97800 48649 28502 … $ PROPERTY_VALUE : num 39025 16700 112000 57037 43034 … $ REASON : Factor w/ 2 levels “DebtCon”,”HomeImp”: 2 2 2 2 2 2 N A 2 NA 2 … $ JOB : Factor w/ 6 levels “Mgr”,”Office”,..: 3 3 2 3 3 5 NA 3 1 3 … $ YRS_JOB : num 10.5 4 3 5 11 16 18 3 2.5 8 … $ DEROG : int 0 0 0 3 0 0 NA 0 0 0 … $ DELINQ : int 0 0 0 2 0 0 NA 0 0 1 … $ OLDEST_CRED_LINE_MTHS: num 94.4 149.5 93.3 77.1 88.8 … $ NUM_RECENT_INQ : int 1 1 0 1 0 0 NA 2 0 0 … $ NUM_CRED_LINES : int 9 10 14 17 8 13 NA 25 24 16 … $ DEBT_INC_RATIO : num NA NA NA NA 36.9 NA NA NA NA NA … ‘data.frame’: 1788 obs. of 13 variables: $ BAD : Factor w/ 2 levels “No”,”Yes”: 2 2 2 2 2 2 2 2 2 2 … $ LOAN_AMT : int 1300 1500 1700 2000 2000 2200 2400 2400 2400 250 0 … $ MORTGAGE_REMAIN : num 70053 NA 30548 32700 20627 … $ PROPERTY_VALUE : num 68400 NA 40320 46740 29800 … $ REASON : Factor w/ 2 levels “DebtCon”,”HomeImp”: 2 NA 2 2 2 NA 2 2 2 2 … $ JOB : Factor w/ 6 levels “Mgr”,”Office”,..: 3 NA 3 3 2 NA 1 1 3 NA … $ YRS_JOB : num 7 NA 9 3 11 19 12 22 NA 18 … $ DEROG : int 0 NA 0 0 0 NA 0 NA 0 0 … $ DELINQ : int 2 NA 0 2 1 NA 0 2 0 0 … $ OLDEST_CRED_LINE_MTHS: num 122 NA 101 217 123 … $ NUM_RECENT_INQ : int 0 NA 1 1 1 NA 2 0 3 1 … $ NUM_CRED_LINES : int 14 NA 8 12 9 NA 22 10 4 19 … $ DEBT_INC_RATIO : num NA NA 37.1 NA NA 3.7 NA NA NA NA …
set.seed(123) #set random number seed for consistent splitting split = sample.split(home$BAD, SplitRatio = 0.7) #70% in training set, 30% in tes train = subset(home, split == TRUE) test = subset(home, split == FALSE) str(train) str(test)
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [5]:
The plots above suggest that smaller loan amounts may be more susceptible to default (BAD = Yes). Does this match what we would expect?
ggplot(train,aes(x=BAD,y=LOAN_AMT)) + geom_boxplot() #boxplots ggplot(train,aes(x=LOAN_AMT)) + geom_histogram(binwidth=1000) + facet_wrap(~BAD,n
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [6]:
As with loan amount, smaller remaining mortgage amounts seem to lead to a susceptibility to default. Notice the warning messages above the plots. These indicate that there is missing data (missing values for MORTGAGE_REMAIN). These values are not plotted. What do these missing values mean? Do they imply that the customer does not have a mortgage?
Warning message: “Removed 348 rows containing non-finite values (stat_boxplot).”Warning message: “Removed 348 rows containing non-finite values (stat_bin).”
ggplot(train,aes(x=BAD,y=MORTGAGE_REMAIN)) + geom_boxplot() #boxplots ggplot(train,aes(x=MORTGAGE_REMAIN)) + geom_histogram(binwidth=10000) + facet_wra
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [7]:
Lower property values seem to suggest a slight susceptibility to default.
Warning message: “Removed 79 rows containing non-finite values (stat_boxplot).”Warning message: “Removed 79 rows containing non-finite values (stat_bin).”
ggplot(train,aes(x=BAD,y=PROPERTY_VALUE)) + geom_boxplot() #boxplots ggplot(train,aes(x=PROPERTY_VALUE)) + geom_histogram(binwidth=10000) + facet_wrap
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [8]:
We use a table here because we are comparing two categorical variables (BAD and REASON). The first table is simply a count of the number of customers by BAD and REASON. The second table shows the proportion of customers (by REASON) that defaulted or not.
The proportions table suggests a difference in default rate between debt consolidation and home improvement loans.
In [9]:
The table above suggests that there might be a difference in default rate by job type. For example, Sales and Self appear to have a significantly higher default rate than other jobs.
DebtCon HomeImp <NA> No 3183 1384 204 Yes 745 396 48
DebtCon HomeImp <NA> No 0.8103360 0.7775281 0.8095238 Yes 0.1896640 0.2224719 0.1904762
Mgr Office Other ProfExe Sales Self <NA> No 588 823 1834 1064 71 135 256 Yes 179 125 554 212 38 58 23
Mgr Office Other ProfExe Sales Self No 0.76662321 0.86814346 0.76800670 0.83385580 0.65137615 0.69948187 Yes 0.23337679 0.13185654 0.23199330 0.16614420 0.34862385 0.30051813 <NA> No 0.91756272 Yes 0.08243728
table(home$BAD,home$REASON, exclude = NULL) #raw counts in cross-tabulation prop.table(table(home$BAD,home$REASON, exclude = NULL),margin=2) #proportions #exclude = NULL is used in the table code above so that we can see if having a mi #predictive of default
table(home$BAD,home$JOB, exclude = NULL) #raw counts in cross-tabulation prop.table(table(home$BAD,home$JOB, exclude = NULL),margin=2) #proportions
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [10]:
Fewere years at current job seems to slightly predict a higher default rate.
Warning message: “Removed 356 rows containing non-finite values (stat_boxplot).”Warning message: “Removed 356 rows containing non-finite values (stat_bin).”
ggplot(train,aes(x=BAD,y=YRS_JOB)) + geom_boxplot() #boxplots ggplot(train,aes(x=YRS_JOB)) + geom_histogram(binwidth=2) + facet_wrap(~BAD,ncol=
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [11]:
Although not perfectly consistent across all values, increasing numbers of derogatory values appears to lead to higher default rates.
In [12]:
As with derogatory values, higher delinquency seems to be related to higher default rates.
0 1 2 3 4 5 6 7 8 9 10 <NA> No 3773 266 78 15 5 8 5 0 0 0 0 621 Yes 754 169 82 43 18 7 10 8 6 3 2 87
0 1 2 3 4 5 6 No 0.8334438 0.6114943 0.4875000 0.2586207 0.2173913 0.5333333 0.3333333 Yes 0.1665562 0.3885057 0.5125000 0.7413793 0.7826087 0.4666667 0.6666667 7 8 9 10 <NA> No 0.0000000 0.0000000 0.0000000 0.0000000 0.8771186 Yes 1.0000000 1.0000000 1.0000000 1.0000000 0.1228814
0 1 2 3 4 5 6 7 8 10 11 12 13 15 No 3596 432 138 58 32 7 0 0 0 0 0 0 0 0 Yes 583 222 112 71 46 31 27 13 5 2 2 1 1 1 <NA> No 508 Yes 72
0 1 2 3 4 5 6 No 0.8604929 0.6605505 0.5520000 0.4496124 0.4102564 0.1842105 0.0000000 Yes 0.1395071 0.3394495 0.4480000 0.5503876 0.5897436 0.8157895 1.0000000 7 8 10 11 12 13 15 No 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 Yes 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 <NA> No 0.8758621 Yes 0.1241379
table(home$BAD,home$DEROG,exclude=NULL) #raw counts in cross-tabulation prop.table(table(home$BAD,home$DEROG,exclude=NULL),margin=2) #proportions
table(home$BAD,home$DELINQ,exclude=NULL) #raw counts in cross-tabulation prop.table(table(home$BAD,home$DELINQ,exclude=NULL),margin=2) #proportions
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [13]:
Having smaller values for oldest credit line (in months) seems to suggest a higher default rate.
Warning message: “Removed 215 rows containing non-finite values (stat_boxplot).”Warning message: “Removed 215 rows containing non-finite values (stat_bin).”
ggplot(train,aes(x=BAD,y=OLDEST_CRED_LINE_MTHS)) + geom_boxplot() #boxplots ggplot(train,aes(x=OLDEST_CRED_LINE_MTHS)) + geom_histogram(binwidth=10) + facet_
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [14]:
Although not perfectly consistent, as number of inquiries increases, we (generally) see an increase in default rate.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 No 2135 1085 599 285 95 39 27 29 11 5 19 7 0 0 Yes 396 254 181 107 61 36 29 15 11 6 9 3 2 2 14 17 <NA> No 0 0 435 Yes 1 1 75
0 1 2 3 4 5 6 No 0.8435401 0.8103062 0.7679487 0.7270408 0.6089744 0.5200000 0.4821429 Yes 0.1564599 0.1896938 0.2320513 0.2729592 0.3910256 0.4800000 0.5178571 7 8 9 10 11 12 13 No 0.6590909 0.5000000 0.4545455 0.6785714 0.7000000 0.0000000 0.0000000 Yes 0.3409091 0.5000000 0.5454545 0.3214286 0.3000000 1.0000000 1.0000000 14 17 <NA> No 0.0000000 0.0000000 0.8529412 Yes 1.0000000 1.0000000 0.1470588
table(home$BAD,home$NUM_RECENT_INQ,exclude=NULL) #raw counts in cross-tabulation prop.table(table(home$BAD,home$NUM_RECENT_INQ,exclude=NULL),margin=2) #proportion
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [15]:
Number of credit lines does not appear to be particularly predictive.
Warning message: “Removed 158 rows containing non-finite values (stat_boxplot).”Warning message: “Removed 158 rows containing non-finite values (stat_bin).”
ggplot(train,aes(x=BAD,y=NUM_CRED_LINES)) + geom_boxplot() #boxplots ggplot(train,aes(x=NUM_CRED_LINES)) + geom_histogram(binwidth=2) + facet_wrap(~BA
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [16]:
High debt to income ratios seem to predict a higher chance of default. Note that there is a large amount of missing data for this variable.
Classification Tree
Warning message: “Removed 888 rows containing non-finite values (stat_boxplot).”Warning message: “Removed 888 rows containing non-finite values (stat_bin).”
ggplot(train,aes(x=BAD,y=DEBT_INC_RATIO)) + geom_boxplot() #boxplots ggplot(train,aes(x=DEBT_INC_RATIO)) + geom_histogram(binwidth=2) + facet_wrap(~BA
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [17]:
Predictions
In [18]:
pred1 No Yes 3834 338
pred1 No Yes No 3289 545 Yes 51 287
model1 = rpart(BAD ~., train, method = “class”) options(repr.plot.width = 7, repr.plot.height = 6) rpart.plot(model1)
pred1 = predict(model1, type = “class”) #predictions on training set table(pred1) table(pred1, train$BAD)
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/)
Microsoft
shill79
/
In [19]:
The accuracy of our model on the training and testing sets is around 85%. Is this good?
One benchmark approach is develop a naive prediction. A common way to do this is to examine the response variable and see which value is most common. Let’s look at the response variable BAD.
In [20]:
Here the value of “No” is more common than the “Yes” value. We can develop a naive model by predicting that everyone in the dataset will be classified as a “No”. How accurate would we be if we did this?
In [21]:
Our “naive” model accuracy would be about 80%. Our classification tree model improves upon this accuracy.
In [22]:
In [23]:
0.857142857142857
No Yes 3340 832
0.800575263662512
pred2 No Yes 1653 135
pred2 No Yes No 1408 245 Yes 23 112
0.850111856823266
(3289+287)/nrow(train) #accuracy on training set (in-sample accuracy)
table(train$BAD) #examine the values of this variable
3340/nrow(train) #naive accuracy
If we are satisfied with our classification tree model and it is our “final” model, we can then evaluate the model’s performance on our testing set.
pred2 = predict(model1, test, type = “class”) table(pred2) table(pred2, test$BAD)
(1408+112)/nrow(test) #accuracy on test set
Azure Notebooks (/#)
Preview (/help/preview)
My Projects(/shill79/projects#) Help
(https://docs.microsoft.com/en- us/azure/notebooks/) shill79