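# Load the packages used throughout: tidyverse (read_csv, glimpse, %>%, ggplot),
# gridExtra (grid.arrange) and emmeans (estimated marginal means)
library(tidyverse)
library(gridExtra)
library(emmeans)
# Read the data and check the structure
payday <- read_csv("payday.csv")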
## Parsed with column specification:
## cols(
##   id = col_double(),
##   credit.score = col_double(),
##   SES = col_double(),
##   loan = col_double(),
##   well.being = col_double(),
##   adverse.credit.event = col_double()
## )
glimpse(payday)
## Observations: 5,000
## Variables: 6
## $ id                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,...
## $ credit.score         <dbl> 590, 440, 470, 480, 570, 550, 550, 580, 540, 560, 410, 540, 570, 5...
## $ SES                  <dbl> 16, 14, 13, 14, 18, 17, 15, 18, 16, 14, 16, 15, 18, 16, 13, 20, 14...
## $ loan                 <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, ...
## $ well.being           <dbl> 5, 4, 3, 2, 7, 7, 4, 7, 5, 6, 4, 3, 5, 6, 4, 7, 5, 3, 1, 3, 3, 1, ...
## $ adverse.credit.event <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, ...
| Variable | Description |
|---|---|
| id | Customer ID |
| credit.score | Customer’s credit score |
| SES | Customer’s socio-economic status; a higher score means higher status |
| loan | Whether or not the customer was given the payday loan (dummy variable) |
| well.being | Customer’s self-reported well-being on a 1-7 scale; 7 is the highest well-being |
| adverse.credit.event | Whether there was an adverse credit event in the next year (dummy variable) |
Question 1 - Does receiving a payday loan change well-being? If so, how much?
Question 2 - Does taking a payday loan make people more or less likely to experience an adverse credit event (e.g., defaulting on another loan, making late payments on a credit card, etc.)? Why doesn’t it matter whether or not socio-economic status is included?
Section 1 gives only the answer in plain English.
Section 2 gives the whole process of analysis.
Well-being score is expected to increase by 1.511, 95% CI [1.445-1.577], for customers who take a payday loan, holding socio-economic status (SES) constant. This effect of loan on well-being is significant, \(t(4997) = 45.02\), \(p<.0001\) and \(F(1,4997) = 4665.4\), \(p<.0001\). If a customer has no loan, holding SES at its average, the estimated well-being score is 3.23, 95% CI [3.18-3.27]. If a customer has a loan, holding SES at its average, the estimated well-being score increases to 4.74, 95% CI [4.70-4.78]. The above figure shows the result visually. At any given SES value, most of the blue dots (customers with a loan) are plotted higher than most of the red dots (customers without a loan). The pattern is easier to see in the line graphs, as the blue line sits above the red line across all SES values. Therefore, customers who have a loan are more likely to have a higher well-being score than those who do not, assuming they have the same SES.
However, when SES is allowed to vary together with loan, it is not possible to claim simply that customers with a payday loan have higher well-being than those without. This is because each additional point of SES also increases well-being by 0.375, 95% CI [0.362-0.387], holding loan status constant. This effect of SES on well-being is significant, \(t(4997) = 59.31\), \(p<.0001\) and \(F(1,4997) = 3517.4\), \(p<.0001\). According to the line graphs, for example, customers with no loan and an SES of 20 are more likely to have higher well-being (\(≈ 5.0\)) than customers with a loan and an SES of 10 (\(≈ 2.5\)), even though the former do not have a loan.
In the figure above, the blue dots (customers with a loan) are more concentrated on the lower side, while there are relatively more red dots (customers without a loan) on the upper side. In addition, the blue line is placed lower than the red line. These patterns show that customers with a payday loan are less likely to experience an adverse credit event. According to the statistical analysis, this effect of loan on adverse credit event is significant, \(z = -16.080\), \(p<.0001\) and \(\chi^2(1)=265.6\), \(p<.0001\) (residual deviance 6644.6 on 4998 degrees of freedom). If a customer has no loan, the probability of experiencing an adverse credit event is 0.586, 95% CI [0.566-0.606]. If a customer has a loan, the probability of experiencing an adverse credit event decreases to 0.357, 95% CI [0.339-0.376].
Across all Socio Economic Status (SES) values, however, both line graphs stay roughly horizontal and the \(+\) signs fluctuate randomly in both colours. As a result, there is no clear pattern for predicting the probability of an adverse credit event from SES. SES does not have a significant effect on adverse credit event, \(z = -0.434\), \(p = 0.6646\) and \(\chi^2(1)=0.913\), \(p = 0.3394\). Furthermore, the effect of loan on adverse credit event does not differ significantly by SES, \(z = -0.351\), \(p = 0.7259\) and \(\chi^2(1) = 0.123\), \(p = 0.7259\). Therefore, including SES adds nothing significant when explaining the relationship between payday loan and adverse credit event.
# Graphical view of the distribution of each variable in the data set
grid.arrange(ggplot(payday, aes(x=loan)) + geom_bar(width=0.5),
             ggplot(payday, aes(x=well.being)) + geom_bar(),
             ggplot(payday, aes(x=credit.score)) + geom_bar(),
             ggplot(payday, aes(x=SES)) + geom_bar()
)
# Brief graphical view of relationships between the well.being column (target column) and the other columns
## well.being VS loan
mean.wellbeing.loan <- payday %>% group_by(loan) %>% summarize(mean.well = mean(well.being))
ggplot(payday, aes(x=loan, y=well.being)) + geom_jitter(width = 0.07, alpha = 0.4, aes(col = factor(loan))) + geom_point(data = mean.wellbeing.loan, aes(x= loan, y = mean.well), shape=4) + geom_smooth(data = payday, mapping = aes(x=loan, y=well.being), method = "lm", se=FALSE, col ="black")
## well.being VS credit.score
ggplot(payday, aes(x=credit.score, y=well.being)) + geom_jitter(alpha = 0.4) + geom_smooth(method = "lm",
                                                                                  se=FALSE)
## well.being VS SES
ggplot(payday, aes(x=SES, y=well.being)) + geom_jitter(alpha=0.4) + geom_smooth(method = "lm", se=FALSE)
\(\widehat{well.being} = \beta_{Intercept} + \beta_{loan} \times loan\)
# Making a linear regression model using loan as an independent variable
m.wellbeing.by.loan <- lm(well.being ~ loan, data = payday)
# Statistics data
summary(m.wellbeing.by.loan)
## 
## Call:
## lm(formula = well.being ~ loan, data = payday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0548 -1.0548 -0.0548  1.1103  4.1103 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.88967    0.02977   97.06   <2e-16 ***
## loan         2.16518    0.04137   52.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.462 on 4998 degrees of freedom
## Multiple R-squared:  0.354,  Adjusted R-squared:  0.3539 
## F-statistic:  2739 on 1 and 4998 DF,  p-value: < 2.2e-16
cbind(coefficient=coef(m.wellbeing.by.loan), confint(m.wellbeing.by.loan))
##             coefficient    2.5 %   97.5 %
## (Intercept)    2.889672 2.831306 2.948038
## loan           2.165175 2.084064 2.246286
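As a quick check on the interval above, the confidence limits for the loan coefficient can be rebuilt by hand from the estimate, its standard error and the residual degrees of freedom printed by `summary()`. This is only a sketch using the rounded values shown in the output; the variable names are illustrative and not part of the original analysis.

```r
# Values copied (rounded) from the summary() output of m.wellbeing.by.loan
est <- 2.16518   # loan coefficient
se  <- 0.04137   # its standard error
df  <- 4998      # residual degrees of freedom

# 95% confidence interval: estimate +/- t quantile * standard error
ci <- est + c(-1, 1) * qt(0.975, df) * se
print(round(ci, 3))
```

Up to rounding of the inputs, the result should agree with the `confint()` row for loan.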
( m.wellbeing.by.loan.emm <- summary(emmeans(m.wellbeing.by.loan, ~loan)) )
##  loan emmean     SE   df lower.CL upper.CL
##     0   2.89 0.0298 4998     2.83     2.95
##     1   5.05 0.0287 4998     5.00     5.11
## 
## Confidence level used: 0.95
anova(m.wellbeing.by.loan)
## Analysis of Variance Table
## 
## Response: well.being
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## loan         1  5852.6  5852.6  2738.6 < 2.2e-16 ***
## Residuals 4998 10680.9     2.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary:
# Checking whether the interaction of credit.score and loan should be considered in the model
m.wellbeing.by.credit.loan.interaction <- lm(well.being ~ credit.score * loan, data = payday)
anova(m.wellbeing.by.credit.loan.interaction)
## Analysis of Variance Table
## 
## Response: well.being
##                     Df Sum Sq Mean Sq   F value  Pr(>F)    
## credit.score         1 8163.1  8163.1 4879.6347 < 2e-16 ***
## loan                 1    7.5     7.5    4.5079 0.03379 *  
## credit.score:loan    1    5.0     5.0    3.0127 0.08268 .  
## Residuals         4996 8357.8     1.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction credit.score:loan is not required, as the effect of loan on well-being does not differ significantly by credit.score, \(F(1,4996) = 3.013\), \(p = 0.08268\). Therefore, model 2 is as follows:
\(\widehat{well.being} = \beta_{Intercept} + \beta_{credit.score} \times credit.score + \beta_{loan} \times loan\)
# Making linear regression model using credit.score and loan as independent variables
m.wellbeing.by.credit.loan <- lm(well.being ~ credit.score + loan, data = payday)
# Statistics data
summary(m.wellbeing.by.credit.loan)
## 
## Call:
## lm(formula = well.being ~ credit.score + loan, data = payday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0654 -0.9326  0.0472  0.9346  4.5368 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.6300452  0.2838833 -26.877   <2e-16 ***
## credit.score  0.0234726  0.0006307  37.217   <2e-16 ***
## loan         -0.1533914  0.0722609  -2.123   0.0338 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.294 on 4997 degrees of freedom
## Multiple R-squared:  0.4942, Adjusted R-squared:  0.494 
## F-statistic:  2441 on 2 and 4997 DF,  p-value: < 2.2e-16
cbind(coefficient=coef(m.wellbeing.by.credit.loan), confint(m.wellbeing.by.credit.loan))
##              coefficient       2.5 %      97.5 %
## (Intercept)  -7.63004523 -8.18658107 -7.07350938
## credit.score  0.02347256  0.02223612  0.02470899
## loan         -0.15339137 -0.29505447 -0.01172828
( m.wellbeing.by.credit.loan.emm <- summary(emmeans(m.wellbeing.by.credit.loan, ~credit.score + loan)) )
##  credit.score loan emmean     SE   df lower.CL upper.CL
##           499    0   4.09 0.0416 4997     4.01     4.17
##           499    1   3.94 0.0394 4997     3.86     4.01
## 
## Confidence level used: 0.95
anova(m.wellbeing.by.credit.loan)
## Analysis of Variance Table
## 
## Response: well.being
##                Df Sum Sq Mean Sq  F value  Pr(>F)    
## credit.score    1 8163.1  8163.1 4877.670 < 2e-16 ***
## loan            1    7.5     7.5    4.506 0.03382 *  
## Residuals    4997 8362.8     1.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary:
In model 1, loan status is clearly a significant predictor of well-being score, \(t(4998) = 52.33\), \(p<.0001\). In model 2, however, loan becomes a much weaker predictor with a far larger p-value, \(t(4997) = -2.123\), \(p = 0.0338\). Furthermore, the estimated decrease in well-being from having a loan contradicts the EDA and model 1, both of which show a positive association between loan and well-being.
# Correlation check between the attributes
# (note: correlations are rounded to 1 decimal place first, then squared)
round(cor(payday[,c("credit.score", "loan", "SES")]), digits = 1)^2
##              credit.score loan  SES
## credit.score         1.00 0.81 0.16
## loan                 0.81 1.00 0.09
## SES                  0.16 0.09 1.00
This is mainly because of multicollinearity between loan and credit.score. As shown in the table above, credit score and loan are strongly positively correlated (\(r \approx 0.9\) after rounding to one decimal place, so \(r^2 \approx 0.81\), \(N = 5000\)). This correlation is explained by the fact that ‘Everyone applied for a payday loan, and those with credit scores of 500 or over received the loan’. In other words, loan status is decided by the credit score. Therefore, when both loan and credit score are included in a model, credit score explains well-being much better and undermines the significance of loan as a predictor.
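The mechanism described above can be illustrated with a small simulation. This sketch does not use the payday data: the credit scores, the coefficient 0.02 and the absence of any true loan effect are all assumptions made purely for illustration. Only the threshold rule (loan granted at a score of 500 or over) comes from the text.

```r
# Simulated illustration only: these are NOT the payday data.
set.seed(1)
n      <- 5000
credit <- round(rnorm(n, mean = 500, sd = 50))  # hypothetical credit scores
loan   <- as.integer(credit >= 500)             # loan granted at 500 or over
well   <- 0.02 * credit + rnorm(n)              # assumed: well-being depends on credit only

# Correlation between the two predictors
r_credit_loan <- cor(credit, loan)

# Loan coefficient with and without credit score in the model
b_loan_alone <- unname(coef(lm(well ~ loan))["loan"])
b_loan_adj   <- unname(coef(lm(well ~ credit + loan))["loan"])

print(round(c(r = r_credit_loan,
              loan_alone = b_loan_alone,
              loan_adjusted = b_loan_adj), 3))
```

In this simulated setting loan looks like a strong predictor on its own, but its coefficient shrinks towards zero once credit score is added, because loan carries almost no information beyond the score that determines it.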
# Graphical analysis showing the multicollinearity effect between credit.score and loan
## Cut credit.score into deciles and fit a new linear regression model with credit.cut and loan
## The estimated marginal means from this model are used to draw the figure
credit.deciles <- quantile(pull(payday, credit.score), seq(0,1,.1))
payday.credit.cut <- payday %>% mutate(credit.cut = cut(credit.score, breaks=credit.deciles, include.lowest=TRUE))
m.wellbeing.credit.cut.loan <- lm(well.being ~ credit.cut + loan, data=payday.credit.cut)
m.wellbeing.credit.cut.loan.emm <- summary(emmeans(m.wellbeing.credit.cut.loan, ~credit.cut + loan))
## Visualisation: the impact of credit.score on loan's significance
grid.arrange(
    ggplot(payday.credit.cut, aes(y=well.being, x=loan)) + geom_jitter(width = 0.2) + geom_smooth(method = "lm", se=FALSE) + labs(x= "Loan", y="Well being"),
    ggplot(payday.credit.cut, aes(y=well.being, x=loan, col=credit.cut)) + geom_jitter(width = 0.2) + geom_line(data=m.wellbeing.credit.cut.loan.emm, aes(y=emmean, x=loan, col=credit.cut)) + guides(col=guide_legend(title="credit cut")) + labs(x= "Loan", y="Well being"),
    ncol=2 , widths=c(1,1.4), top = "The effect of credit score on loan"
)
This graph shows how the significance of loan changes once credit score is considered. The left plot corresponds to model 1. When loan is the sole independent variable, it is a significant predictor of well-being, as shown by the blue regression line. The right plot breaks this down by deciles of credit score. When credit score is held constant within any given decile, loan has a much smaller effect, with almost horizontal regression lines. That is, loan is no longer a strong predictor once credit score is also taken into account.
Credit score and loan are clearly collinear. Therefore, to answer the question ‘Does receiving a payday loan change well-being?’ more precisely, it is better not to include credit score as an independent variable. If credit score is included alongside loan, it dominates loan as a predictor, and the model is then better suited to answering the question ‘How does credit score affect well-being of customers?’.
# Checking whether the interaction of loan and SES is required in the model
m.wellbeing.by.loan.SES.interaction <- lm(well.being ~ loan * SES, data = payday)
anova(m.wellbeing.by.loan.SES.interaction)
## Analysis of Variance Table
## 
## Response: well.being
##             Df Sum Sq Mean Sq   F value Pr(>F)    
## loan         1 5852.6  5852.6 4664.8430 <2e-16 ***
## SES          1 4412.4  4412.4 3516.9253 <2e-16 ***
## loan:SES     1    0.5     0.5    0.3787 0.5383    
## Residuals 4996 6268.0     1.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction loan:SES is not required, as the effect of loan on well-being does not differ significantly by SES, \(F(1,4996) = 0.3787\), \(p = 0.5383\). Therefore, the model with loan and SES is as follows:
\(\widehat{well.being} = \beta_{Intercept} + \beta_{loan} \times loan + \beta_{SES} \times SES\)
# Making linear regression model using loan and SES as independent variables
m.wellbeing.by.loan.SES <- lm(well.being ~ loan + SES, data = payday)
# Statistics data
summary(m.wellbeing.by.loan.SES)
## 
## Call:
## lm(formula = well.being ~ loan + SES, data = payday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8640 -0.7393  0.0114  0.7717  3.8963 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.394822   0.091977  -26.04   <2e-16 ***
## loan         1.511017   0.033563   45.02   <2e-16 ***
## SES          0.374876   0.006321   59.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.12 on 4997 degrees of freedom
## Multiple R-squared:  0.6209, Adjusted R-squared:  0.6207 
## F-statistic:  4091 on 2 and 4997 DF,  p-value: < 2.2e-16
cbind(coefficient=coef(m.wellbeing.by.loan.SES), confint(m.wellbeing.by.loan.SES))
##             coefficient      2.5 %     97.5 %
## (Intercept)  -2.3948217 -2.5751367 -2.2145068
## loan          1.5110173  1.4452184  1.5768161
## SES           0.3748761  0.3624844  0.3872679
( m.wellbeing.by.loan.SES.emm <- summary(emmeans(m.wellbeing.by.loan.SES, ~ SES + loan)) )
##  SES loan emmean     SE   df lower.CL upper.CL
##   15    0   3.23 0.0235 4997     3.18     3.27
##   15    1   4.74 0.0226 4997     4.70     4.78
## 
## Confidence level used: 0.95
anova(m.wellbeing.by.loan.SES)
## Analysis of Variance Table
## 
## Response: well.being
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## loan         1 5852.6  5852.6  4665.4 < 2.2e-16 ***
## SES          1 4412.4  4412.4  3517.4 < 2.2e-16 ***
## Residuals 4997 6268.5     1.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary:
SES also has a significant effect on well-being when it is considered together with loan in a model. According to the correlation matrix above, SES has a low correlation with loan (\(r \approx 0.3\) after rounding, \(r^2 \approx 0.09\), \(N = 5000\)). Therefore, there seems to be no problematic multicollinearity between SES and loan. That is, SES predicts well-being largely independently of loan status and hardly affects the significance of loan.
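The estimated marginal means reported for this model can be reproduced by plugging the coefficients into the model equation. The sketch below assumes the reference SES value is 15, as displayed in the emmeans output (the displayed value may be rounded), and uses the coefficients as printed by `coef()`. The variable names are illustrative.

```r
# Coefficients copied from the output of m.wellbeing.by.loan.SES
b <- c(intercept = -2.3948217, loan = 1.5110173, SES = 0.3748761)

# Assumption: emmeans evaluates SES at (approximately) 15, the value it displays
ses_ref <- 15

# Predicted well-being without (0) and with (1) a loan at the reference SES
pred <- unname(b["intercept"] + b["loan"] * c(0, 1) + b["SES"] * ses_ref)
names(pred) <- c("no_loan", "loan")
print(round(pred, 2))
```

The difference between the two predictions is the loan coefficient itself, which is why the gap between the two emmeans rows is about 1.51.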
# Building model to be used to make figure
m.wellbeing.by.loan.factor.SES <- lm(well.being ~ loan + factor(SES), data = payday)
m.wellbeing.loan.factor.SES.emm <- summary(emmeans(m.wellbeing.by.loan.factor.SES, ~ SES + loan))
# Visualisation: the impact of SES on loan's significance 
grid.arrange(
    ggplot(payday, aes(y=well.being, x=loan)) + geom_jitter(width = 0.2) + geom_smooth(method = "lm", se=FALSE) + labs(x= "Loan", y="Well being"),
    ggplot(payday, aes(y=well.being, x=loan, col=factor(SES))) + geom_jitter(width = 0.2) + geom_line(data=m.wellbeing.loan.factor.SES.emm, aes(y=emmean, x=loan, col=factor(SES))) + guides(col=guide_legend(title="SES")) + labs(x= "Loan", y="Well being"),
    ncol=2 , widths=c(1,1.4), top = "The effect of SES on loan"
)
As shown in the figure above, the regression lines in the left plot (blue) and the right plot (various colours) have almost the same slope. This shows that loan remains a significant predictor when SES is added as an independent variable and held constant at any given value, unlike the earlier case of credit score and loan. Therefore, both loan and SES can be included in the linear model to explain the effect of loan on well-being more precisely. The model then estimates the loan effect while accounting for differences in SES, which gives a more realistic figure than the model with loan alone.
# Brief graphical view of relationships between the adverse.credit.event column (target column) and the other columns
## adverse.credit.event VS loan
mean.adverse.loan <- payday %>% group_by(loan) %>% summarize(mean.adverse = mean(adverse.credit.event))
ggplot(payday, aes(x=loan, y=adverse.credit.event, col = factor(loan))) + geom_jitter(alpha = 0.4, width = 0.2, height = 0.2) + geom_point(data = mean.adverse.loan, aes(x=loan, y=mean.adverse, col = factor(loan)), size = 3) + geom_line(data = mean.adverse.loan, aes(x=loan, y=mean.adverse), col = "black")
## adverse.credit.event VS credit.score
ggplot(payday, aes(x=credit.score, y=adverse.credit.event)) + geom_jitter(alpha = 0.3, height = 0.1) + geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = FALSE)
## adverse.credit.event VS SES
ggplot(payday, aes(x=SES, y=adverse.credit.event)) + geom_jitter(alpha = 0.3, height = 0.1, width=0.5) + geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = FALSE)
The dependent variable, adverse credit event, is binary (a dummy variable). Therefore, logistic regression is used to find the best-fitting model.
\(\log\left(\frac{p}{1-p}\right) = \beta_{Intercept} + \beta_{loan} \times loan\)
\(p = \text{probability of an adverse credit event}\)
# Making logistic regression model with loan as an independent variable
m.adverse.by.loan.binom <- glm(adverse.credit.event ~ loan, family=binomial, data=payday)
# Statistics data 
summary(m.adverse.by.loan.binom)
## 
## Call:
## glm(formula = adverse.credit.event ~ loan, family = binomial, 
##     data = payday)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3282  -0.9396  -0.9396   1.0338   1.4355  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.34772    0.04135   8.409   <2e-16 ***
## loan        -0.93659    0.05825 -16.080   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6910.2  on 4999  degrees of freedom
## Residual deviance: 6644.6  on 4998  degrees of freedom
## AIC: 6648.6
## 
## Number of Fisher Scoring iterations: 4
cbind(coef(m.adverse.by.loan.binom),confint(m.adverse.by.loan.binom))
## Waiting for profiling to be done...
##                             2.5 %     97.5 %
## (Intercept)  0.3477171  0.2668526  0.4289591
## loan        -0.9365854 -1.0510014 -0.8226662
anova(m.adverse.by.loan.binom, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: adverse.credit.event
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                  4999     6910.2              
## loan  1    265.6      4998     6644.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
( m.adverse.by.loan.binom.emm <- summary(emmeans(m.adverse.by.loan.binom, ~loan, type="response")) )
##  loan  prob      SE  df asymp.LCL asymp.UCL
##     0 0.586 0.01003 Inf     0.566     0.606
##     1 0.357 0.00942 Inf     0.339     0.376
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale
Summary:
According to the model summary, taking a loan decreases the likelihood of experiencing an adverse credit event. However, further analysis is required, adding other independent variables to the model.
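The probabilities reported by emmeans can be recovered from the logit coefficients by applying the inverse logit, which base R provides as `plogis()`. This is a sketch using the rounded coefficients from the summary output; the variable names are illustrative.

```r
# Coefficients copied (rounded) from the summary() of m.adverse.by.loan.binom
b0     <- 0.34772    # intercept: log-odds of an adverse event without a loan
b_loan <- -0.93659   # change in log-odds for customers with a loan

# Inverse logit turns log-odds into probabilities
p_no_loan <- plogis(b0)
p_loan    <- plogis(b0 + b_loan)

# Exponentiating the coefficient gives the odds ratio for loan vs no loan
odds_ratio <- exp(b_loan)

print(round(c(p_no_loan = p_no_loan, p_loan = p_loan, odds_ratio = odds_ratio), 3))
```

The two probabilities should agree with the emmeans rows (0.586 and 0.357), and an odds ratio below 1 expresses the same reduction on the odds scale.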
# Checking the result after adding credit.score and credit.score:loan to the initial model as independent variables
m.adverse.by.credit.loan.binom <- glm(adverse.credit.event ~ credit.score * loan, family=binomial, data=payday)
summary(m.adverse.by.credit.loan.binom)
## 
## Call:
## glm(formula = adverse.credit.event ~ credit.score * loan, family = binomial, 
##     data = payday)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5277  -1.0904  -0.7696   1.1129   1.6983  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        4.453848   0.685759   6.495 8.32e-11 ***
## credit.score      -0.009150   0.001523  -6.007 1.89e-09 ***
## loan               0.802381   1.023256   0.784    0.433    
## credit.score:loan -0.001565   0.002065  -0.758    0.448    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6910.2  on 4999  degrees of freedom
## Residual deviance: 6547.3  on 4996  degrees of freedom
## AIC: 6555.3
## 
## Number of Fisher Scoring iterations: 4
cbind(coef(m.adverse.by.credit.loan.binom),confint(m.adverse.by.credit.loan.binom))
## Waiting for profiling to be done...
##                                       2.5 %       97.5 %
## (Intercept)        4.453848178  3.114390496  5.803207115
## credit.score      -0.009149602 -0.012145371 -0.006173168
## loan               0.802380837 -1.202611359  2.809058614
## credit.score:loan -0.001564860 -0.005612113  0.002481966
anova(m.adverse.by.credit.loan.binom, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: adverse.credit.event
## 
## Terms added sequentially (first to last)
## 
## 
##                   Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                               4999     6910.2             
## credit.score       1   362.30      4998     6547.9   <2e-16 ***
## loan               1     0.08      4997     6547.8   0.7805    
## credit.score:loan  1     0.57      4996     6547.3   0.4485    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary:
Because of the multicollinearity between credit score and loan discussed in question 1, section 2, adding credit score to the initial model dilutes the significance of loan as a predictor. In addition, the effect of loan on adverse credit event does not differ significantly by credit score. Therefore, it is better to leave credit score out of the model when estimating the relationship between loan and adverse credit event. Otherwise, the model is better suited to answering the question ‘How does credit score affect the likelihood of experiencing adverse credit event?’.
# Checking the result after adding SES and loan:SES to the initial model as independent variables
m.adverse.by.loan.SES.binom <- glm(adverse.credit.event ~ loan * SES, family=binomial, data=payday)
summary(m.adverse.by.loan.SES.binom)
## 
## Call:
## glm(formula = adverse.credit.event ~ loan * SES, family = binomial, 
##     data = payday)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3567  -0.9561  -0.9214   1.0392   1.4914  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.447556   0.234026   1.912   0.0558 .
## loan        -0.795326   0.353494  -2.250   0.0245 *
## SES         -0.007080   0.016331  -0.434   0.6646  
## loan:SES    -0.008152   0.023251  -0.351   0.7259  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6910.2  on 4999  degrees of freedom
## Residual deviance: 6643.6  on 4996  degrees of freedom
## AIC: 6651.6
## 
## Number of Fisher Scoring iterations: 4
cbind(coef(m.adverse.by.loan.SES.binom),confint(m.adverse.by.loan.SES.binom))
## Waiting for profiling to be done...
##                                2.5 %      97.5 %
## (Intercept)  0.447555859 -0.01057266  0.90718156
## loan        -0.795326469 -1.48884549 -0.10289929
## SES         -0.007080487 -0.03911666  0.02492799
## loan:SES    -0.008151671 -0.05374256  0.03741882
anova(m.adverse.by.loan.SES.binom, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: adverse.credit.event
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                      4999     6910.2             
## loan      1  265.597      4998     6644.6   <2e-16 ***
## SES       1    0.913      4997     6643.7   0.3394    
## loan:SES  1    0.123      4996     6643.6   0.7259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary:
SES and loan:SES do not have significant effects on adverse credit event in the presence of loan. Furthermore, adding SES and loan:SES to the initial model makes the fit worse by AIC (6651.6 versus 6648.6 for the loan-only model). This implies that SES has no significant effect either on adverse credit event or on the effect of loan. Therefore, it does not matter whether SES is included when answering the question ‘Whether taking a payday loan makes people more or less likely to experience an adverse credit event’. The effect of loan on adverse credit event does not differ by SES and remains significant after SES is included.
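The p-values and AIC values behind this conclusion can be checked directly from the numbers in the deviance tables and model summaries. The sketch below relies only on the rounded values printed above, so small rounding differences are expected. The variable names are illustrative.

```r
# Drops in deviance (each on 1 df) copied from the analysis of deviance table
dev_drop_SES <- 0.913   # adding SES to the loan-only model
dev_drop_int <- 0.123   # adding loan:SES on top of loan + SES

# Likelihood-ratio tests: upper tail of a chi-squared distribution with 1 df
p_SES <- pchisq(dev_drop_SES, df = 1, lower.tail = FALSE)
p_int <- pchisq(dev_drop_int, df = 1, lower.tail = FALSE)

# For these binary-response models, AIC = residual deviance + 2 * number of coefficients
aic_loan_only <- 6644.6 + 2 * 2   # intercept + loan
aic_loan_SES  <- 6643.6 + 2 * 4   # intercept + loan + SES + loan:SES

print(round(c(p_SES = p_SES, p_int = p_int), 4))
print(c(aic_loan_only = aic_loan_only, aic_loan_SES = aic_loan_SES))
```

The tiny drops in deviance do not offset the penalty for the two extra coefficients, which is why the larger model has the higher AIC.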