Data Preparation

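# Load the packages used throughout the analysis
library(tidyverse)  # read_csv, glimpse, dplyr verbs, ggplot2
library(gridExtra)  # grid.arrange
library(emmeans)    # estimated marginal means

# Read the data and check the structure
payday <- read_csv("payday.csv")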
## Parsed with column specification:
## cols(
##   id = col_double(),
##   credit.score = col_double(),
##   SES = col_double(),
##   loan = col_double(),
##   well.being = col_double(),
##   adverse.credit.event = col_double()
## )
glimpse(payday)
## Observations: 5,000
## Variables: 6
## $ id                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,...
## $ credit.score         <dbl> 590, 440, 470, 480, 570, 550, 550, 580, 540, 560, 410, 540, 570, 5...
## $ SES                  <dbl> 16, 14, 13, 14, 18, 17, 15, 18, 16, 14, 16, 15, 18, 16, 13, 20, 14...
## $ loan                 <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, ...
## $ well.being           <dbl> 5, 4, 3, 2, 7, 7, 4, 7, 5, 6, 4, 3, 5, 6, 4, 7, 5, 3, 1, 3, 3, 1, ...
## $ adverse.credit.event <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, ...
Data Dictionary        Description
id                     Customer ID
credit.score           Credit score of the customer
loan                   Whether or not the customer was given the payday loan (dummy variable)
SES                    Customer's socio-economic status; a higher score means higher status
well.being             Self-reported well-being of the customer (scale of 1-7; 7 is the highest well-being)
adverse.credit.event   Whether there was an adverse credit event in the next year (dummy variable)

Question 1 - Does receiving a payday loan change well-being? If so, how much?
Question 2 - Does taking a payday loan make people more or less likely to experience an adverse credit event (e.g., defaulting on another loan, making late payments on a credit card, etc.)? Why doesn’t it matter whether or not socio-economic status is included?

Section 1 gives only the answer in plain English.
Section 2 documents the whole analysis process.


Question 1 Section 1

Holding socio-economic status (SES) constant, taking a payday loan is expected to increase the well-being score by 1.511 95% CI[1.445-1.577]. This effect of loan on well-being is significant, \(t(4997) = 45.02\), \(p<.0001\) and \(F(1,4997) = 4665.4\), \(p<.0001\). With SES held at its average, the estimated well-being score is 3.23 95% CI[3.18-3.27] for a customer without a loan and 4.74 95% CI[4.70-4.78] for a customer with a loan. The figure above shows this result visually: at any given SES value, most of the blue dots (customers with a loan) lie above most of the red dots (customers without a loan). The fitted lines make the pattern easier to see, as the blue line sits above the red line across all SES values. Customers with a loan therefore tend to have a higher well-being score than customers without one, assuming the same SES.

However, loan status alone does not determine well-being, because SES matters as well. Each additional unit of SES increases well-being by 0.375 95% CI[0.362-0.387], holding loan status constant. This effect of SES on well-being is significant, \(t(4997) = 59.31\), \(p<.0001\) and \(F(1,4997) = 3517.4\), \(p<.0001\). According to the fitted lines, for example, customers with no loan and an SES of 20 are expected to have higher well-being (\(≈ 5.0\)) than customers with a loan and an SES of 10 (\(≈ 2.5\)), even though the former do not have a loan.


Question 2 Section 1

In the figure above, the blue dots (customers with a loan) are concentrated on the lower side, while relatively more red dots (customers without a loan) appear on the upper side. In addition, the blue line sits below the red line. These patterns show that customers with a payday loan are less likely to experience an adverse credit event. According to the statistical analysis, this effect of loan on adverse credit events is significant, \(z = -16.080\), \(p<.0001\) and \(\chi^2(1) = 265.6\), \(p<.0001\). If a customer has no loan, the probability of experiencing an adverse credit event is 0.586 95% CI[0.566-0.606]. If a customer has a loan, the probability of experiencing an adverse credit event decreases to 0.357 95% CI[0.339-0.376].

Across all socio-economic status (SES) values, however, both fitted lines stay roughly horizontal and the \(+\) signs fluctuate randomly in both colours. As a result, no clear pattern emerges for predicting the probability of an adverse credit event from SES. SES does not have a significant effect on adverse credit events, \(z = -0.434\), \(p = 0.6646\) and \(\chi^2(1) = 0.913\), \(p = 0.3394\). Furthermore, the effect of loan on adverse credit events does not differ significantly by SES, \(z = -0.351\), \(p = 0.7259\) and \(\chi^2(1) = 0.123\), \(p = 0.7259\). Therefore, including SES adds nothing meaningful when explaining the relationship between payday loans and adverse credit events.


Question 1 Section 2

Exploratory Data Analysis (EDA)

# Graphical view of each column data of the data set
grid.arrange(ggplot(payday, aes(x=loan)) + geom_bar(width=0.5),
             ggplot(payday, aes(x=well.being)) + geom_bar(),
             ggplot(payday, aes(x=credit.score)) + geom_bar(),
             ggplot(payday, aes(x=SES)) + geom_bar()
)

# Brief graphical view of relationships between the well.being column (target column) and the other columns

## well.being VS loan
mean.wellbeing.loan <- payday %>% group_by(loan) %>% summarize(mean.well = mean(well.being))

ggplot(payday, aes(x=loan, y=well.being)) + geom_jitter(width = 0.07, alpha = 0.4, aes(col = factor(loan))) + geom_point(data = mean.wellbeing.loan, aes(x= loan, y = mean.well), shape=4) + geom_smooth(data = payday, mapping = aes(x=loan, y=well.being), method = "lm", se=FALSE, col ="black")

## well.being VS credit.score
ggplot(payday, aes(x=credit.score, y=well.being)) + geom_jitter(alpha = 0.4) + geom_smooth(method = "lm", se=FALSE)

## well.being VS SES
ggplot(payday, aes(x=SES, y=well.being)) + geom_jitter(alpha=0.4) + geom_smooth(method = "lm", se=FALSE)

  • Taking a loan is expected to increase well-being.
  • A higher credit score is expected to increase well-being.
  • A higher SES is expected to increase well-being.

Model Building

Model 1 (well.being VS loan)

\(\widehat{well.being} = \beta_{Intercept} + \beta_{loan} \times loan\)

# Making a linear regression model using loan as an independent variable
m.wellbeing.by.loan <- lm(well.being ~ loan, data = payday)

# Statistics data
summary(m.wellbeing.by.loan)
## 
## Call:
## lm(formula = well.being ~ loan, data = payday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0548 -1.0548 -0.0548  1.1103  4.1103 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.88967    0.02977   97.06   <2e-16 ***
## loan         2.16518    0.04137   52.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.462 on 4998 degrees of freedom
## Multiple R-squared:  0.354,  Adjusted R-squared:  0.3539 
## F-statistic:  2739 on 1 and 4998 DF,  p-value: < 2.2e-16
cbind(coefficient=coef(m.wellbeing.by.loan), confint(m.wellbeing.by.loan))
##             coefficient    2.5 %   97.5 %
## (Intercept)    2.889672 2.831306 2.948038
## loan           2.165175 2.084064 2.246286
( m.wellbeing.by.loan.emm <- summary(emmeans(m.wellbeing.by.loan, ~loan)) )
##  loan emmean     SE   df lower.CL upper.CL
##     0   2.89 0.0298 4998     2.83     2.95
##     1   5.05 0.0287 4998     5.00     5.11
## 
## Confidence level used: 0.95
anova(m.wellbeing.by.loan)
## Analysis of Variance Table
## 
## Response: well.being
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## loan         1  5852.6  5852.6  2738.6 < 2.2e-16 ***
## Residuals 4998 10680.9     2.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary:

  • Taking a loan is expected to increase the well-being score by 2.165 95% CI[2.084-2.246]. This increase is significantly different from zero, \(t(4998) = 52.33\), \(p<.0001\).
  • If a customer has no loan, the estimated well-being score is 2.89 95% CI[2.83-2.95].
  • If a customer has a loan, the estimated well-being score is 5.05 95% CI[5.00-5.11].
  • Adding loan as an independent variable to a model with only an intercept significantly improves the fit of the model, \(F(1,4998) = 2738.6\), \(p<.0001\).
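The link between the coefficient table and the estimated means can be checked by hand. The following is a minimal sketch that uses only numbers copied from the Model 1 output above; the variable names are illustrative and not part of the original analysis.

```r
# Coefficients and standard error copied from the Model 1 output above
b0 <- 2.889672       # intercept: estimated well-being without a loan
b_loan <- 2.165175   # loan coefficient: difference between the two groups
se_loan <- 0.04137   # standard error of the loan coefficient
df_resid <- 4998     # residual degrees of freedom

# With a 0/1 dummy, the group means are the intercept and intercept + slope
mean_no_loan <- b0
mean_loan <- b0 + b_loan

# 95% confidence interval for the loan effect: estimate +/- t quantile * SE
t_crit <- qt(0.975, df_resid)
ci_loan <- b_loan + c(-1, 1) * t_crit * se_loan

print(round(c(no_loan = mean_no_loan, loan = mean_loan), 2))
print(round(ci_loan, 3))
```

The two means should agree with the emmeans table, and the interval should agree with the confint() output up to rounding of the standard error.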

Model 2 (well.being VS credit.score and loan)

# Checking whether the interaction of credit.score and loan should be considered in the model
m.wellbeing.by.credit.loan.interaction <- lm(well.being ~ credit.score * loan, data = payday)
anova(m.wellbeing.by.credit.loan.interaction)
## Analysis of Variance Table
## 
## Response: well.being
##                     Df Sum Sq Mean Sq   F value  Pr(>F)    
## credit.score         1 8163.1  8163.1 4879.6347 < 2e-16 ***
## loan                 1    7.5     7.5    4.5079 0.03379 *  
## credit.score:loan    1    5.0     5.0    3.0127 0.08268 .  
## Residuals         4996 8357.8     1.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction credit.score:loan is not required, as the effect of loan on well-being does not differ significantly by credit.score, \(F(1,4996) = 3.013\), \(p = 0.08268\). Therefore, model 2 is as follows:

\(\widehat{wellbeing} = \beta_{Intercept} + \beta_{credit.score} \times credit.score + \beta_{loan} \times loan\)

# Making linear regression model using credit.score and loan as independent variables
m.wellbeing.by.credit.loan <- lm(well.being ~ credit.score + loan, data = payday)

# Statistics data
summary(m.wellbeing.by.credit.loan)
## 
## Call:
## lm(formula = well.being ~ credit.score + loan, data = payday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0654 -0.9326  0.0472  0.9346  4.5368 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.6300452  0.2838833 -26.877   <2e-16 ***
## credit.score  0.0234726  0.0006307  37.217   <2e-16 ***
## loan         -0.1533914  0.0722609  -2.123   0.0338 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.294 on 4997 degrees of freedom
## Multiple R-squared:  0.4942, Adjusted R-squared:  0.494 
## F-statistic:  2441 on 2 and 4997 DF,  p-value: < 2.2e-16
cbind(coefficient=coef(m.wellbeing.by.credit.loan), confint(m.wellbeing.by.credit.loan))
##              coefficient       2.5 %      97.5 %
## (Intercept)  -7.63004523 -8.18658107 -7.07350938
## credit.score  0.02347256  0.02223612  0.02470899
## loan         -0.15339137 -0.29505447 -0.01172828
( m.wellbeing.by.credit.loan.emm <- summary(emmeans(m.wellbeing.by.credit.loan, ~credit.score + loan)) )
##  credit.score loan emmean     SE   df lower.CL upper.CL
##           499    0   4.09 0.0416 4997     4.01     4.17
##           499    1   3.94 0.0394 4997     3.86     4.01
## 
## Confidence level used: 0.95
anova(m.wellbeing.by.credit.loan)
## Analysis of Variance Table
## 
## Response: well.being
##                Df Sum Sq Mean Sq  F value  Pr(>F)    
## credit.score    1 8163.1  8163.1 4877.670 < 2e-16 ***
## loan            1    7.5     7.5    4.506 0.03382 *  
## Residuals    4997 8362.8     1.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary:

  • The well-being score is expected to increase by 0.023 95% CI[0.022-0.025] for each additional credit score point, holding loan status constant. This increase is significantly different from zero, \(t(4997) = 37.217\), \(p<.0001\).
  • The well-being score is expected to decrease by 0.153 95% CI[0.012-0.295] for taking a loan, holding credit.score constant. This decrease is significantly different from zero, \(t(4997) = -2.123\), \(p = 0.0338\).
  • Holding credit.score at its average (499), if a customer has no loan, the estimated well-being score is 4.09 95% CI[4.01-4.17].
  • Holding credit.score at its average (499), if a customer has a loan, the estimated well-being score is 3.94 95% CI[3.86-4.01].
  • Adding credit.score as an independent variable to a model with only an intercept significantly improves the fit of the model, \(F(1,4997) = 4877.670\), \(p<.0001\).
  • Adding loan as an independent variable to a model with an intercept and credit.score significantly improves the fit of the model, \(F(1,4997) = 4.506\), \(p = 0.03382\).

Discussion on credit.score

In model 1, loan status is clearly a significant predictor of the well-being score, \(t(4998) = 52.33\), \(p<.0001\). In model 2, however, loan status is a much weaker predictor with a far higher p-value, \(t(4997) = -2.123\), \(p = 0.0338\). Furthermore, the estimated decrease in well-being from having a loan contradicts the EDA and model 1, both of which show a positive association between loan and well-being.

# Squared correlations between the attributes (r is rounded to one decimal place before squaring, so the values are approximate)
round(cor(payday[,c("credit.score", "loan", "SES")]), digits = 1)^2
##              credit.score loan  SES
## credit.score         1.00 0.81 0.16
## loan                 0.81 1.00 0.09
## SES                  0.16 0.09 1.00

This is mainly because of multicollinearity between loan and credit.score. As shown in the table above, credit score and loan are strongly positively correlated (\(r^2 \approx 0.81\), \(N = 5000\)). This correlation follows from the study design: ‘Everyone applied for a payday loan, and those with credit scores of 500 or over received the loan’. Loan status is therefore determined by credit score. As a result, when both loan and credit score are included in a model, credit score explains well-being much better and undermines the significance of loan as a predictor.
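One common way to express this overlap is the variance inflation factor (VIF). For a model with exactly two predictors, the VIF of each predictor is \(1/(1 - r^2)\), where \(r\) is their pairwise correlation. The sketch below plugs in the approximate \(r^2\) from the table above; it is an illustration, not part of the original analysis.

```r
# Approximate squared correlation between credit.score and loan,
# taken from the table above (rounded, so treat the result as a rough guide)
r_squared <- 0.81

# Variance inflation factor for a model with these two predictors
vif <- 1 / (1 - r_squared)

# Standard errors are inflated by the square root of the VIF
se_inflation <- sqrt(vif)

print(round(c(VIF = vif, SE_inflation = se_inflation), 2))
```

A VIF above roughly 5 is often cited as a rule-of-thumb warning sign, although the threshold is a convention rather than a formal test.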

# Graphical analysis showing the multicollinearity effect between credit.score and loan

## Making deciles for credit.score column and make new linear regression model with credit cut and loan
## This data will be used to make figure
credit.deciles <- quantile(pull(payday, credit.score), seq(0,1,.1))
payday.credit.cut <- payday %>% mutate(credit.cut = cut(credit.score, breaks=credit.deciles, include.lowest=TRUE))
m.wellbeing.credit.cut.loan <- lm(well.being ~ credit.cut + loan, data=payday.credit.cut)
m.wellbeing.credit.cut.loan.emm <- summary(emmeans(m.wellbeing.credit.cut.loan, ~credit.cut + loan))

## Visualisation: the impact of credit.score on loan's significance
grid.arrange(
    ggplot(payday.credit.cut, aes(y=well.being, x=loan)) + geom_jitter(width = 0.2) + geom_smooth(method = "lm", se=FALSE) + labs(x= "Loan", y="Well being"),
    ggplot(payday.credit.cut, aes(y=well.being, x=loan, col=credit.cut)) + geom_jitter(width = 0.2) + geom_line(data=m.wellbeing.credit.cut.loan.emm, aes(y=emmean, x=loan, col=credit.cut)) + guides(col=guide_legend(title="credit cut")) + labs(x= "Loan", y="Well being"),
    ncol=2 , widths=c(1,1.4), top = "The effect of credit score on loan"
)

This graph shows how the apparent effect of loan changes once credit score is considered. The left plot corresponds to model 1. When loan is the sole independent variable, it is a significant predictor of well-being, as shown by the blue regression line. The right plot breaks this down by deciles of credit score. When credit score is held within any given decile, loan has a much smaller effect and the fitted lines are almost horizontal. That is, loan is a far weaker predictor once credit score is also taken into account.

Credit score and loan are therefore collinear. For this reason, it is better not to include credit score as an independent variable when answering the question ‘Does receiving a payday loan change well-being?’. If credit score is included alongside loan, it dominates loan as a predictor, and the model is better suited to answering the question ‘How does credit score affect well-being of customers?’.
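The mechanism can be illustrated with simulated data. The sketch below is hypothetical and does not use the payday data: the sample size, cutoff, and effect size are made-up assumptions chosen so that the outcome depends on credit score only, while the loan is assigned purely by a credit score cutoff.

```r
# Hypothetical simulation, not the payday data: every name and number here is made up
set.seed(1)
n <- 5000
sim_credit <- round(rnorm(n, mean = 500, sd = 50))
sim_loan <- as.integer(sim_credit >= 500)       # loan assigned purely by the cutoff
sim_wellbeing <- 0.02 * sim_credit + rnorm(n)   # outcome depends on credit score only

# Loan as the only predictor: it absorbs the credit score effect
naive_effect <- unname(coef(lm(sim_wellbeing ~ sim_loan))["sim_loan"])

# Loan and credit score together: the loan coefficient shrinks towards zero
adjusted_effect <- unname(coef(lm(sim_wellbeing ~ sim_credit + sim_loan))["sim_loan"])

print(round(c(naive = naive_effect, adjusted = adjusted_effect), 2))
```

In this setup the loan coefficient is large when loan is the only predictor and close to zero once credit score is added, which mirrors the pattern seen between model 1 and model 2.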

Discussion on SES

# Checking whether the interaction of loan and SES is required in the model
m.wellbeing.by.loan.SES.interaction <- lm(well.being ~ loan * SES, data = payday)
anova(m.wellbeing.by.loan.SES.interaction)
## Analysis of Variance Table
## 
## Response: well.being
##             Df Sum Sq Mean Sq   F value Pr(>F)    
## loan         1 5852.6  5852.6 4664.8430 <2e-16 ***
## SES          1 4412.4  4412.4 3516.9253 <2e-16 ***
## loan:SES     1    0.5     0.5    0.3787 0.5383    
## Residuals 4996 6268.0     1.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction loan:SES is not required, as the effect of loan on well-being does not differ significantly by SES, \(F(1,4996) = 0.3787\), \(p = 0.5383\). Therefore, the model with loan and SES is as follows:

\(\widehat{wellbeing} = \beta_{Intercept} + \beta_{loan} \times loan + \beta_{SES} \times SES\)

# Making linear regression model using loan and SES as independent variables
m.wellbeing.by.loan.SES <- lm(well.being ~ loan + SES, data = payday)

# Statistics data
summary(m.wellbeing.by.loan.SES)
## 
## Call:
## lm(formula = well.being ~ loan + SES, data = payday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8640 -0.7393  0.0114  0.7717  3.8963 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.394822   0.091977  -26.04   <2e-16 ***
## loan         1.511017   0.033563   45.02   <2e-16 ***
## SES          0.374876   0.006321   59.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.12 on 4997 degrees of freedom
## Multiple R-squared:  0.6209, Adjusted R-squared:  0.6207 
## F-statistic:  4091 on 2 and 4997 DF,  p-value: < 2.2e-16
cbind(coefficient=coef(m.wellbeing.by.loan.SES), confint(m.wellbeing.by.loan.SES))
##             coefficient      2.5 %     97.5 %
## (Intercept)  -2.3948217 -2.5751367 -2.2145068
## loan          1.5110173  1.4452184  1.5768161
## SES           0.3748761  0.3624844  0.3872679
( m.wellbeing.by.loan.SES.emm <- summary(emmeans(m.wellbeing.by.loan.SES, ~ SES + loan)) )
##  SES loan emmean     SE   df lower.CL upper.CL
##   15    0   3.23 0.0235 4997     3.18     3.27
##   15    1   4.74 0.0226 4997     4.70     4.78
## 
## Confidence level used: 0.95
anova(m.wellbeing.by.loan.SES)
## Analysis of Variance Table
## 
## Response: well.being
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## loan         1 5852.6  5852.6  4665.4 < 2.2e-16 ***
## SES          1 4412.4  4412.4  3517.4 < 2.2e-16 ***
## Residuals 4997 6268.5     1.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary:

  • The well-being score is expected to increase by 1.511 95% CI[1.445-1.577] for having a loan, holding SES constant. This increase is significantly different from zero, \(t(4997) = 45.02\), \(p<.0001\).
  • The well-being score is expected to increase by 0.375 95% CI[0.362-0.387] for each additional unit of SES, holding loan status constant. This increase is significantly different from zero, \(t(4997) = 59.31\), \(p<.0001\).
  • Holding SES at its average, if a customer has no loan, the estimated well-being score is 3.23 95% CI[3.18-3.27].
  • Holding SES at its average, if a customer has a loan, the estimated well-being score is 4.74 95% CI[4.70-4.78].
  • Adding loan as an independent variable to a model with only an intercept significantly improves the fit of the model, \(F(1,4997) = 4665.4\), \(p<.0001\).
  • Adding SES as an independent variable to a model with an intercept and loan significantly improves the fit of the model, \(F(1,4997) = 3517.4\), \(p<.0001\).

SES also has a significant effect on well-being when it is included in a model with loan. According to the correlation matrix above, SES is only weakly correlated with loan (\(r^2 = 0.09\), \(N = 5000\)). There is therefore little sign of multicollinearity between SES and loan. That is, SES predicts well-being largely independently of loan status and hardly affects the significance of loan.
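Because the model is additive, a predicted well-being score for any combination of loan and SES follows directly from the three coefficients. The sketch below uses the coefficients reported in the output above; the helper function predict_wellbeing is a hypothetical name introduced only for illustration.

```r
# Coefficients copied from the loan + SES model output above
b0 <- -2.394822
b_loan <- 1.511017
b_ses <- 0.374876

# Hypothetical helper: predicted well-being under the additive model
predict_wellbeing <- function(loan, ses) {
  b0 + b_loan * loan + b_ses * ses
}

pred_no_loan_ses15 <- predict_wellbeing(loan = 0, ses = 15)
pred_loan_ses15 <- predict_wellbeing(loan = 1, ses = 15)
pred_no_loan_ses20 <- predict_wellbeing(loan = 0, ses = 20)
pred_loan_ses10 <- predict_wellbeing(loan = 1, ses = 10)

print(round(c(no_loan_ses15 = pred_no_loan_ses15,
              loan_ses15 = pred_loan_ses15,
              no_loan_ses20 = pred_no_loan_ses20,
              loan_ses10 = pred_loan_ses10), 2))
```

The two predictions at an SES of 15 should be close to the emmeans values, and the last two show that a customer without a loan but with a high SES can still have a higher predicted score than a customer with a loan and a low SES.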

# Building model to be used to make figure
m.wellbeing.by.loan.factor.SES <- lm(well.being ~ loan + factor(SES), data = payday)
m.wellbeing.loan.factor.SES.emm <- summary(emmeans(m.wellbeing.by.loan.factor.SES, ~ SES + loan))

# Visualisation: the impact of SES on loan's significance 
grid.arrange(
    ggplot(payday, aes(y=well.being, x=loan)) + geom_jitter(width = 0.2) + geom_smooth(method = "lm", se=FALSE) + labs(x= "Loan", y="Well being"),
    ggplot(payday, aes(y=well.being, x=loan, col=factor(SES))) + geom_jitter(width = 0.2) + geom_line(data=m.wellbeing.loan.factor.SES.emm, aes(y=emmean, x=loan, col=factor(SES))) + guides(col=guide_legend(title="SES")) + labs(x= "Loan", y="Well being"),
    ncol=2 , widths=c(1,1.4), top = "The effect of SES on loan"
)

As shown in the figure above, the regression lines in the left (blue) and right (various colours) plots have almost the same slope. This shows that loan remains a significant predictor when SES is added as an independent variable and held constant at any given value, unlike the previous case of credit score and loan. Therefore, both loan and SES can be included in the linear model to explain the effect of loan on well-being more precisely. Considering more attributes reflects the data more realistically, and the estimated effect of loan on well-being in the two-attribute model is closer to what would be seen among customers of similar SES.


Question 2 Section 2

Exploratory Data Analysis (EDA)

# Brief graphical view of relationships between the adverse.credit.event column (target column) and the other columns

## adverse.credit.event VS loan
mean.adverse.loan <- payday %>% group_by(loan) %>% summarize(mean.adverse = mean(adverse.credit.event))

ggplot(payday, aes(x=loan, y=adverse.credit.event, col = factor(loan))) + geom_jitter(alpha = 0.4, width = 0.2, height = 0.2) + geom_point(data = mean.adverse.loan, aes(x=loan, y=mean.adverse, col = factor(loan)), size = 3) + geom_line(data = mean.adverse.loan, aes(x=loan, y=mean.adverse), col = "black")

## adverse.credit.event VS credit.score
ggplot(payday, aes(x=credit.score, y=adverse.credit.event)) + geom_jitter(alpha = 0.3, height = 0.1) + geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = FALSE)

## adverse.credit.event VS SES
ggplot(payday, aes(x=SES, y=adverse.credit.event)) + geom_jitter(alpha = 0.3, height = 0.1, width=0.5) + geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = FALSE)

  • Taking a loan is expected to decrease the probability of adverse credit events.
  • A higher credit score is expected to decrease the probability of adverse credit events.
  • A higher SES is expected to decrease the probability of adverse credit events.

Model Building

The dependent variable, adverse credit event, is a binary (dummy) variable. Therefore, logistic regression can be used to find the best-fitting model.

Initial Model

\(\log(\frac{p}{1-p})= \beta_0 + \beta_{loan} loan\)

\(p = Probability \ of \ Adverse \ Credit \ Event\)

# Making logistic regression model with loan as an independent variable
m.adverse.by.loan.binom <- glm(adverse.credit.event ~ loan, family=binomial, data=payday)

# Statistics data 
summary(m.adverse.by.loan.binom)
## 
## Call:
## glm(formula = adverse.credit.event ~ loan, family = binomial, 
##     data = payday)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3282  -0.9396  -0.9396   1.0338   1.4355  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.34772    0.04135   8.409   <2e-16 ***
## loan        -0.93659    0.05825 -16.080   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6910.2  on 4999  degrees of freedom
## Residual deviance: 6644.6  on 4998  degrees of freedom
## AIC: 6648.6
## 
## Number of Fisher Scoring iterations: 4
cbind(coef(m.adverse.by.loan.binom),confint(m.adverse.by.loan.binom))
## Waiting for profiling to be done...
##                             2.5 %     97.5 %
## (Intercept)  0.3477171  0.2668526  0.4289591
## loan        -0.9365854 -1.0510014 -0.8226662
anova(m.adverse.by.loan.binom, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: adverse.credit.event
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                  4999     6910.2              
## loan  1    265.6      4998     6644.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
( m.adverse.by.loan.binom.emm <- summary(emmeans(m.adverse.by.loan.binom, ~loan, type="response")) )
##  loan  prob      SE  df asymp.LCL asymp.UCL
##     0 0.586 0.01003 Inf     0.566     0.606
##     1 0.357 0.00942 Inf     0.339     0.376
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

Summary:

  • The log odds (\(\log(\frac{p}{1-p})\)) of an adverse credit event decrease by 0.937 95% CI[0.823-1.051] for having a loan. This effect is significant, \(z = -16.080\), \(p < .0001\) and \(\chi^2(1) = 265.6\), \(p<.0001\).
  • If a customer has no loan, the probability of an adverse credit event (\(p\)) is 0.586 95% CI[0.566-0.606].
  • If a customer has a loan, the probability of an adverse credit event (\(p\)) is 0.357 95% CI[0.339-0.376].

According to the model summary, taking a loan decreases the likelihood of experiencing an adverse credit event. However, further analysis is required to see whether this holds after adding other independent variables to the model.
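The log-odds coefficients can be converted back to probabilities and to an odds ratio by hand. The following sketch uses the coefficients from the initial model output above together with base R's plogis(), which is the inverse of the logit link; the variable names are illustrative.

```r
# Coefficients copied from the initial logistic model output above
b0 <- 0.3477171      # log odds of an adverse credit event without a loan
b_loan <- -0.9365854 # change in log odds for having a loan

# Inverse logit: p = 1 / (1 + exp(-log_odds))
p_no_loan <- plogis(b0)
p_loan <- plogis(b0 + b_loan)

# Odds ratio for loan versus no loan
odds_ratio <- exp(b_loan)

print(round(c(p_no_loan = p_no_loan, p_loan = p_loan, odds_ratio = odds_ratio), 3))
```

The two probabilities should match the emmeans table, and the odds ratio expresses the same effect as the odds of an adverse credit event with a loan relative to the odds without one.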

Discussion on credit.score

# Checking the result after adding credit.score and credit.score:loan to the initial model as independent variables
m.adverse.by.credit.loan.binom <- glm(adverse.credit.event ~ credit.score * loan, family=binomial, data=payday)

summary(m.adverse.by.credit.loan.binom)
## 
## Call:
## glm(formula = adverse.credit.event ~ credit.score * loan, family = binomial, 
##     data = payday)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5277  -1.0904  -0.7696   1.1129   1.6983  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        4.453848   0.685759   6.495 8.32e-11 ***
## credit.score      -0.009150   0.001523  -6.007 1.89e-09 ***
## loan               0.802381   1.023256   0.784    0.433    
## credit.score:loan -0.001565   0.002065  -0.758    0.448    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6910.2  on 4999  degrees of freedom
## Residual deviance: 6547.3  on 4996  degrees of freedom
## AIC: 6555.3
## 
## Number of Fisher Scoring iterations: 4
cbind(coef(m.adverse.by.credit.loan.binom),confint(m.adverse.by.credit.loan.binom))
## Waiting for profiling to be done...
##                                       2.5 %       97.5 %
## (Intercept)        4.453848178  3.114390496  5.803207115
## credit.score      -0.009149602 -0.012145371 -0.006173168
## loan               0.802380837 -1.202611359  2.809058614
## credit.score:loan -0.001564860 -0.005612113  0.002481966
anova(m.adverse.by.credit.loan.binom, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: adverse.credit.event
## 
## Terms added sequentially (first to last)
## 
## 
##                   Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                               4999     6910.2             
## credit.score       1   362.30      4998     6547.9   <2e-16 ***
## loan               1     0.08      4997     6547.8   0.7805    
## credit.score:loan  1     0.57      4996     6547.3   0.4485    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary:

  • For customers without a loan, the log odds (\(\log(\frac{p}{1-p})\)) of an adverse credit event decrease by 0.009 95% CI[0.006-0.012] for each additional credit score point. This effect is significantly different from zero, \(z = -6.007\), \(p < .0001\).
  • At a credit.score of zero (the reference point for this coefficient in the interaction model), the log odds (\(\log(\frac{p}{1-p})\)) of an adverse credit event increase by 0.802 95% CI[-1.203-2.809] for taking a loan. This effect is not significantly different from zero, \(z = 0.784\), \(p=0.433\).
  • For customers with a loan, the decrease in log odds (\(\log(\frac{p}{1-p})\)) per credit score point is larger by 0.002 95% CI[-0.002-0.006] (the credit.score:loan interaction). This effect is not significantly different from zero, \(z = -0.758\), \(p=0.448\).
  • Adding credit.score as an independent variable to a model with only an intercept significantly improves the fit of the model, \(\chi^2(1)=362.30\), \(p<.0001\).
  • Adding loan as an independent variable to a model with an intercept and credit.score does not significantly improve the fit of the model, \(\chi^2(1)=0.08\), \(p=0.7805\).
  • Adding credit.score:loan as an independent variable to a model with an intercept, credit.score and loan does not significantly improve the fit of the model, \(\chi^2(1) = 0.57\), \(p = 0.4485\).

Because of the multicollinearity between credit score and loan discussed in Question 1 Section 2, adding credit score to the initial model dilutes the significance of loan as a predictor. In addition, the effect of loan on adverse credit events does not differ significantly by credit score. Therefore, it is better to leave credit score out of the model when estimating the relationship between loan and adverse credit events. Otherwise, the model is better suited to answering the question ‘How does credit score affect the likelihood of experiencing adverse credit event?’.
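The z values and p-values in the coefficient table are Wald tests, that is, each estimate divided by its standard error and compared with a standard normal distribution. A minimal sketch that reproduces them from the rounded estimates and standard errors printed above (so small rounding differences are expected):

```r
# Estimates and standard errors copied from the credit.score * loan output above
est <- c(credit.score = -0.009150, loan = 0.802381, `credit.score:loan` = -0.001565)
se <- c(credit.score = 0.001523, loan = 1.023256, `credit.score:loan` = 0.002065)

# Wald z statistic and two-sided p-value from the standard normal distribution
z <- est / se
p <- 2 * pnorm(-abs(z))

print(round(cbind(z = z, p = p), 3))
```

The large standard error of the loan coefficient relative to its estimate is what makes loan non-significant in this model.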

Discussion on SES

# Checking the result after adding SES and loan:SES to the initial model as independent variables
m.adverse.by.loan.SES.binom <- glm(adverse.credit.event ~ loan * SES, family=binomial, data=payday)

summary(m.adverse.by.loan.SES.binom)
## 
## Call:
## glm(formula = adverse.credit.event ~ loan * SES, family = binomial, 
##     data = payday)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3567  -0.9561  -0.9214   1.0392   1.4914  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.447556   0.234026   1.912   0.0558 .
## loan        -0.795326   0.353494  -2.250   0.0245 *
## SES         -0.007080   0.016331  -0.434   0.6646  
## loan:SES    -0.008152   0.023251  -0.351   0.7259  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6910.2  on 4999  degrees of freedom
## Residual deviance: 6643.6  on 4996  degrees of freedom
## AIC: 6651.6
## 
## Number of Fisher Scoring iterations: 4
cbind(coef(m.adverse.by.loan.SES.binom),confint(m.adverse.by.loan.SES.binom))
## Waiting for profiling to be done...
##                                2.5 %      97.5 %
## (Intercept)  0.447555859 -0.01057266  0.90718156
## loan        -0.795326469 -1.48884549 -0.10289929
## SES         -0.007080487 -0.03911666  0.02492799
## loan:SES    -0.008151671 -0.05374256  0.03741882
anova(m.adverse.by.loan.SES.binom, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: adverse.credit.event
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                      4999     6910.2             
## loan      1  265.597      4998     6644.6   <2e-16 ***
## SES       1    0.913      4997     6643.7   0.3394    
## loan:SES  1    0.123      4996     6643.6   0.7259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary:

  • At an SES of zero (the reference point for this coefficient in the interaction model), the log odds (\(\log(\frac{p}{1-p})\)) of an adverse credit event decrease by 0.795 95% CI[0.103-1.489] for taking a loan. This effect is significantly different from zero, \(z = -2.250\), \(p=0.0245\).
  • For customers without a loan, the log odds (\(\log(\frac{p}{1-p})\)) of an adverse credit event decrease by 0.007 95% CI[-0.025-0.039] for each additional unit of SES. This effect is not significantly different from zero, \(z = -0.434\), \(p=0.6646\).
  • For customers with a loan, the decrease in log odds (\(\log(\frac{p}{1-p})\)) per unit of SES is larger by 0.008 95% CI[-0.037-0.054] (the loan:SES interaction). This effect is not significantly different from zero, \(z = -0.351\), \(p=0.7259\).
  • Adding loan as an independent variable to a model with only an intercept significantly improves the fit of the model, \(\chi^2(1)=265.597\), \(p<.0001\).
  • Adding SES as an independent variable to a model with an intercept and loan does not significantly improve the fit of the model, \(\chi^2(1)=0.913\), \(p=0.3394\).
  • Adding loan:SES as an independent variable to a model with an intercept, loan and SES does not significantly improve the fit of the model, \(\chi^2(1)=0.123\), \(p=0.7259\).

SES and loan:SES do not have significant effects on adverse credit events in the presence of loan. Furthermore, adding SES and loan:SES to the initial model raises the AIC from 6648.6 to 6651.6, so the extra terms do not justify the added complexity. This result implies that SES has no significant effect either on adverse credit events or on the effect of loan. Therefore, it does not matter whether SES is included when answering the question of ‘Whether taking a payday loan makes people more or less likely to experience an adverse credit event’: the effect of loan on adverse credit events does not differ by SES, and loan remains significant even after SES is included.
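Both parts of this argument can be checked from the analysis of deviance table. Each added term is tested by comparing its drop in deviance with a chi-squared distribution on 1 degree of freedom, and for a 0/1 response the AIC equals the residual deviance plus twice the number of coefficients. A minimal sketch using the values printed above (variable names are illustrative):

```r
# Drops in deviance copied from the analysis of deviance table above (1 df each)
dev_drop <- c(loan = 265.597, SES = 0.913, `loan:SES` = 0.123)

# Likelihood ratio test: upper tail of the chi-squared distribution
p_lrt <- pchisq(dev_drop, df = 1, lower.tail = FALSE)

# AIC = residual deviance + 2 * number of coefficients (holds for a 0/1 response)
aic_loan_only <- 6644.6 + 2 * 2  # intercept + loan
aic_with_ses <- 6643.6 + 2 * 4   # intercept + loan + SES + loan:SES

print(signif(p_lrt, 4))
print(c(loan_only = aic_loan_only, with_SES = aic_with_ses))
```

The deviance falls by only about 1 when SES and loan:SES are added, which is less than the penalty of 4 for two extra coefficients, so the AIC goes up.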