The purpose of this notebook is to work through the set of exercises provided by Kenneth Train for his Discrete Choice Models class, as worked through by Yves Croissant in *Kenneth Train's Exercises using the mlogit Package*. This will teach me how to apply several discrete choice models, including the multinomial logit and nested logit.
library(mlogit); library(data.table); library(tidyverse); library(scales)
options(repr.plot.width=6, repr.plot.height=3)
data(Heating)
head(Heating)
There are 5 types of heating systems possible: gas central (gc), gas room (gr), electric central (ec), electric room (er), and heat pump (hp).
Other variables include the installation cost (ic.alt) and annual operating cost (oc.alt) of each alternative, as well as the household's income, the age of the household head (agehed), the number of rooms, and the region.
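As a quick look at the raw data, we can tabulate the observed market share of each system; these are the shares we will later compare the model's predictions against.
round(prop.table(table(Heating$depvar)), 3)  # observed share of each chosen system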
We first arrange the data into the long format required by mlogit, with one row per alternative for each household.
ic_vars <- paste0("ic.", c("gc", "gr", "ec", "er", "hp"))
oc_vars <- paste0("oc.", c("gc", "gr", "ec", "er", "hp"))
H <- mlogit.data(Heating, shape="wide", choice="depvar", varying=c(ic_vars, oc_vars))
head(H)
m <- mlogit(depvar ~ ic + oc | 0, H) # "| 0" omits the alternative-specific constants (intercepts)
summary(m)
We see that both ic and oc have negative and significant coefficients, which means that as the cost of a heating system rises, the probability of it being chosen falls. This makes sense.
We can now calculate the predicted probabilities by the model for each alternative heating system.
plot_actual_vs_predicted_shares <- function(model) {
  # Compare the average fitted probability of each alternative with its observed share of choices
  predicted_shares <- apply(model$probabilities, MARGIN = 2, FUN = mean)
  actual_shares <- as.vector(model$freq) / sum(model$freq)
  df <- data.table(predicted = predicted_shares, actual = actual_shares,
                   system = names(predicted_shares))
  df <- melt(df, id.vars = "system")
  ggplot(df, aes(x = system, y = value, col = variable, group = variable)) +
    geom_point(alpha = 0.6) + geom_line(alpha = 0.6) + theme_minimal() +
    scale_y_continuous(labels = percent_format(), limits = c(0, 1)) + ylab("Share of Market")
}
plot_actual_vs_predicted_shares(m)
We can see that the predicted shares deviate noticeably from the actual shares: without alternative-specific constants the model does not reproduce the observed market shares.
The ratio of coefficients usually provides economically meaningful information. The willingness to pay (wtp) through higher installation cost for a one-dollar reduction in operating costs is the ratio of the operating cost coefficient to the installation cost coefficient. We can look at the ratio of coefficients between oc and ic to see if the model is reasonable.
beta_oc = as.numeric(coef(m)["oc"]); print(beta_oc)
beta_ic = as.numeric(coef(m)["ic"]); print(beta_ic)
ratio = beta_oc / beta_ic; message("beta_oc / beta_ic is: ", ratio)
The ratio is 0.73, which means a buyer is indifferent between decreasing the annual operating cost (oc) by \$1 and *increasing* the one-time installation cost (ic) by \$0.73. This cannot be right: oc is a recurring cost, so a buyer should be willing to pay more than \$1 in ic to get a \$1 reduction in oc, i.e. the ratio should be greater than 1.0.
To account for the present value of future operating costs, we can estimate the PV of annual operating costs as:
$$ oc_{pv} = \sum_{t=1}^{T} \frac{oc}{(1+r)^t}$$
where $t$ indexes years, $T$ is the lifetime of the system, and $r$ is the discount rate. As $T$ grows large, the present value approaches:
$$oc_{pv} \approx \frac{oc}{r}$$
This means that for every \$1 reduction in oc, the buyer should be willing to pay $\$ \frac{1}{r}$ in ic upfront. We can calculate the discount value r implied by the findings of our regression and see if the implied discount rate is reasonable.
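As a quick numerical check of the approximation above (using r = 0.12 purely for illustration), the present value of \$1 per year approaches 1/r as the horizon grows:
r <- 0.12
sapply(c(5, 10, 20, 50), function(n_years) sum(1 / (1 + r)^(1:n_years)))  # PV of $1/year over n_years
1 / r  # limiting value, about 8.33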
Suppose utility is proportional to lifetime cost: $$ U = a \, LC $$ where $LC$ is the lifetime cost and $a$ is a constant. Substituting in the components of $LC$:
$$ U = a(ic + \frac{oc}{r}) = a(ic) + \frac{a}{r}(oc) $$
The regression coefficients imply $a = -0.00623$ (the coefficient on ic) and $\frac{a}{r} = -0.00458$ (the coefficient on oc), so the implied discount rate is $$ r = \frac{-0.00623}{-0.00458} = 1.36 = 136\% $$
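We can compute this implied discount rate directly from the fitted model; it is simply the ratio of the two coefficients reported above.
implied_r <- as.numeric(coef(m)["ic"] / coef(m)["oc"])  # a / (a/r) = r
message("Implied discount rate r is: ", implied_r)      # ~ 1.36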
This discount rate is far too high, so clearly something is wrong with our specification.
Estimate a model that imposes the constraint that r = 0.12. Test the hypothesis that r = 0.12.
We first estimate the restricted model, then conduct a likelihood ratio test. The test statistic is:
$$ 2(LL_{U} - LL_{R})$$
which is distributed as chi-squared with degrees of freedom equal to the number of restrictions (here, one).
calc_test_stat <- function(r_value = 0.12) {
  # Restricted model: utility depends only on the lifetime cost lcc = ic + oc / r
  H$lcc <- H$ic + H$oc / r_value
  m_restricted <- mlogit(depvar ~ lcc | 0, H)
  # Likelihood ratio statistic against the unrestricted model m estimated above
  likelihood_test <- lrtest(m, m_restricted)
  test_stat <- 2 * (likelihood_test$LogLik[1] - likelihood_test$LogLik[2])
  return(test_stat)
}
test_stat = calc_test_stat(0.12) # We can set the r value
critical_value = qchisq(0.05, df = 1, lower.tail = FALSE)
message("The test stat is ", test_stat, " and the critical value is ", critical_value, ".")
Since the test statistic exceeds the critical value, the restriction significantly worsens the fit, and we reject the null hypothesis that the discount rate r is 0.12.
We can also test a range of r values.
df = data.frame(r_value = seq(0.01, 2, 0.01), test_stat=NA)
for (i in 1:nrow(df)) df$test_stat[i] <- calc_test_stat(df$r_value[i])
ggplot(df, aes(x=r_value, y=test_stat)) + geom_line() +
geom_hline(yintercept=3.84, col="red") + theme_minimal() + coord_cartesian(ylim=c(0,20)) +
labs(y="Test Statistic", x="r value", title="Test statistic vs different r values")
We can see that the test statistic is 0 at r = 1.36, where the restricted model coincides with the unrestricted one. For r values far from 1.36, the likelihood ratio test rejects the null hypothesis.
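Inverting the test gives an approximate 95% confidence region for r: the set of r values that the likelihood ratio test does not reject.
range(df$r_value[df$test_stat < qchisq(0.95, df = 1)])  # r values not rejected at the 5% level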
Add alternative-specific constants to the model. With J alternatives, at most J−1 alternative-specific constants can be estimated. The coefficients of J−1 constants are interpreted as relative to the Jth alternative. Normalize the constant for the alternative hp to 0.
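Before estimating, here is a small numeric illustration (with arbitrary made-up utilities) of why one constant must be normalized: adding the same amount to every alternative's utility leaves the choice probabilities unchanged, so only differences in the constants, i.e. J−1 of them, are identified.
softmax <- function(v) exp(v) / sum(exp(v))
V <- c(gc = 1.2, gr = 0.4, ec = -0.3, er = 0.1, hp = 0)  # arbitrary illustrative utilities
all.equal(softmax(V), softmax(V + 5))                    # TRUE: a common shift changes nothing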
model2 <- mlogit(depvar~ic+oc, H, reflevel = 'hp')
summary(model2)
Same as before, we can compare the predicted shares of each alternative to the actual shares. They are now exactly equal: including a full set of alternative-specific constants guarantees that the average predicted probabilities match the observed shares.
plot_actual_vs_predicted_shares(model2) + facet_wrap(~variable)
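We can also verify this numerically rather than by eye; the two vectors below should match to within rounding.
round(apply(model2$probabilities, 2, mean), 4)  # average predicted probabilities
round(prop.table(model2$freq), 4)               # observed shares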
As before, we can look at the implied discount rate to see if it is reasonable. We see that the implied discount rate is 22%, which seems high but is much more reasonable than before.
beta_oc = as.numeric(coef(model2)["oc"]); print(beta_oc)
beta_ic = as.numeric(coef(model2)["ic"]); print(beta_ic)
ratio = beta_oc / beta_ic; message("beta_oc / beta_ic is: ", ratio)
r = beta_ic / beta_oc; message("Implied discount rate r is: ", r)
Now we re-estimate the same model but change the reference level to gr. With this shift, the constant for hp is now -0.308, the negative of the previous constant for gr, since the constants are only identified relative to the reference alternative.
update(model2, reflevel="gr")
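As a quick check of the flip, we can compare the two constants directly (indexing the coefficients by name pattern so this does not depend on mlogit's exact naming of the constants):
model2_gr <- update(model2, reflevel = "gr")
coef(model2_gr)[grep("hp", names(coef(model2_gr)))]  # hp constant with gr as the reference
-coef(model2)[grep("gr", names(coef(model2)))]       # negative of the gr constant with hp as the reference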
Now we try a fuller model incorporating individual-specific effects of income. This model is not much better, as the improvement in log likelihood from -1008.2 to -1005.9 is not statistically significant according to a likelihood ratio test. Also, all the income terms are individually insignificant, suggesting that socio-demographic factors are not very important here. This makes sense because it is often the builder of the house, not the buyer, who decides which heating system to install.
model3 <- mlogit(depvar ~ oc + ic|income, H, reflevel="gc")
summary(model3)
lrtest(model2, model3)
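To make the test explicit: the income terms add one coefficient per non-reference alternative, i.e. four restrictions, and the statistic falls short of the chi-squared critical value.
qchisq(0.95, df = 4)       # critical value, about 9.49
2 * (-1005.9 - (-1008.2))  # LR statistic, about 4.6, so we cannot reject the simpler model2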
We recall that the probability of choosing alternative i is given by: $$ P_{in} = \frac{e^{V_{in}}}{\sum_{j} e^{V_{jn}}} $$ and suppose $$ V_{in} = \beta Z_{in} $$ where $Z_{in}$ is the alternative-specific variable we are interested in changing. We would like to calculate the marginal effect of changing $Z_{in}$ on the probability $P_{in}$. Taking the derivative gives: $$ \frac{dP_{in}}{dZ_{in}} = \beta P_{in}(1-P_{in})$$
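As a sanity check on this formula, we can compare it against a finite-difference derivative computed from made-up values (the beta and costs below are purely illustrative, not model output):
softmax <- function(v) exp(v) / sum(exp(v))
beta <- -0.005
Z <- c(800, 700, 900)                    # hypothetical alternative-specific costs
P <- softmax(beta * Z)
eps <- 1e-6
Z_eps <- Z; Z_eps[1] <- Z_eps[1] + eps   # nudge Z for alternative 1
(softmax(beta * Z_eps)[1] - P[1]) / eps  # finite-difference derivative
beta * P[1] * (1 - P[1])                 # analytical formula: the two should agree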
Deriving the analogous formula for an individual-specific variable, one that varies across individuals but not across alternatives (such as income), is less obvious to me. We can, however, obtain the marginal effects in a simple way: calculate the probabilities, adjust the variable, and calculate the probabilities again.
Suppose we are interested in how increasing income by 1 affects the probabilities of choosing the different heating systems.
df_heating <- data.table(H)
df_heating[, alt := factor(alt, levels=c("gc", "ec", "er", "gr", "hp"))]
df_mean <- df_heating[, .(ic = mean(ic), oc = mean(oc), income = mean(income)), by=.(alt)][order(alt)]
df_mean
df_mean gives the mean values of ic, oc and income for each alternative, representing a "typical" household. We then predict probabilities for this household before and after adjusting income, and compare the two.
test_data <- copy(df_mean)  # explicit copy so df_mean itself is not modified
before_prob <- predict(model3, newdata = test_data)
test_data[, income := income + 1]  # increase income by 1 for the same "typical" household
after_prob <- predict(model3, newdata = test_data)
df <- data.frame(before = before_prob, after = after_prob)
df$marginal <- df$after - df$before
df
We can compare the marginal column above to what the effects function of mlogit provides. They are very similar, with slight discrepancies; my guess is that effects() computes the analytical derivative at that point, while we computed a discrete change of a whole unit of income.
effects(model3, covariate="income")
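If the finite-difference explanation above is right, using a much smaller step should line up more closely with effects(); delta below is an arbitrary small step chosen for illustration.
delta <- 0.01
test_small <- copy(df_mean)
test_small[, income := income + delta]
(predict(model3, newdata = test_small) - predict(model3, newdata = df_mean)) / delta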
Another way to view the marginal effects is to plot the predicted probabilities for the typical household over a range of income values. We can see that richer households favour central systems over room heating.
plot_income_effects <- function(model, df_mean, income_values) {
  temp <- copy(df_mean)  # work on a copy so df_mean is untouched
  df <- data.frame()
  for (i in seq_along(income_values)) {
    # Predict probabilities for the typical household at each income level
    current_income <- income_values[i]
    temp[, income := current_income]
    probability <- predict(model, newdata = temp)
    temp_df <- as.data.frame(probability)
    temp_df$income <- current_income
    temp_df$alt <- row.names(temp_df)
    df <- bind_rows(df, temp_df)
  }
  ggplot(df, aes(x = income, y = probability, col = alt)) + geom_line() + theme_minimal() +
    scale_y_continuous(limits = c(0, 1), labels = percent_format()) +
    ggtitle("Marginal effects of Income") +
    scale_color_brewer(palette = 2, type = "qual")
}
plot_income_effects(model3, df_mean, seq(2, 7, 0.5))