# Econometrics Refresher

It has been a bit quieter on my blog over the exam period and the Christmas break, but now university is back on again and I cannot wait to start my second semester! As this is our consolidation and development week, it seems like the perfect time to prepare for my classes. The semester is going to be my last one before graduation, so I can pick from a list of electives.

So, besides Behavioural Economics and Industrial Economics, I get the chance to do Applied Econometrics. The course builds upon the Econometrics component from my third-year classes in Micro- and Macroeconomics and introduces:

1. Models with limited dependent variables
2. Panel data sets
3. Topics involving time series data (volatility and cointegration).

One of the class prerequisites is a change in notation to the matrix form of econometrics. This is a very useful step, as many econometric problems have a multivariate character, for which the matrix form is much more convenient. While I have studied matrices in my advanced mathematics course as well as through MIT OpenCourseWare, I have yet to relate them to econometrics, and this is what I am up to this week! Ben Lambert has a great graduate course in econometrics on his YouTube channel, which covers topics from the undergraduate course in matrix formulation.

It is an indispensable source for developing your understanding and getting a feel for this representation. In particular, the course starts with an introduction to the matrix formulation of econometrics, followed by an example of it. It then continues with the differentiation with respect to a vector as well as the derivation of OLS estimators in matrix form. So, if you want or need to refresh your econometrics skills or develop them to the next level, this is for you.
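As a rough sketch of where that derivation ends up (my own summary in standard textbook notation, not a transcript of the course), write the linear model for $n$ observations and $k$ regressors as $y = X\beta + \varepsilon$, and assume that $X^{\top}X$ is invertible:

$$
\begin{aligned}
S(\beta) &= (y - X\beta)^{\top}(y - X\beta) = y^{\top}y - 2\beta^{\top}X^{\top}y + \beta^{\top}X^{\top}X\beta \\
\frac{\partial S(\beta)}{\partial \beta} &= -2X^{\top}y + 2X^{\top}X\beta = 0 \\
\hat{\beta} &= (X^{\top}X)^{-1}X^{\top}y
\end{aligned}
$$

The middle line is where differentiation with respect to a vector comes in: the derivative of $\beta^{\top}X^{\top}y$ is $X^{\top}y$ and the derivative of $\beta^{\top}X^{\top}X\beta$ is $2X^{\top}X\beta$.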

Jasmin

# Voting Behaviour in the United Kingdom – Evidence from the European Social Survey 2012

My paper Applied Econometrics, which I am taking at Auckland University of Technology whilst being on student exchange, included a major study of Voting Behaviour across European countries. The assignment brief was as follows:

Using data from the 2012 European Social Survey write a research report on the factors associated with an individual’s likelihood to vote.

Each student could pick one of the European countries. I decided to focus on the UK and was pleased to conduct empirical work in STATA as a part of university. In particular, the goal of the assignment was to become proficient in the use of econometric techniques when dealing with a categorical variable. In this case it was voter turnout where people decided to vote (Y=1) or decided to abstain (Y=0). We were given the choice of either using a Logit or a Probit model. Before defining our own model it was recommended to carry out a literature review on the determinants of voting behaviour in order to include all significant variables that are commonly used. Thereafter the study should include an overview on the chosen model and its methodology as well as a discussion of the empirical results. In the discussion the focus should be on testing the results for their trustworthiness and any bias. While I did not correct for heteroskedasticity with the use of robust standard errors (which is the main criticism in my feedback), I tested for goodness of fit, model misspecification errors, multicollinearity and influential observations. The study should conclude with a brief summary of the main findings.
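For reference, the two candidate models can be written in their standard textbook form. The notation is my own assumption rather than the wording of the assignment brief: $x_i$ collects the explanatory variables for respondent $i$, $\Lambda$ is the logistic distribution function and $\Phi$ is the standard normal distribution function.

$$
\text{Logit: } P(Y_i = 1 \mid x_i) = \Lambda(x_i^{\top}\beta) = \frac{\exp(x_i^{\top}\beta)}{1 + \exp(x_i^{\top}\beta)}
\qquad
\text{Probit: } P(Y_i = 1 \mid x_i) = \Phi(x_i^{\top}\beta)
$$

Both map the linear index $x_i^{\top}\beta$ into the unit interval, so the predicted probability of voting always lies between 0 and 1.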

Overall I am really proud of my very first own ‘study’. I put an immense amount of effort into it in order to make my work perfect and flawless. This is also why I decided to publish my work on my blog. In addition, I hope to be able to use it as a writing sample when applying for Economics graduate school (besides my bachelor’s thesis).

In retrospect, I have learned a lot over the course of my Applied Econometrics paper and I am very thankful that my home university let me choose my fourth paper freely and that, in turn, Auckland University of Technology approved my choice of Applied Econometrics as an elective. I knew that it would be a challenging paper but it has been a very rewarding experience throughout.

I hope that you enjoy reading my work! The abstract is included below and the complete study is available from the link in the citation that follows it.

Jasse

Voting Behaviour in the United Kingdom: Evidence from the European Social Survey 2012

Abstract

Voting is often taken as an indicator for the state of a country’s democratic political system. The study therefore examines voting behaviour in the United Kingdom using data from the European Social Survey 6.0 conducted in 2012. It develops a model based on rational voter theory as well as sociological theories discussed in the literature while controlling for demographic factors. Dimensions included in the sociological approach of the model are deprivation, social capital and civic voluntarism.

The study concludes that British women are significantly more likely to vote than men after controlling for other factors. Other significant demographic factors are family status and age. Age does not exhibit a curvilinear pattern due to life-cycle effects. The deprivation dimension (ethnicity and immigration status) does not have a significant influence in the study while the social capital dimension does turn out to be significant. Trade union membership, religious denomination (Roman Catholic and Anglican) and a composite trust variable measuring one’s trust in others have a significant positive effect on voter turnout in the UK. Civic voluntarism is the most influential dimension for determining participation in the British general elections. Medium and low income households are significantly less likely to vote, ceteris paribus. In the model only tertiary education is a significant positive predictor compared to respondents with primary education only. There is no significant difference between primary and secondary education. Further vocational education does become significant once controlling for influential observations. Political interest and partisanship remain two of the most significant predictors of voting behaviour at the margin. The study concludes that there is a significant relationship between voter mobilisation and a person’s wealth and non-material endowment. This is of concern to ensure representative policies and civic engagement in the future and might also explain recent turnout declines.  A limitation of the study is the low explanatory power of the overall model even if variables are significant at the margin. This is taken as evidence for rational voter theory while being more problematic for sociological approaches.

Droege, J. (2016). Voting Behaviour in the UK: Evidence from the European Social Survey 2012. Auckland University of Technology, Auckland. Retrieved from: https://theaspiringeconomist.files.wordpress.com/2016/05/voting-behaviour-in-the-united-kingdom.pdf

# Smoking Behaviour – Evidence from the European Social Survey 2014/15

In today’s post I want to take a closer look at how to interpret the multinomial logistic regression output of STATA and for this I am going to use the example of smoking behaviour. In Econometrics the multinomial logit model (MLM) is used if the dependent variable on the left-hand side of the equation has several discrete alternatives as opposed to a binary variable (0=no, 1=yes). Furthermore, the independent variables on the right-hand side can be based on chooser-specific data (e.g. gender, education or income) but also choice-specific data. However, I am going to focus only on chooser-specific data today. My model is therefore going to explain how the respondents’ characteristics affect their choice of an alternative among a set of alternatives.

In particular the model is going to establish which characteristics of respondents are determinants of smoking behaviour. For this I have obtained the European Social Survey 7.0 which was conducted in 2014. I am using edition 1.0 which was released on 28 October 2015. Among the aims of the ESS are monitoring changes in public attitudes as well as developing a series of European social and attitudinal indicators. The seventh round of the survey covered 22 countries and 28,221 individuals. The survey consists of an hour-long face-to-face interview with core sections as well as rotating modules. The core sections cover the socio-demographic profile of the respondents as well as things like social trust, political interest, socio-political orientations and human values. In the seventh round the two rotating modules covered (1) social inequalities in health and their determinants and (2) respondents’ attitudes towards immigration. One of the questions in the first rotating module assesses respondents’ smoking behaviour. Respondents were asked which of the following descriptions best described their smoking behaviour:

1. I smoke daily
2. I smoke but not every day
3. I don’t smoke but I used to
4. I have only smoked a few times
5. I have never smoked

This allows me to construct a detailed multinomial logit model in which the first and second answer define current smokers, the third answer equals former daily smokers and the fourth former party smokers, while the fifth answer to the question defines respondents that have never smoked.

In terms of the independent variables, I include a range of demographic control factors, namely age, gender, ethnicity, immigration and family status, education, income and employment status. In addition, I include two dummy variables for mild and high depression as well as four dummy variables for various levels of alcohol consumption. For more information regarding the coding of the variables, please refer to the coding overview below.

Coding overview

• Smoking: never smoked=1, former party smoker (I have only smoked a few times)=2, former daily smoker (I used to smoke)=3, current smoker (I smoke daily or I smoke but not every day)=4
• Low_educ (ref.): 1 if lower secondary education or less, 0 otherwise
• Medium_educ: 1 if upper secondary education, 0 otherwise
• High_educ: 1 if post-school education (vocational or tertiary), 0 otherwise
• Low_income: 1 if income in 1st – 3rd decile, 0 otherwise
• Medium_income: 1 if income in 4th – 6th decile, 0 otherwise
• High_income: 1 if income in 7th – 10th decile, 0 otherwise
• Age: age in years
• Female: 1 if female, 0 otherwise
• Employed: 1 if respondent was employed in the past 7 days, 0 otherwise
• Children: 1 if children currently living at home, 0 otherwise
• Minority: 1 if respondent belongs to a minority ethnic group in country, 0 otherwise
• Immigrant: 1 if respondent was not born in the country, 0 otherwise
• No depression (ref.): 1 if respondent felt depressed none or almost none of the time in the past week, 0 otherwise
• Mild depression: 1 if respondent felt depressed some of the time in the past week, 0 otherwise
• High depression: 1 if respondent felt depressed most of the time or all/ almost all of the time in the past week, 0 otherwise
• Daily drinker: 1 if respondent consumes alcohol every day, 0 otherwise
• Frequent drinker: 1 if respondent consumes alcohol several times a week, 0 otherwise
• Weekly drinker: 1 if respondent consumes alcohol once a week, 0 otherwise
• Monthly drinker: 1 if respondent consumes alcohol 2-3 times a month or once a month, 0 otherwise
• No/Infrequent drinker (ref.): 1 if respondent consumes alcohol less than once a month or never, 0 otherwise

Summary statistics

Before proceeding to the estimation of the MLM, let’s take a quick look at the summary statistics to ensure that there are no coding errors. With the help of the – sum – command STATA produces an overview of the mean, standard deviation as well as minimum and maximum for each of the variables. The table shows no anomalies except for age. When examining the outlier of 114, one can immediately see that this age is likely to be a coding error, because the respondent is in paid work and not retired. Therefore, age is recoded to missing for this observation. There are still 5 observations with an age of 100 or older. However, a closer look at their responses suggests that they are valid as all of them are retired.
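The plausibility check described above can be sketched in a few lines of Python. This is only an illustration with made-up respondents, not the ESS data or the STATA commands I actually ran; the field names `age` and `paid_work` and the helper `clean_age` are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical respondents (not ESS data): age and whether they report paid work.
respondents = [
    {"age": 23, "paid_work": True},
    {"age": 49, "paid_work": True},
    {"age": 67, "paid_work": False},
    {"age": 101, "paid_work": False},  # plausible: very old and not working
    {"age": 114, "paid_work": True},   # implausible: very old but in paid work
]


def clean_age(respondent, threshold=100):
    """Return None (missing) for respondents aged 100+ who report paid work."""
    if respondent["age"] >= threshold and respondent["paid_work"]:
        return None
    return respondent["age"]


ages = [clean_age(r) for r in respondents]
valid = [a for a in ages if a is not None]

# Same statistics that a summary table would report for the cleaned variable.
summary = {
    "n": len(valid),
    "mean": mean(valid),
    "sd": stdev(valid),
    "min": min(valid),
    "max": max(valid),
}
print(summary)
```

The rule deliberately keeps the retired centenarian and only sets the working 114-year-old to missing, which mirrors the reasoning in the paragraph above.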

In my sample 42 percent of the respondents said that they have never smoked while around 10.8 percent are former party smokers, 23.4 percent are former daily smokers and the remaining 23.8 percent are currently smoking. In terms of education, 24.42 percent have lower secondary education or less (low), 37.84 percent have upper secondary education (medium) and 37.74 percent have post-school education (high). Almost 30 percent of the respondents fall into the lower three income deciles (low income). Around 32 percent fall into the 4th to 6th income decile (medium income) while the remainder of the respondents (38 percent) fall into the high income category. It should be noted that the decile cut-offs vary between countries, so that these income categories will differ in what exact amount of money they represent among countries. However, the interpretation does not change, because respondents compared themselves to national standards and have low, medium or high incomes compared to the population in their respective country.

The age of respondents ranges from 14 to 104 years. The median age is 49 and therefore very close to the mean. Around 68 percent of the respondents in the sample are between 30 and 68 years old. There are slightly more females (52 percent) in the sample than males. 53 percent of the respondents said that they were employed during the last 7 days. Almost 33 percent of the respondents have children living at home, 5.65 percent belong to a minority, and 10.75 percent were born in another country. 67.74 percent of the respondents said that they felt depressed none or almost none of the time in the past week. On the other hand, 26.55 percent felt depressed some of the time (mild depression) and 5.7 percent felt depressed most of the time or all/almost all of the time (high depression). In the sample, 6.34 percent consume alcohol every day (daily drinker). 16.61 percent drink alcohol several times a week (frequent drinker), while 19.56 percent of the respondents drink alcohol once a week (weekly drinker). Monthly drinkers (2-3 times a month or once a month) are 24.26 percent of the respondents while the remainder are infrequent drinkers that either drink less than once a month or never (33.22 percent).

Regression results

I will present the regression results both in the form of coefficients and as relative risk ratios (RRR). Let’s begin with the coefficients and a general analysis of my model.

First, it can be seen that the model includes only 22,018 observations as STATA deletes incomplete cases list-wise. Second, the Likelihood Ratio Chi-Square Statistic is 3182.55. The corresponding LR Chi-Square Test tests the null hypothesis that the coefficients of all independent variables are jointly equal to zero. The probability of obtaining an LR Test statistic of 3182.55 or more if all coefficients were jointly equal to zero is practically zero. It can be concluded that the model as a whole is significant.
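In case the mechanics of this test are unclear, here is the standard form of the statistic. The symbols $\ln L_0$ (log-likelihood of the constant-only model) and $\ln L_1$ (log-likelihood of the full model) are my own notation, and I do not report the two log-likelihoods here:

$$
LR = 2\,(\ln L_1 - \ln L_0) \;\sim\; \chi^2(q), \qquad q = 16 \times 3 = 48
$$

The degrees of freedom $q$ follow from the specification: the model has 16 independent variables and each of them gets one coefficient for each of the 3 non-base outcomes.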

Thereafter one can interpret the significance of the coefficients. In the first panel ‘former party smoker’ minority and mild depression are significant at the 10 percent level. Immigrant is significant at the 5 percent level and medium and high education, high income, age, female, children and all drinker dummies are significant at the 1 percent level. In the second panel ‘former daily smoker’ the variables medium and high income are significant at the 5 percent level. The dummy variables on education and drinking behaviour as well as age, female, children, minority and high depression are significant at the 1 percent level. In the third panel ‘current smoker’ the dummy variables on depression, drinking behaviour and income are all significant at the 1 percent level. High education, age, female, employed and children are also significant at the 1 percent level.

The sign of the coefficients can be interpreted as follows: A positive coefficient indicates increased odds for outcome 2 over 1, outcome 3 over 1, or outcome 4 over 1. A negative coefficient indicates decreased odds for outcome 2 over 1, outcome 3 over 1, or outcome 4 over 1. The regression result always has to be interpreted relative to the base outcome, which is that the respondent has never smoked. For example, higher incomes increase the odds of being a former party smoker or former daily smoker but decrease the odds of being a current smoker compared to the odds of having never smoked, ceteris paribus. Similarly, having obtained more education is associated with an increase in the odds of being a former party smoker or former daily smoker (only medium educ) but with a decrease in the odds of being a current smoker compared to the odds of having never smoked, everything else held constant. Being female reduces the odds of all outcomes compared to the odds of the base outcome, ceteris paribus. The other coefficients can be interpreted in a similar fashion. However, the magnitudes of the coefficients cannot be interpreted easily. This is why it is common to turn to relative risk ratios instead.
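To see why the sign is read relative to the base outcome, it helps to write the model down. The following is the standard multinomial logit with four outcomes and base outcome 1 (never smoked); $x_i$ and $\beta_j$ are my own shorthand for the respondent's characteristics and the coefficients of panel $j$:

$$
P(Y_i = j \mid x_i) = \frac{\exp(x_i^{\top}\beta_j)}{\sum_{k=1}^{4} \exp(x_i^{\top}\beta_k)}, \qquad \beta_1 = 0
$$

$$
\ln \frac{P(Y_i = j \mid x_i)}{P(Y_i = 1 \mid x_i)} = x_i^{\top}\beta_j, \qquad j = 2, 3, 4
$$

Each coefficient is therefore the change in the log of the odds of outcome $j$ over never having smoked for a one-unit change in the variable, which is why only its sign is easy to read off directly.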

Relative Risk Ratios

Relative risk ratios (RRR) can be interpreted in a similar manner to odds ratios in the ordinary logit model. They are merely the exponentiated MLM coefficients from the regression output above. STATA can compute them automatically if one adds RRR or RR to the – mlogit – command:

mlogit smoking medium_educ high_educ medium_income high_income age female employed children minority immigrant mild_depression high_depression daily_drinker frequent_drinker weekly_drinker monthly_drinker, baseoutcome(1) rrr

Let’s start with the RRRs in the first panel. The risk of being a former party smoker vs. never smoker for respondents with medium education is 1.20 times the risk for respondents with low education, i.e. 20 percent higher. Likewise the risk of being a former party smoker vs. never smoker for respondents with high education is 1.37 times the risk for respondents with low education, i.e. 37 percent higher, ceteris paribus.

The risk of being a former party smoker vs. never smoker for respondents with a high income relative to respondents with a low income is about 18.85 percent higher, holding everything else constant. The risk of being a former party smoker vs. never smoker falls by about 1.7 percent for each additional year of age, all else being equal. The risk of being a former party smoker vs. never smoker is 21.71 percent lower for females relative to males, when everything else is held constant. The risk of being a former party smoker vs. never smoker is 14.63 percent lower for respondents living with children compared to respondents that do not. The risk of being a former party smoker vs. never smoker is 20.59 percent lower for members of a minority ethnic group compared to non-minorities, ceteris paribus. Similarly, the risk is 17.01 percent lower for immigrants compared to non-immigrants.
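The percentages in this section all come from the same small conversion, which can be sketched as follows. The helper names are hypothetical, and the RRR values are simply the ones implied by the percentages quoted in the text (for example, 21.71 percent lower corresponds to an RRR of 0.7829); they are not copied from the STATA output.

```python
import math


def rrr_from_coef(coef):
    """A relative risk ratio is the exponentiated multinomial logit coefficient."""
    return math.exp(coef)


def percent_change(rrr):
    """Percent change in the relative risk implied by an RRR (negative = lower)."""
    return (rrr - 1.0) * 100.0


# RRRs implied by the percentages quoted for the first panel.
examples = {"medium_educ": 1.20, "high_educ": 1.37, "female": 0.7829}

for name, rrr in examples.items():
    coef = math.log(rrr)  # the coefficient that would produce this RRR
    print(f"{name}: RRR={rrr:.4f}, change={percent_change(rrr):+.2f}%, coef={coef:+.4f}")
```

An RRR above one goes with a positive coefficient and an RRR below one with a negative coefficient, so the direction of the effect is the same in both tables.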

Having a mild depression increases the risk of being a former party smoker vs. never smoker by about 10.25 percent compared to no depression. Lastly, the risk of being a former party smoker vs. never smoker is 106.34 percent higher for daily drinkers, 105.48 percent higher for frequent drinkers, 88.62 percent higher for weekly drinkers and 75 percent higher for monthly drinkers compared to less frequent drinkers, ceteris paribus. Hence, drinking alcohol more frequently increases the odds of being a former party smoker over being a never smoker significantly and shows that alcohol and cigarette consumption tend to go together. It does not have to be a causal relationship but one could argue that social drinking induces individuals to at least try smoking once in their lives.

The RRRs in the second panel can be interpreted as follows. Having obtained a medium level of education compared to low education levels increases the odds of being a former daily smoker vs. never smoker by 14.22 percent, ceteris paribus. Likewise, earning a medium income or a high income increases the odds in favour of former daily smoking over never having smoked by 11.64 percent and 12.96 percent, respectively. An additional year of age increases the odds of being a former daily smoker instead of having never smoked by 1.85 percent, holding everything else constant. The odds of being a former daily smoker vs. never smoker are 38.95 percent lower for females compared to males. The risk of being a former daily smoker rather than a never smoker increases by 15.43 percent if the respondent has children living at home compared to respondents without children at home. The risk of being a former daily smoker vs. never smoker is 36.24 percent lower for minorities compared to non-minorities. High levels of depression increase the odds in favour of being a former daily smoker relative to never having smoked by 25.55 percent. The risk of being a former daily smoker vs. never smoker is 213.84 percent higher for daily drinkers, 158.73 percent higher for frequent drinkers, 80.64 percent higher for weekly drinkers and 59.67 percent higher for monthly drinkers compared to less frequent drinkers, ceteris paribus. Again, drinking alcohol more frequently increases the odds of being a former daily smoker over being a never smoker significantly and confirms the view that alcohol and cigarette consumption tend to go together. Respondents that drink more than once a week are predicted to be former daily smokers rather than never smokers.

In the third panel the RRRs can be interpreted as follows. Having obtained a high level of education compared to low education levels decreases the odds of being a current smoker vs. never smoker by 45.84 percent, ceteris paribus. Likewise, earning a medium income or a high income decreases the odds in favour of currently smoking over never having smoked by 23.78 percent and 41.65 percent, respectively. An additional year of age decreases the odds of currently smoking over having never smoked by 1.82 percent, holding everything else constant. The odds of being a current vs. never smoker are 32.96 percent lower for females compared to males. The risk of being a current smoker vs. never smoker is 38.27 percent higher for respondents currently employed compared to respondents not currently employed.

Mild levels of depression compared to no depression increase the odds in favour of being a current smoker relative to never having smoked by 31.49 percent. High levels of depression increase the odds in favour of being a current smoker relative to never having smoked by 122.13 percent. The risk of being a current smoker vs. never smoker is 289.93 percent higher for daily drinkers, 166.62 percent higher for frequent drinkers, 82.29 percent higher for weekly drinkers and 41.33 percent higher for monthly drinkers compared to less frequent drinkers, ceteris paribus. Again this confirms that drinking alcohol more frequently increases the odds of being a current smoker over being a never smoker significantly and also that alcohol and cigarette consumption tend to go together. Respondents that drink more than once a week are predicted to be current smokers rather than never smokers.

The risk of being a current smoker rather than never smoker increases by 11.90 percent if the respondent has children living at home compared to respondents without children at home. This does not imply a causal relationship in the sense that children cause people to smoke. However, this finding is troublesome in the sense that the RRR was expected to be below one (i.e. a negative coefficient), meaning that having children at home decreases the odds in favour of being a current smoker over never smoker. The command – adjrr children – can be used to shed light on the relationship between having children living at home or not and the smoking outcomes. Respondents with children at home are 4.17 percent less likely to be a never smoker than respondents without children at home. This group is also 18.08 percent less likely to be a former party smoker compared to respondents not living with children. However, this group is 9.57 percent more likely to be former daily smokers compared to respondents not living with children at home and this group is also 6.55 percent more likely to be current smokers compared to respondents not living with children at home. In terms of absolute differences, respondents with children at home are 1.52 percentage points more likely to be current smokers than respondents without children at home, on average. They are also 2.33 percentage points more likely to be former daily smokers, on average.
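The difference between the relative figures (percent) and the absolute figures (percentage points) can be made concrete with a short sketch. The two predicted probabilities below are hypothetical values that I picked so that they roughly reproduce the current-smoker figures quoted above; they are not taken from the output, and the function names are my own.

```python
def adjusted_risk_ratio(p_exposed, p_unexposed):
    """Ratio of two average predicted probabilities (relative comparison)."""
    return p_exposed / p_unexposed


def adjusted_risk_difference(p_exposed, p_unexposed):
    """Difference of two average predicted probabilities (absolute comparison)."""
    return p_exposed - p_unexposed


# Hypothetical average predicted probabilities of being a current smoker,
# with and without children at home (assumed values, not from the output).
p_children, p_no_children = 0.2473, 0.2321

arr = adjusted_risk_ratio(p_children, p_no_children)
ard = adjusted_risk_difference(p_children, p_no_children)

print(f"relative difference: {(arr - 1) * 100:.2f} percent")
print(f"absolute difference: {ard * 100:.2f} percentage points")
```

Because roughly a quarter of the sample currently smokes, a relative difference of several percent translates into an absolute difference of only one to two percentage points.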

Measures of fit

After having described the findings the – fitstat – command can shed light on the overall goodness of fit.

For example, the adjusted count R-squared measures the proportion of correct predictions beyond the baseline model (IDRE, 2011). It shows that the percentage of correct predictions beyond this baseline model is 8.6 percent. Hence, while my variables turn out to be significant at the margin, the overall decision to smoke or to have tried smoking still remains largely unexplained by the model. There might be other factors that could do a better job and should be included in the model.
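To make the definition less abstract, here is a minimal sketch of how the adjusted count R-squared is computed from a classification table. The 3x3 table is entirely made up for illustration (my model has four outcomes and I do not reproduce its table here), and the function name is hypothetical.

```python
def adjusted_count_r2(confusion):
    """Adjusted count R-squared from a square classification table.

    confusion[i][j] is the number of cases with observed outcome i and
    predicted outcome j. The baseline model always predicts the most
    frequent observed outcome.
    """
    n_total = sum(sum(row) for row in confusion)
    n_correct = sum(confusion[i][i] for i in range(len(confusion)))
    n_modal = max(sum(row) for row in confusion)  # correct guesses of the baseline
    return (n_correct - n_modal) / (n_total - n_modal)


# Hypothetical classification table (rows: observed, columns: predicted).
table = [
    [40, 5, 5],
    [10, 15, 5],
    [8, 2, 10],
]

print(adjusted_count_r2(table))
```

In this toy table the model classifies 65 of 100 cases correctly, while always guessing the largest category would already get 50 right, so only the 15 extra correct predictions out of the 50 remaining cases count towards the measure.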

Marginal effects

STATA allows for the computation of marginal effects with the help of the – margins – command. Marginal effects differ for discrete and continuous variables: for the former they are discrete changes, i.e. from 0 to 1, and for the latter they are instantaneous rates of change. Marginal effects are commonly calculated at the means of the independent variables. Therefore STATA first presents all means before printing the results.

First of all it can be noted that only age is a continuous variable. All other variables are binary and take only a value of 0 or 1. The marginal effects for all those variables therefore show how the predicted probability of an outcome, P(Y=j), changes as these independent binary variables change from 0 to 1 while all other variables are held constant at their means (Williams, 2016). For example, the predicted probability of being a current smoker is 0.158 greater for daily drinkers and 0.102 greater for frequent drinkers if you take two hypothetical respondents evaluated at the means. Another example is education. For two hypothetical respondents evaluated at the means of the sample, having obtained high education reduces the probability of being a current smoker, i.e. the predicted probability of being a current smoker is 0.118 smaller for individuals in the high education group compared to individuals in the low education group. In contrast, the negative effect of secondary education is a lot smaller and less significant.
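The idea of a discrete change evaluated at the means can be sketched as follows. This is a stripped-down illustration with only a constant, age and one drinking dummy; the coefficients and the mean age are hypothetical values, not my estimates, and `mnl_probs` is a helper I made up for this post.

```python
import math


def mnl_probs(x, betas):
    """Multinomial logit probabilities for one respondent.

    x is the list of regressors (including the constant) and betas holds one
    coefficient list per non-base outcome. The base outcome has score 0.
    """
    scores = [0.0] + [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical coefficients (constant, age, daily_drinker) for the three
# non-base outcomes: former party, former daily and current smoker.
betas = [
    [-1.0, -0.017, 0.72],
    [-1.5, 0.018, 1.14],
    [0.2, -0.018, 1.36],
]
mean_age = 49.0  # assumed sample mean of age

p_not_daily = mnl_probs([1.0, mean_age, 0.0], betas)
p_daily = mnl_probs([1.0, mean_age, 1.0], betas)

# Discrete change in each outcome probability when the dummy goes from 0 to 1.
discrete_change = [after - before for before, after in zip(p_not_daily, p_daily)]
print([round(d, 3) for d in discrete_change])
```

Since the four probabilities always sum to one, the discrete changes across the outcomes sum to zero: whatever the current-smoker category gains has to come out of the other categories.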

Regression diagnostics: I. Multicollinearity

The model can be tested for multicollinearity, which would cause standard errors to be inflated, with the – collin – command.

There are different rules of thumb for detecting multicollinearity. The most rigorous is probably a Variance Inflation Factor (VIF) greater than 2, which corresponds to a tolerance (1/VIF) lower than 0.5. However, my model does not suffer from inflated standard errors, and a mean VIF of 1.37 is pretty good.
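The link between the VIF, the tolerance and the underlying auxiliary regression can be written as a two-line sketch. The function names are my own; the only numbers used are the rule-of-thumb cut-off and the mean VIF mentioned above.

```python
def vif(r_squared):
    """VIF of a regressor, given the R-squared from regressing it on all other regressors."""
    return 1.0 / (1.0 - r_squared)


def tolerance(vif_value):
    """Tolerance is the reciprocal of the VIF, i.e. 1 minus the auxiliary R-squared."""
    return 1.0 / vif_value


# The cut-off VIF = 2 corresponds to an auxiliary R-squared of 0.5.
print(vif(0.5), tolerance(2.0))

# A VIF equal to the reported mean of 1.37 corresponds to this tolerance.
print(round(tolerance(1.37), 2))
```

In other words, the strict rule of thumb flags a variable once half of its variation can be explained by the other regressors.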

II. Tests of independent variables

The – mlogtest – command allows for testing the independent variables. There is the option of a likelihood ratio test (lr) as well as a Wald test (wald). Both test the null hypothesis that all coefficients associated with a given variable are in fact zero (Williams, 2015).

Both tests reject the null hypothesis for all variables at the 1 percent level except for the immigrant variable. For this one the null hypothesis can be rejected at the 5 percent level. Hence each variable’s effects are highly significant in the model.

III. Tests for combining dependent categories

The – mlogtest – command also allows for testing whether the categories of the dependent variable should in fact be combined. Again, there is the option for a Likelihood-Ratio test (lrcomb) as well as a Wald test (combine). The null hypothesis is that all coefficients except intercepts associated with a given pair of alternatives are in fact zero, meaning that the alternatives can be collapsed for a more efficient estimation (Williams, 2015).

Overall, both the LR and the Wald Test confirm that none of the categories should be combined. They are significant at the 1 percent level. It can be concluded that the outcomes are distinguishable with respect to the variables included in the model.

IV. Tests for independence of irrelevant alternatives

Lastly, the – mlogtest – command can test the independence of irrelevant alternatives (IIA) assumption which is crucial for the multinomial logit model. If violated one can revert to an alternative specific multinomial probit or a nested logit model. Both relax the IIA assumption (IDRE, 2010). The test for IIA is either based on a Hausman test, a suest-based Hausman test or a Small-Hsiao test. All of the three tests work in a similar manner; for each alternative in the model they drop the individuals that choose that particular alternative and then re-estimate the model with the alternatives that remain (Allison, 2012). Because I have 3 alternatives in my model (beyond the base outcome of never having smoked), the tests proceed in three steps. They first drop being a former party smoker (fparty), then drop being a former daily smoker (fdaily) and lastly drop being a current smoker (current). If the IIA assumption were to hold, the results of the restricted model (2 alternatives) should not differ from the unrestricted model (3 alternatives).
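What the IIA assumption means in practice can be shown with a small sketch. The linear predictors below are hypothetical numbers for a single respondent, and `mnl_probs` is a helper made up for this illustration; the outcome labels follow the ones used above.

```python
import math


def mnl_probs(scores):
    """Multinomial logit probabilities from a list of linear predictors."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear predictors for one respondent (base outcome: never = 0).
scores = {"never": 0.0, "fparty": -1.2, "fdaily": -0.4, "current": -0.5}

# Probabilities with all four alternatives available.
full = dict(zip(scores, mnl_probs(list(scores.values()))))

# Drop the alternative 'fparty' and recompute with the remaining three.
remaining = {k: v for k, v in scores.items() if k != "fparty"}
restricted = dict(zip(remaining, mnl_probs(list(remaining.values()))))

odds_full = full["current"] / full["never"]
odds_restricted = restricted["current"] / restricted["never"]
print(odds_full, odds_restricted)
```

Under the multinomial logit the odds of being a current smoker over a never smoker depend only on the two linear predictors involved, so they are the same whether or not the third alternative is available. The tests check whether the re-estimated coefficients are consistent with this property.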

It should be noted that the tests have been criticized, because they are typically inconclusive or even contradictory. In case you want more information on this, Peter Allison (2012) has devoted a complete blog post to the drawbacks of the three tests. One of the major criticisms is that the Small-Hsiao test results in different outcomes every time because it splits the sample into two halves and also the Hausman test results in different outcomes if one changes the base category (Sarkisian, n.d.). This is why it is often recommended to instead focus on the Hausman test which uses seemingly unrelated estimation (SUE) as methodology (Long and Freese, 2005).

In STATA one can obtain the three tests with the command – mlogtest, iia. First, the Hausman test does not provide me with anything because the Chi2<0 and therefore my model does not meet the asymptotic assumptions of the test. Second, the suest-based Hausman test provides strong evidence against independence of irrelevant alternatives in the sample. It rejects the null hypothesis at the 1 percent level. However, the third test, i.e. the Small-Hsiao test of the IIA assumption, cannot reject the null hypothesis that the odds are independent of other alternatives. It contradicts the results of the suest-based Hausman test. As noted earlier, this is in line with the major criticisms toward IIA testing. To ensure that the violation of the IIA assumption does not interfere with my results, I should consider running an alternative specific multinomial probit or a nested logit model. However, I’ll leave this for another blog post in the future!
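A negative Chi2 may look odd for a statistic that should follow a chi-square distribution, so here is the usual form of the Hausman statistic. The notation is my own: $\hat{\beta}_r$ are the estimates from the restricted model with one alternative dropped, $\hat{\beta}_u$ is the matching subset of estimates from the full model, and $\hat{V}_r$ and $\hat{V}_u$ are their estimated covariance matrices.

$$
H = (\hat{\beta}_r - \hat{\beta}_u)^{\top} \left[ \hat{V}_r - \hat{V}_u \right]^{-1} (\hat{\beta}_r - \hat{\beta}_u)
$$

Asymptotically the difference $\hat{V}_r - \hat{V}_u$ is positive semidefinite, but in a finite sample it does not have to be, and in that case $H$ can turn out negative. The suest-based version estimates the covariance of the difference in a different way, which is why it does not run into this problem.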

Jasse

Data and Documentation

ESS Round 7: European Social Survey (2015): ESS-7 2014 Documentation Report. Edition 1.0. Bergen, European Social Survey Data Archive, Norwegian Social Science Data Services for ESS ERIC.

ESS Round 7: European Social Survey Round 7 Data (2014). Data file edition 1.0. Norwegian Social Science Data Services, Norway – Data Archive and distributor of ESS data for ESS ERIC.

Inspiration for the Model

Brown, D.C. (n.d.). Models for Ordered and Unordered Categorical Variables [pdf]. Population Research Center. Retrieved from: https://www.utexas.edu/cola/prc/_files/cs/Multinomial_Ordinal_Models.pdf

References

Allison, P. (2012, 8 October). How Relevant is the Independence of Irrelevant Alternatives? Statistical Horizons. Retrieved from: http://statisticalhorizons.com/iia

IDRE (2010, 23 April). Stata Data Analysis Examples: Multinomial Logistic Regression. Institute for Digital Research and Education. Retrieved from: http://www.ats.ucla.edu/stat/stata/dae/mlogit.htm

IDRE (2011, 20 October). FAQ: What are pseudo R-squareds? Institute for Digital Research and Education. Retrieved from: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm

Long, J., and Freese, J. (2005). Regression Models For Categorical Dependent Variables Using Stata (2nd ed.). College Station, TX: Stata Press.

Sarkisian, N. (n.d.). Sociology 704: Topics in Multivariate Statistics – Multinomial Logit [pdf]. Retrieved from: http://www.sarkisian.net/sc704/mlogit.pdf

Williams, R. (2015, 21 February). Post-Estimation Commands for MLogit [pdf]. Retrieved from: https://www3.nd.edu/~rwilliam/stats3/Mlogit2.pdf

Williams, R. (2016, 23 January). Marginal Effects for Continuous Variables [pdf]. Retrieved from: http://www3.nd.edu/~rwilliam/xsoc73994/Margins02.pdf

# What Makes A Musician? Econometrics

Today my post is about econometrics; in particular, getting a handle on the Logit model. I took it as an opportunity to investigate why some people have come to play a musical instrument and others haven’t. In order to do so I needed to find a dataset which also includes questions about an individual’s free time activities, i.e. a survey that asks respondents whether they play an instrument for leisure. This is why I obtained the National Survey of Culture, Leisure and Sport 2014-2015 from the UK Data Service. The survey is carried out by the Department for Culture, Media and Sport and its partner organisations Sport England, English Heritage and Arts Council England. Since the 2012/13 survey, the study has also included longitudinal elements, and for the 2014/15 survey the target was to achieve a sample size of 10,000 respondents, equally split between longitudinal and new respondents. The study asks people a large range of questions concerning their cultural activities, events they participate in, as well as hobbies and sports. It has also recently introduced a section about what people were doing when they were growing up, which will come in handy for constructing a variable on whether a person played an instrument during childhood.

Summary statistics

Let’s start with an overview of the data. The dataset includes 9,817 observations, of which 5,480 are female (55.82 percent). Regarding the demographics, the mean age in the sample is 53 years while the median lies at 54 years. The minimum age is 16 (91 observations) and the maximum age is 100 (1 observation). The standard deviation is around 18.53, so if age were roughly normally distributed, around 68 percent of the observations would lie between about 35 and 72 years. More than half of the respondents are married, 20 percent are single, while around 12 percent are either widowed or divorced. Around 5.5 percent of the respondents are lone parents with live-in children and around 18.4 percent of the respondents have live-in children and currently live with a partner. Almost 32 percent of the respondents have obtained higher education and professional/vocational equivalents as their highest qualification level, meaning that the other 5,409 respondents have obtained qualifications lower than this. In addition, 48.57 percent of the respondents are currently in paid work. The remainder are, for example, unemployed, students, retired, sick, looking after family, or in training schemes.

In their free time, 914 out of the 9,817 respondents play a musical instrument. This is equal to 9.31 percent of the sample. Besides, 171 respondents (1.74 percent) have written music in the 12 months preceding the questionnaire while the remaining 9,646 have not. Around 12 percent of the respondents have participated in painting, drawing, printmaking or sculpture in the 12 months preceding the interview. A clear majority, almost 66 percent of the respondents, said that they have read for pleasure in the last 12 months, which excludes newspapers, magazines or comics. In terms of music making, 427 of the respondents have sung as a performance or in rehearsal/practice in the last 12 months, which is equal to 4.35 percent of the sample. 2,206 respondents played a musical instrument, acted, danced or sang when they were growing up. Growing up is defined as the period from around age 11 to age 15. On the other hand, 2,621 of the respondents did not participate in such activities. Another 4,990 respondents were not asked this question in their longitudinal questionnaire. This will reduce the sample to an effective size of at most 4,827 observations later on.

The Logit Model

This leaves us with 11 independent regressors and the dependent dummy variable of playing an instrument (1=yes) for the Logit model. 9 of the regressors are binary themselves. In addition, age is measured in 4 categories with the 16-24 age category as the reference group. This will show whether older people are less likely to play an instrument than the youngest generation. Parental status has 3 categories with no live-in children as the reference category. It assesses whether people with children are less likely to play an instrument due to time constraints, especially when they are lone parents.

Coding Overview

| Variable | Coding |
| --- | --- |
| Instrument | 1 if playing an instrument |
| Gender | 1 if female |
| Age | 16-24 (ref.); 25-44; 45-64; >65 |
| Marital status | 1 if (de facto) married |
| Parental status | No live-in children (ref.); lone parent with live-in children; partnered with live-in children |
| Education | 1 if higher education |
| Work | 1 if currently in paid work |
| Written music | 1 if written music in the last 12 months |
| Read books | 1 if read for pleasure in the last 12 months |
| Painting | 1 if participated in painting, drawing, printmaking or sculpture in the last 12 months |
| Singing | 1 if sang in front of an audience or practiced singing in the last 12 months |
| Childhood instrument | 1 if played a music instrument, acted, danced or sang when growing up (age 11-15) |
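To make the reference-group coding concrete, here is a minimal Python sketch (not the Stata code used for the estimation) of how a categorical variable such as age or parental status turns into 0/1 dummies with the first category left out as the reference. The function name `dummy_code` and the shortened category labels are assumptions made for illustration.

```python
# Category labels follow the coding overview; the first entry of each list
# is the reference group and therefore gets no dummy of its own.
AGE_GROUPS = ["16-24", "25-44", "45-64", ">65"]
PARENTAL_STATUS = ["no live-in children", "lone parent", "partnered with children"]


def dummy_code(value, categories):
    """Return one 0/1 indicator for every non-reference category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # A respondent in the reference category has all indicators equal to 0.
    return {cat: int(value == cat) for cat in categories[1:]}


# Hypothetical respondent: female, aged 45-64, lone parent.
row = {
    "female": 1,
    **dummy_code("45-64", AGE_GROUPS),
    **dummy_code("lone parent", PARENTAL_STATUS),
}
print(row)
```

Each coefficient on such a dummy is then read relative to the omitted group, which is why the regression table has three age rows and two parental status rows.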

Regression Results

The table below summarizes the initial regression results obtained by estimating a Logit model in Stata, where one, two and three asterisks indicate significance at the 10, 5 and 1 percent level, respectively. OR denotes the odds ratio and STDE its standard error.

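| Playing an instrument | Model 1 OR | Model 1 STDE | Model 1 sig. | Model 2 OR | Model 2 STDE | Model 2 sig. |
| --- | --- | --- | --- | --- | --- | --- |
| female | 0.34 | 0.04 | *** | 0.34 | 0.04 | *** |
| age 25-44 | 0.67 | 0.14 | * | 0.66 | 0.14 | * |
| age 45-64 | 0.78 | 0.16 | | 0.77 | 0.16 | |
| age above 65 | 0.47 | 0.11 | *** | 0.48 | 0.11 | *** |
| married | 0.87 | 0.12 | | 0.89 | 0.13 | |
| lone parent with children | 0.70 | 0.19 | | 0.71 | 0.19 | |
| partnered with children | 0.79 | 0.15 | | 0.80 | 0.15 | |
| higher education | 1.52 | 0.19 | *** | 1.58 | 0.20 | *** |
| paid work | 0.83 | 0.11 | | 0.82 | 0.11 | |
| read books | 1.37 | 0.20 | ** | 1.39 | 0.20 | ** |
| painting | 1.81 | 0.26 | *** | 1.76 | 0.26 | *** |
| singing | 4.36 | 0.85 | *** | 4.22 | 0.83 | *** |
| write music | 29.63 | 12.13 | *** | 29.47 | 12.09 | *** |
| singing*writing music | 0.16 | 0.09 | *** | 0.16 | 0.10 | *** |
| instrument as child | 5.25 | 0.75 | *** | 5.35 | 0.77 | *** |
| _cons | 0.06 | 0.01 | *** | 0.06 | 0.01 | *** |
| Number of obs | 3912 | | | 3827 | | |
| LR chi2(15) | 608.32 | | | 608.39 | | |
| Prob > chi2 | 0.00 | | | 0 | | |
| Pseudo R2 | 0.22 | | | 0.23 | | |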

In my first model 10 regressors turn out to be significant at the 10 percent level or better, of which 8 are highly significant at the 1 percent level. The significant variables and their odds ratios can be interpreted as follows. In the sample, the odds of playing an instrument are about 66 percent lower for females than for males, ceteris paribus (odds ratio 0.34); put the other way round, the odds for males are 197.47 percent higher, i.e. almost three times those for females. For people in the age group 25 to 44 the odds are about 33 percent lower than for people aged 16 to 24, everything else being equal (equivalently, the odds for the youngest group are 49.65 percent higher). For people aged above 65 the odds are about 53 percent lower than for people aged 16 to 24 (the odds for the youngest group are 113.83 percent higher), while the age group 45 to 64 is not statistically significant. The penalty for the age group 25 to 44 might well be explained by busy schedules due to work and family commitments, while the age group 45 to 64 regains more flexibility, for example once children have left the household. The penalty for people above 65 might derive from deteriorating health, for example poorer eyesight for reading music or hearing loss. While being in paid work does not turn out to be significant, higher education does have a positive impact, i.e. the odds of playing an instrument are 51.76 percent higher for people with the highest level of education than for their peers with lower education levels.

The next significant set of variables are other leisure activities as predictors for playing an instrument. For people that read books for leisure, the odds of playing an instrument are 37 percent higher than for people that do not read for pleasure. For people that paint etc., the odds are 81 percent higher than for people that do not. For people that sing, the odds are 336 percent higher, highlighting that this is a strong predictor for playing an instrument. Because the model also contains the singing*writing music interaction, this figure applies to singers who do not write music. This is a rather intuitive finding, as people with a talent or interest for singing are more likely to be interested or talented in playing an instrument as well (musicianship/musical ability). Even more influential are the variables for having played an instrument, acted, danced or sung as a child and for currently writing music as a hobby. They are very strong predictors for playing an instrument in the model.
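An odds ratio translates into a percentage change in the odds through (OR - 1) × 100. The short Python sketch below applies this to the rounded odds ratios of Logit Model 1, so the results can differ slightly from figures based on unrounded Stata output. The helper name `pct_change_in_odds` is an assumption made for illustration. It also multiplies the singing, writing music and interaction odds ratios to get the combined odds ratio for someone who does both, relative to someone who does neither.

```python
def pct_change_in_odds(odds_ratio):
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratios copied from Logit Model 1 (rounded to two decimals).
odds_ratios = {
    "female": 0.34,
    "age 25-44": 0.67,
    "age above 65": 0.47,
    "higher education": 1.52,
    "read books": 1.37,
    "painting": 1.81,
    "singing": 4.36,
    "instrument as child": 5.25,
}

for name, oratio in odds_ratios.items():
    print(f"{name:20s} OR = {oratio:5.2f}  change in odds: {pct_change_in_odds(oratio):+7.1f}%")

# With an interaction term, the odds ratio for a person who both sings and
# writes music is the product of the two main effects and the interaction.
sing, write, interaction = 4.36, 29.63, 0.16
combined = sing * write * interaction
print(f"sings and writes music: combined OR = {combined:.2f}")
```

Note that these are changes in the odds, not in the probability; the change in probability depends on where a respondent starts.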

Overall this reveals an important trend: people tend to learn an instrument mainly during childhood. They are probably either enrolled by their parents or themselves wish to learn an instrument. They then either continue this hobby in later life or drop it at some point. This seems to be the main path to learning an instrument. In later life, what the model picks up is more of a demographic ‘penalty’, especially for age, which reduces the probability of playing an instrument, rather than incentives for adults to acquire new skills and develop their musical ability. What strikes me though is the large gender gap in the dataset after controlling for other demographic influences like parental status.

Goodness of fit

Overall, the likelihood-ratio chi-square statistic with 15 degrees of freedom is 608.32 and its p-value is practically zero, meaning that we can reject the intercept-only null model, which would always predict 0 (no), in favour of my model. It can be concluded that my model as a whole is statistically significant. Likewise, the Hosmer-Lemeshow goodness-of-fit test does not reject the hypothesis that the model fits the data, with a Hosmer-Lemeshow chi-square statistic of 6.97 with 8 degrees of freedom and a p-value of 0.5401. One can also take the Count R-squared and the adjusted Count R-squared into account. The former is 0.903 while the latter is 0.121. The Count R-squared is simply the share of correct predictions, whereas the adjusted Count R-squared gives the share of the errors made by the baseline model of always predicting 0 that the estimated model avoids. The estimated model therefore makes 12.1 percent fewer prediction errors than the baseline. As more than 90 percent of the respondents in the dataset are non-musicians, the pseudo R-squared statistics should be interpreted with caution, and I will therefore focus on what determines playing an instrument at the margin.
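The link between the two count measures can be sketched in a few lines of Python. The adjusted Count R-squared subtracts the share of the most frequent outcome (the baseline of always predicting 0) from the share of correct predictions and rescales by the baseline error rate. The function names are assumptions made for illustration, and because the two reported values are rounded, the baseline share backed out here is only approximate.

```python
def adj_count_r2(count_r2, baseline_share):
    """Adjusted Count R2 from the share of correct predictions and the
    share of the most frequent outcome (the always-predict-0 baseline)."""
    return (count_r2 - baseline_share) / (1.0 - baseline_share)


def implied_baseline_share(count_r2, adj_r2):
    """Invert the formula above to recover the baseline share."""
    return (count_r2 - adj_r2) / (1.0 - adj_r2)


# Reported values for Model 1.
baseline = implied_baseline_share(0.903, 0.121)
print(f"implied share of non-players in the estimation sample: {baseline:.3f}")
print(f"adjusted Count R2 recomputed: {adj_count_r2(0.903, baseline):.3f}")
```

With a baseline this close to the Count R-squared, it is easy to see why a model can classify about 90 percent of respondents correctly and still improve only modestly on always predicting 0.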

| Measures of Fit | Model 1 |
| --- | --- |
| Log-Lik Intercept Only | -1356.989 |
| Log-Lik Full Model | -1052.83 |
| D(3896) | 2105.66 |
| LR(15) | 608.318 |
| Prob > LR | 0 |
| McFadden’s R2 | 0.224 |
| McFadden’s Adj R2 | 0.212 |
| ML (Cox-Snell) R2 | 0.144 |
| Cragg-Uhler (Nagelkerke) R2 | 0.288 |
| McKelvey & Zavoina’s R2 | 0.328 |
| Efron’s R2 | 0.212 |
| Variance of y* | 4.893 |
| Variance of error | 3.29 |
| Count R2 | 0.903 |
| Adj Count R2 | 0.121 |
| AIC | 0.546 |
| AIC*n | 2137.66 |
| BIC | -30121.288 |
| BIC’ | -484.241 |
| BIC used by Stata | 2238.009 |
| AIC used by Stata | 2137.66 |
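Several of these measures follow directly from the two log-likelihoods, the number of observations and the number of estimated parameters. The Python sketch below recomputes them using the textbook formulas; it is a cross-check under the assumption of 16 parameters (15 regressors plus the constant), not a reproduction of the Stata routine, and the variable names are illustrative.

```python
import math

ll_null = -1356.989  # log-likelihood, intercept only
ll_full = -1052.83   # log-likelihood, full model
n_obs = 3912         # observations in Model 1
n_params = 16        # assumed: 15 regressors plus the constant

# Likelihood-ratio statistic against the intercept-only model.
lr = 2.0 * (ll_full - ll_null)

# Pseudo R-squared measures.
mcfadden = 1.0 - ll_full / ll_null
mcfadden_adj = 1.0 - (ll_full - n_params) / ll_null
cox_snell = 1.0 - math.exp(-lr / n_obs)
nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n_obs))

# Information criteria in the form labelled "used by Stata" in the table.
aic = -2.0 * ll_full + 2.0 * n_params
bic = -2.0 * ll_full + n_params * math.log(n_obs)

for label, value in [
    ("LR(15)", lr),
    ("McFadden R2", mcfadden),
    ("McFadden adj. R2", mcfadden_adj),
    ("Cox-Snell R2", cox_snell),
    ("Nagelkerke R2", nagelkerke),
    ("AIC", aic),
    ("BIC", bic),
]:
    print(f"{label:18s} {value:10.3f}")
```

Dividing the AIC by the number of observations gives the per-observation AIC of about 0.546 shown in the table.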

Misspecification errors

The next step is to test my model for specification errors. It could be that the relationship is not linear or that I missed out on either a relevant regressor or a linear combination of my regressors. Note that singing and writing music are positively correlated (a person that writes music is more likely to sing as well). Therefore I already included an interaction term for people who both sing and write music in my model to avoid misspecification in this regard. The Linktest shows that while _hat is statistically significant at the 1 percent level, _hatsq is not significant with a p-value of 0.325. The test therefore gives no indication that relevant regressors are missing or that the model is misspecified.

Multicollinearity

My model could suffer from multicollinearity, as at least writing music, reading books, painting and singing are similar in nature, i.e. artistic or cultural leisure activities. This could be a source of severe multicollinearity and inflate my standard errors, misleading one to conclude that regressors are insignificant when they in fact need to be included. The tolerance of all regressors is greater than 0.1, which as a rule of thumb is the threshold below which one should have concerns about multicollinearity. Likewise, the variance inflation factors (VIF) are all less than 10 with a mean VIF of 1.77, which is pretty good. The age category dummies have the highest VIFs at 3.54, 3.27 and 2.92 respectively. Therefore my model’s standard errors are not seriously inflated by multicollinearity.
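Tolerance and VIF are two views of the same quantity: tolerance is 1/VIF, and VIF is 1/(1 - R²), where R² comes from regressing one regressor on all the others. A minimal Python sketch, using the three highest VIFs reported above (the function names are illustrative assumptions):

```python
def tolerance(vif):
    """Tolerance is the reciprocal of the variance inflation factor."""
    return 1.0 / vif


def implied_r2(vif):
    """R-squared of the auxiliary regression behind a given VIF."""
    return 1.0 - 1.0 / vif


# The three highest VIFs in the model (the age category dummies).
age_vifs = [3.54, 3.27, 2.92]

for vif in age_vifs:
    print(f"VIF = {vif:.2f}  tolerance = {tolerance(vif):.3f}  auxiliary R2 = {implied_r2(vif):.3f}")
```

The two rules of thumb (tolerance above 0.1, VIF below 10) are therefore the same check stated twice.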

Influential observations

To determine whether there are influential observations due to coding errors or other issues as well as plainly legitimate outliers which might be of interest for further study, one can use a plot of the standardized Pearson and Deviance residuals as well as leverage.

The highest outlier, as shown best in the Pearson residual index plot, is observation number 9329. Looking at the data one can see that this female is in the age range 25 to 44 and a lone parent with live-in children. She is currently in paid work and paints in her leisure time. However, she did not play an instrument, act, dance or sing during childhood. The model predicts that she is not playing an instrument (p=0.04) when she in fact now does so. This respondent probably started playing an instrument at a later age (after 15) despite time constraints regarding family and work and is therefore a significant positive outlier. The lowest outlier is observation number 9321. This person is between age 16 and 24, currently in paid work and also played an instrument, acted, danced or sang during childhood. This male currently writes music in his leisure time and has also done painting in the last 12 months. The model strongly predicts that this respondent plays an instrument with a probability of more than 0.93. However, this person in fact does not play an instrument. Another interesting outlier is observation number 7641. This female currently plays an instrument despite being a lone parent with live-in children. She is in the age range of 16 to 24, in paid work and played an instrument, acted, danced or sang during childhood. The model predicts that this respondent does not play an instrument due to her time constraints (p=0.06) when she in fact does so. The last interesting observation I want to discuss is number 6314. This male is in the age range of 45 to 64 and in paid work. He is currently married but has no live-in children. The respondent sings and paints in his leisure time, reads books and also played an instrument, acted, danced or sang during childhood. The model predicts that this respondent plays an instrument with a probability of more than 0.90 when in fact he does not.

When using the rule of thumb that a leverage greater than three times the average leverage (0.044), i.e. greater than 0.132, marks an influential observation, the sample includes 85 influential observations and 3,827 non-influential observations. However, when the former are excluded in a second model (Logit Model 2), the significance of the regressors does not change. The odds ratios of some regressors do change to a small extent but the main results are the same.
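The rule of thumb itself is easy to sketch in Python. The function below flags every observation whose leverage exceeds a multiple of the average leverage; the function name and the five leverage values are made up purely for illustration and are not taken from the survey data.

```python
def flag_high_leverage(leverages, multiple=3.0):
    """Return the cutoff and the indices of observations above it."""
    mean_leverage = sum(leverages) / len(leverages)
    cutoff = multiple * mean_leverage
    flagged = [i for i, h in enumerate(leverages) if h > cutoff]
    return cutoff, flagged


# Made-up leverage values, for illustration only.
example = [0.02, 0.03, 0.01, 0.04, 0.20]
cutoff, flagged = flag_high_leverage(example)
print(f"cutoff = {cutoff:.3f}, flagged observations: {flagged}")

# Figures reported in the post: the threshold and the size of Model 2.
reported_cutoff = 3 * 0.044
remaining_obs = 3912 - 85
print(f"reported cutoff = {reported_cutoff:.3f}, observations left = {remaining_obs}")
```

Re-estimating the model on the observations that are not flagged is then a simple robustness check.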

Conclusions

Today’s exercise was all about using the Logit model in practice while shedding light on what might be a determinant of playing an instrument. First and foremost, it is driven by acquiring the skills during childhood (age 11 to 15) as well as currently writing music. This is followed by singing. There is an age penalty; as people become older they are less likely to play. Females are significantly less likely to play an instrument in the dataset. This might well derive from family and household commitments which females tend to pursue more often than males leaving them less free time to be allocated to their own hobbies, but that is my own interpretation.

Thanks for reading! I hope you enjoyed the exercise,

Jasse

Department for Culture, Media and Sport. (2016). Taking Part: the National Survey of Culture, Leisure and Sport, 2014-2015; Adult and Child Data. [data collection]. UK Data Service. SN: 7872, http://dx.doi.org/10.5255/UKDA-SN-7872-1.