Define “collectively exhaustive.”
Define “mutually exclusive.”
What are the rules a probability measure needs to satisfy?
Define “probability mass function.”
Define “cumulative probability function.”
What is the statement made by the “Law of Large Numbers”?
What is the statement made by the “Central Limit Theorem”?
Name two outlier-robust statistics: one that characterizes the central tendency of a variable and one that characterizes its spread.
Define “heteroskedasticity.”
Define “omitted variable bias.”
Define “causal inference.”
What is the fundamental problem of causality?
Problem set 2
Consider an experiment where you throw two four-sided dice. Let be the event “The first die rolls an n” and be “The second die rolls an n.” Determine whether each of the following lists of events (events within a list are separated by “;”) is mutually exclusive, collectively exhaustive, a sample space, or an event space. Which descriptors apply to each list? All of them might apply, only some, or none at all.
; ;
; ; ; ;
; ;
; ; ;
; ; ;
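To check a list of events mechanically, one can represent each event as a set of outcomes. The following is a minimal sketch in Python, not part of the original problem: the helper names (first_die, second_die, mutually_exclusive, collectively_exhaustive) are hypothetical, and the example lists are illustrations rather than the lists given above.

```python
from itertools import product

# Sample space of two four-sided dice: 16 ordered outcomes (first, second).
SAMPLE_SPACE = set(product(range(1, 5), repeat=2))


def first_die(n):
    """Event 'the first die rolls an n' as a set of outcomes (hypothetical helper)."""
    return {o for o in SAMPLE_SPACE if o[0] == n}


def second_die(n):
    """Event 'the second die rolls an n' as a set of outcomes (hypothetical helper)."""
    return {o for o in SAMPLE_SPACE if o[1] == n}


def mutually_exclusive(events):
    """True if no outcome belongs to more than one event in the list."""
    events = list(events)
    return all(a.isdisjoint(b)
               for i, a in enumerate(events)
               for b in events[i + 1:])


def collectively_exhaustive(events):
    """True if the union of the events covers the whole sample space."""
    return set().union(*events) == SAMPLE_SPACE


# Illustrative list 1: the first die shows 1, 2, 3, or 4.
by_first_die = [first_die(n) for n in range(1, 5)]
print(mutually_exclusive(by_first_die), collectively_exhaustive(by_first_die))  # True True

# Illustrative list 2: "first die rolls a 1"; "second die rolls a 1".
# Both events contain the outcome (1, 1), and together they cover only 7 outcomes.
ones = [first_die(1), second_die(1)]
print(mutually_exclusive(ones), collectively_exhaustive(ones))  # False False
```

The same two checks can be applied to any list of events once each event is written as a set of (first, second) outcomes.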
Problem set 3
Assume that you have a dataset with several variables. You estimate an OLS regression of the form . What assumptions are necessary to ensure that the coefficient estimates and of and , respectively, are unbiased? Name each assumption and give an example of a violation of each assumption.
Problem set 4
Load the data set “gb_recoded.dta” from Moodle. It holds data from a survey we conducted in Great Britain in 2008. Each row represents the answers of one respondent to a long series of questions as well as this respondent’s social characteristics and some information about the interview from the survey firm. Look at the data and consider the following variables:
partisanship (variable )
opinion on a legislative measure proposed to increase the legal drinking age for wines and spirits to 21, in order to reduce problems of teenage binge drinking (variable )
turnout in the 2005 UK parliamentary election (variable )
Provide appropriate summary statistics and plots for the variables , , and .
Now, we want to learn how the individual-level characteristics from above relate to public opinion (that is, your dependent variable). Create a reasonable model of public opinion as a function of the variables given above (make sure you include at least age as an independent variable; you will need it for 2). Run a linear regression and interpret the outcome of that regression.
Test the hypothesis: “Age does not have an effect on public opinion about a measure to increase the drinking age.” Give a full report on the outcome of that hypothesis test and interpret that outcome.
Problem set 5
State a research question of your choosing. Sketch a data generating process (DGP) that describes the relationships between the outcome variable (Y), the main explanatory variable (T), observable variables (X), and unobservable variables (U) that speak to your research question. That is, define which variables you would fill in for Y, T, X, and U to answer your research question. This can be the research question you are already working on and have referenced in the problem sets. Then, represent their relationships graphically, that is, indicate the pattern and direction of influence by arrows connecting Y, T, X, and U.
Suppose the attempt to get an unbiased estimate of the relationship between T and Y is affected by the presence of omitted variable bias. Insert an omitted variable O into your sketch of the DGP. How would omitted variable bias enter a simple linear regression model? Provide a mathematical expression of that regression model with omitted variable bias.
Which assumption needs to be met so that our estimate from a regression of the effect of T on Y becomes the estimate of a causal effect? Explain what this assumption means either in plain English or using proper formalisations. How is this assumption related to omitted variable bias? Which problem for making causal claims is solved when this assumption is met?
What kind of manipulation to the DGP would be necessary to trivially meet the assumption discussed in 3?
Problem set 6
Say a survey researcher, who was not educated at the University of Essex’s Government department, wants to learn about UK citizens’ concerns with immigration.
This researcher asks a representative sample of UK citizens whether they would be very angry, somewhat angry, indifferent, not angry, not at all angry about an Eastern European immigrant moving in next door. Is this a good measure of prejudice towards EU immigrants? Why is it? Why is it not?
From the list of tools you learned in class, what would you advise the researcher to do in his/her survey instead to get at an unbiased measure of UK citizens’ prejudice? Name the tool and explain in detail how it is used to measure the level of prejudice towards EU immigrants.
Problem set 7
Create a dataset with 1000 observations. Generate variables RootCause and OtherThing as independent, uncorrelated variables, each drawn from a normal distribution with mean 0 and variance 1. Create a set of normal error terms with mean 0 and variance 1. Let Outcome = 1 + RootCause + 3*OtherThing + errors.
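The data generation described above could be sketched as follows. This is a minimal illustration in Python using only the standard library; the problem set does not prescribe a language, so the language, the seed, the variable names, and the correlation helper are all assumptions.

```python
import random
import statistics

random.seed(42)  # arbitrary seed (an assumption), used only for reproducibility
N = 1000

# Two independent standard normal variables and a standard normal error term.
root_cause = [random.gauss(0, 1) for _ in range(N)]
other_thing = [random.gauss(0, 1) for _ in range(N)]
errors = [random.gauss(0, 1) for _ in range(N)]

# Outcome = 1 + RootCause + 3*OtherThing + errors
outcome = [1 + r + 3 * o + e
           for r, o, e in zip(root_cause, other_thing, errors)]


def correlation(x, y):
    """Sample Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Quick sanity checks: the mean of Outcome should be near 1 and the sample
# correlation between RootCause and OtherThing should be near 0.
print("mean of Outcome:", round(statistics.fmean(outcome), 3))
print("corr(RootCause, OtherThing):", round(correlation(root_cause, other_thing), 3))
```

Because the two regressors are drawn separately, any sample correlation between them is due to chance alone and shrinks as the number of observations grows.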
Draw a graphical representation of the data generating process (DGP) involving the variables Outcome, RootCause, and OtherThing, that is, show by drawing arrows how these three variables relate to each other in the data you generated. Are RootCause and OtherThing independent? If you think they are independent, how would you represent graphically that they are independent in your DGP? If you think they are not independent, how would you represent graphically that they are not independent in your DGP?
Regress Outcome on RootCause. Report and interpret the result. Did you estimate the causal effect of RootCause on Outcome with this regression? If you think you were able to estimate this causal effect, why do you think you were able to do so with this regression? If you think you were not able to estimate this causal effect, why do you think you were not able to do so with this regression?
Regress Outcome on RootCause and OtherThing. Report and interpret the result. Did you estimate the causal effect of RootCause on Outcome? If you think you were able to estimate this causal effect, why do you think you were able to do so with this regression? If you think you were not able to estimate this causal effect, why do you think you were not able to do so with this regression?
Compare the results of the regressions you ran in 2 and 3. What do you see when you compare the coefficients on RootCause estimated in those two regressions? Why do we see those results in 2 and 3?
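The two regressions can be run with any statistics package. As a hedged sketch, the Python code below fits both by solving the normal equations with the standard library only; the ols helper, the seed, and the variable names are assumptions made for illustration, and the data are regenerated so that the block is self-contained.

```python
import random


def ols(y, *regressors):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0, *xs] for xs in zip(*regressors)]
    k = len(rows[0])
    # Build X'X and X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Solve (X'X) b = X'y by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


random.seed(7)  # arbitrary seed (an assumption)
n = 1000
root_cause = [random.gauss(0, 1) for _ in range(n)]
other_thing = [random.gauss(0, 1) for _ in range(n)]  # independent of root_cause
outcome = [1 + r + 3 * o + random.gauss(0, 1)
           for r, o in zip(root_cause, other_thing)]

short_fit = ols(outcome, root_cause)               # Outcome on RootCause
long_fit = ols(outcome, root_cause, other_thing)   # Outcome on RootCause and OtherThing

print("short regression [intercept, RootCause]:", [round(b, 3) for b in short_fit])
print("long regression  [intercept, RootCause, OtherThing]:", [round(b, 3) for b in long_fit])
```

The sketch reports point estimates only; standard errors, which are needed for a full report, are left to the statistics package used in class.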
Problem set 8
Clear your workspace and create a new data set with 1000 observations. Generate variable RootCause following a normal distribution with mean 0 and variance 1. Generate variable OtherThing = 2*RootCause + noise, where noise follows a normal distribution with mean 0 and variance 1. Create a set of normal error terms with mean 0 and variance 1. Let Outcome = 1 + RootCause + 3*OtherThing + errors.
Draw a graphical representation of the data generating process (DGP) involving the variables Outcome, RootCause, and OtherThing, that is, show by drawing arrows how these three variables relate to each other in the data you generated.
Regress Outcome on RootCause. Report and interpret the result. Did you estimate the causal effect of RootCause on Outcome with this regression? If you think you were able to estimate this causal effect, why do you think you were able to do so with this regression? If you think you were not able to estimate this causal effect, why do you think you were not able to do so with this regression?
Regress Outcome on RootCause and OtherThing. Report and interpret the result. Did you estimate the causal effect of RootCause on Outcome? If you think you were able to estimate this causal effect, why do you think you were able to do so with this regression? If you think you were not able to estimate this causal effect, why do you think you were not able to do so with this regression?
Compare the results of the regressions you ran in 2 and 3. What do you see when you compare the coefficients on RootCause estimated in those two regressions? Why do we see those results in 2 and 3? In your answer, explain how those results are related to the question whether the regression of Outcome on RootCause and OtherThing meets the conditional independence assumption.
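For this second design, where OtherThing depends on RootCause, the comparison can be sketched in the same way. The Python code below is an illustration under assumptions (standard library only; the ols helper, seed, and variable names are not from the problem set). It also fits an auxiliary regression of OtherThing on RootCause, so that the gap between the two RootCause coefficients can be compared with the product of the OtherThing coefficient and the auxiliary slope.

```python
import random


def ols(y, *regressors):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0, *xs] for xs in zip(*regressors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Solve (X'X) b = X'y by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


random.seed(8)  # arbitrary seed (an assumption)
n = 1000
root_cause = [random.gauss(0, 1) for _ in range(n)]
# OtherThing = 2*RootCause + noise, so the two regressors are now correlated.
other_thing = [2 * r + random.gauss(0, 1) for r in root_cause]
outcome = [1 + r + 3 * o + random.gauss(0, 1)
           for r, o in zip(root_cause, other_thing)]

short_fit = ols(outcome, root_cause)               # omits OtherThing
long_fit = ols(outcome, root_cause, other_thing)   # includes OtherThing
aux_fit = ols(other_thing, root_cause)             # OtherThing on RootCause

# In-sample identity: short slope = long slope + (OtherThing coefficient * auxiliary slope).
implied_short_slope = long_fit[1] + long_fit[2] * aux_fit[1]

print("RootCause coefficient, short regression:", round(short_fit[1], 3))
print("RootCause coefficient, long regression: ", round(long_fit[1], 3))
print("implied short coefficient from long + auxiliary:", round(implied_short_slope, 3))
```

With these population values the short regression coefficient should land near 1 + 3*2 = 7, while the long regression coefficient should land near 1.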