Replication Exercise

ECONOMICS 191

Fall 2018

Due: October 24

In this exercise, you will demonstrate a basic understanding of how to use econometrics in practice with software such as Stata or R. We will walk you through the commands for Stata, but if you prefer to use R, feel free to do so.

We will use a random sample of the 1988 National Maternal and Infant Health Survey, collected by the U.S. Department of Health and Human Services, and presented in Jeff Wooldridge’s Introductory Econometrics textbook, 3rd edition. This survey was one of the first large-scale data collections to examine the correlation between maternal smoking and infant weight at birth. Though the research design does not support a causal interpretation, the large correlations between maternal smoking and low infant birth weight among women between the ages of 15 and 49 who had a pregnancy in 1988 spurred a tidal wave of literature on the subject that continues to grow.

Please hand in, via bCourses, your log file, do file, and a concise write-up of your findings, answering all questions below.

Note that many of the commands that we ask you to run in Stata are also available in Stata’s drop-down menus. Please do not use the menus. Instead, write each command in your do file.

1 Setup

1. Download the data set birth weight_sample.dta from the “Replication Exercise” folder under the “Files” tab on bCourses. (This will not show up in your Stata log or your do file.)

2. Open Stata and create a new do file by clicking on the “New Do File Editor” icon.

3. Open birth weight_sample.dta in Stata using the “use” command.
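
A minimal do-file skeleton for the setup steps might look like the sketch below. The log file name and the working directory are hypothetical placeholders, and the sketch assumes the downloaded .dta file sits in the working directory.

```stata
* Minimal do-file skeleton (a sketch; file names and paths are placeholders)
clear all
capture log close                          // close any log left open by an earlier run

* cd "path/to/your/folder"                 // hypothetical: point Stata at your working directory
log using "replication_exercise.log", replace text   // hypothetical log file name

use "birth weight_sample.dta", clear       // quotes are needed because the file name contains a space
describe                                   // list the variables and their labels

* ... commands for Sections 2 to 5 go here ...

log close
```

Opening the log at the top and closing it at the bottom means that a single run of the do file produces a complete log to hand in.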

2 Descriptive Statistics

1. Locate the data entry error in the data set. Briefly describe the error in your write-up. (Stata hint: use the summarize command, abbreviated sum, to search for the error.)

2. Replace the incorrect data entry with the missing entry symbol. (Stata hint: missing values are coded with a . in Stata. Use replace varname = . if condition, where varname and condition stand for the variable and the condition that identify the incorrect entry.)

3. Report the median, mean, standard deviation, maximum and minimum of each variable in the data set. (Stata hint: sum, detail)

4. Graph histograms for our two main variables of interest, bwght and cigs. (Stata hint: histogram.) Briefly describe your findings.
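
The pattern behind these hints can be tried on a small made-up data set first. The sketch below builds toy data with one planted bad value; the variable x and the value 999 are invented for illustration and have nothing to do with the course data.

```stata
* Toy illustration on made-up data (not the course data set)
clear
set obs 5
generate x = _n                 // x = 1, 2, 3, 4, 5
replace x = 999 in 3            // plant an implausible value

summarize x                     // the planted value shows up as the maximum
replace x = . if x == 999       // recode it to missing
* Caution: Stata treats . as larger than any number, so a condition such as
* "x > 100" is also true for observations that are already missing.

summarize x, detail             // detail adds the median (50th percentile) and other percentiles
histogram x, frequency          // histogram with counts on the vertical axis
```

On the course data the same commands apply with the actual variable names in place of x. For a count variable such as cigs, the discrete option of histogram may give a cleaner picture.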

3 A Simple Regression

1. Run a univariate linear regression of the variable bwght on the variable cigs. In your write-up, interpret the coefficient on the right-hand side variable as it relates to bwght and indicate whether the coefficient is statistically significant at the 5% level (95% confidence). (Stata hint: regress bwght cigs, robust)

2. Make a publication-quality table for the above regression. Attach this table to your write-up. (Stata: use the command ssc install outreg2 to install the outreg2 package. Use the command outreg2 using "~/olstable.xls", excel replace to make a table displaying the results from the previous regression, where "~" should be replaced by the directory on your computer where you would like the exported .xls file to be placed, e.g. "/Academics/Econ 191/Replication Exercise".)

3. Create a scatter plot of all observations’ values of bwght and cigs. Title the graph “Scatter Plot of Birth Weight (oz.) on Maternal Smoking.” Label the horizontal and vertical axes. (Stata hint: twoway scatter with options ytitle, xtitle, and title.)

4. Add a best-fit prediction line to your scatter plot. Again, title the graph and the two axes. (Stata hint: twoway (scatter bwght cigs) (lfit bwght cigs))

5. In your write-up, describe the relationship you see between the two variables. Does the best-fit line explain anything that was not apparent in the simple scatter plot? Explain. Finally, save the graph with the best-fit line and include it in your write-up. (Stata hint: graph export)
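
Put together in a do file, the hints for this section could look like the sketch below. It assumes the data set is already in memory and outreg2 is installed. The output file names, the axis label wording, and the legend text are placeholders chosen for illustration. Three slashes at the end of a line continue a command onto the next line, which works in a do file but not in the Command window.

```stata
* Sketch for Section 3; assumes the data set is in memory and outreg2 is installed
regress bwght cigs, robust
outreg2 using "olstable.xls", excel replace      // writes to the working directory

* Scatter plot with fitted line, titled and labeled
twoway (scatter bwght cigs) (lfit bwght cigs), ///
    title("Scatter Plot of Birth Weight (oz.) on Maternal Smoking") ///
    xtitle("Maternal smoking (cigs)") ///
    ytitle("Birth weight (oz.)") ///
    legend(order(1 "Observations" 2 "Best-fit line"))

graph export "scatter_bestfit.png", replace      // hypothetical file name
```

Note that the file path is typed with straight double quotes; curly quotes pasted from a word processor are not recognized by Stata.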

4 More Regressions

An obvious concern with the previous regression is the possibility that there are other factors which are correlated with both birth weight and cigarette consumption. Leaving such factors out of the regression could lead to what is called omitted variable bias, and to a spurious correlation between birth weight and cigarette consumption. For example, it could be the case that mother’s education is correlated both with birth weight and cigarette consumption. One way to address this issue is to add control variables to the regression.
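
The direction of the bias can be made concrete with the standard two-regressor textbook case, sketched here under the assumption that mother’s education is the only omitted factor and that the error u is uncorrelated with both regressors. The symbols β, δ, u and v are notation introduced for this illustration only.

```latex
\begin{align*}
\text{True model:}\quad
  & \textit{bwght} = \beta_0 + \beta_1\,\textit{cigs} + \beta_2\,\textit{motheduc} + u \\
\text{Auxiliary regression:}\quad
  & \textit{motheduc} = \delta_0 + \delta_1\,\textit{cigs} + v \\
\text{Short regression slope:}\quad
  & \operatorname{plim}\,\tilde{\beta}_1 = \beta_1 + \beta_2\,\delta_1
\end{align*}
```

The bias term β₂δ₁ vanishes only if motheduc has no effect on bwght (β₂ = 0) or is uncorrelated with cigs (δ₁ = 0). Its sign is the product of the signs of β₂ and δ₁.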

1. Run a multiple linear regression of the variable bwght on the variables cigs, motheduc, fatheduc, and parity. In your write-up, interpret the coefficient on each right-hand side variable as it relates to bwght and indicate whether each coefficient is statistically significant at the 5% level (95% confidence). Append the results of this regression to your table. (Stata hint: outreg2 using "~/olstable.xls", excel append)

2. Add the dummy variables male and white to the regression you ran in 1. Interpret the coefficients of this new regression in your write-up, again indicating whether each coefficient is statistically significant at the 5% level (95% confidence). Did anything change from the results you found above? Append the results of this regression to your table.

3. Create a new variable cigssq, the square of cigs. Add the new variable cigssq to the regression you ran in 1. Interpret the coefficients of this new regression in your write-up, again indicating whether each coefficient is statistically significant at the 5% level (95% confidence). Did anything change from the results you found above? Append the results of this regression to your table. (Stata hint: gen cigssq = cigs^2)

4. Note that the data set contains a log version of the left-hand side variable bwght: lbwght. Rerun your regression from 1 with lbwght as the left-hand side variable. Interpret and assess the statistical significance of each coefficient in this regression in your write-up. Append the results of this regression to your table.

5. Finally, add state fixed effects to the regression you ran in 1 using the variable state. What type of variation does this control for? Interpret your results and append the results of this regression to your table (without including the coefficients for the state dummies). (Stata hint: areg bwght cigs motheduc fatheduc parity, absorb(state))
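
One way to organize the five regressions in a do file is sketched below. It assumes the data set is in memory and that olstable.xls was already created in the working directory with the replace option. Carrying the robust option over from the simple regression hint is an assumption, as the hints in this section do not specify it.

```stata
* Sketch for Section 4; assumes the data set is in memory and olstable.xls
* was already created with the replace option
regress bwght cigs motheduc fatheduc parity, robust
outreg2 using "olstable.xls", excel append

regress bwght cigs motheduc fatheduc parity male white, robust
outreg2 using "olstable.xls", excel append

generate cigssq = cigs^2
regress bwght cigs cigssq motheduc fatheduc parity, robust
outreg2 using "olstable.xls", excel append

regress lbwght cigs motheduc fatheduc parity, robust
outreg2 using "olstable.xls", excel append

* areg absorbs the state dummies, so their coefficients are not reported
areg bwght cigs motheduc fatheduc parity, absorb(state)
outreg2 using "olstable.xls", excel append
```

Calling outreg2 immediately after each estimation command matters, because it exports whichever results are currently stored in memory.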

5 Instrumental Variables

So far, we have not been able to infer a causal relationship between cigarette smoking and birth weight. Instrumental variables are sometimes useful to infer causal relationships. Suppose that we believe that mother’s education is an appropriate instrument for cigarette smoking, our endogenous independent variable.

1. Run a regression of cigarette smoking on mother’s education to verify that there is a significant correlation between our instrument and our endogenous variable. This is called the first-stage regression.

2. Run the two-stage least squares regression using mother’s education as an instrument for cigarette smoking, with birth weight as the dependent variable. (Stata hint: ivreg bwght (cigs=motheduc)) Interpret your result.

3. We have used mother’s education as an instrument for cigarette smoking. However, this is a very poor choice for an instrument! Explain why. Can you think of a more suitable instrument?
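
The two estimation steps can be sketched as follows, assuming the data set is in memory. The sketch uses ivregress 2sls, the newer replacement for the ivreg command in the hint, which assumes Stata 10 or later. The estat firststage line is an optional extra that is not required by the exercise.

```stata
* Sketch for Section 5; assumes the data set is in memory
regress cigs motheduc                      // first stage: endogenous variable on the instrument

ivregress 2sls bwght (cigs = motheduc)     // newer syntax for the ivreg hint (Stata 10 or later)
estat firststage                           // first-stage diagnostics, including the F statistic
```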