Stata Modeling & Graphing

Topics

Stata modeling
- Simple regression
- Multiple regression
- Interactions
- Exporting regression tables
- Testing model assumptions
Stata graphing
- Univariate graphs
- Bivariate graphs

Setup

Software & Materials

Laptop users: you will need a copy of Stata installed on your machine. Harvard FAS affiliates can install a licensed version from http://downloads.fas.harvard.edu/download

Download class materials at https://github.com/IQSS/dss-workshops/raw/master/Stata/StataModGraph.zip
Extract materials from the zipped directory StataModGraph.zip (Right-click => Extract All on Windows, double-click on Mac) and move them to your desktop!

Organization

Please feel free to ask questions at any point if they are relevant to the current topic (or if you are lost!)
Collaboration is encouraged - please introduce yourself to your neighbors!
If you are using a laptop, you will need to adjust file paths accordingly
Make comments in your Do-file - save on flash drive or email to yourself

Goals

This is an introduction to modeling and visualization in Stata
Assumes basic knowledge of Stata
Not appropriate for people already familiar with modeling and graphing in Stata
If you are catching on before the rest of the class, experiment with command features described in help files
Learning Objectives:
- Fit models in Stata
- Test modeling assumptions
- Plot basic graphs in Stata
- Plot two-way graphs

Fitting models

Today’s Dataset

We have data on a variety of variables for all 50 states (the 51 observations include the District of Columbia)
Population, density, energy use, voting tendencies, graduation rates, income, etc.
We’re going to be predicting SAT scores
Univariate Regression: SAT scores and Education Expenditures
Does the amount of money spent on education affect the mean SAT score in a state?
Dependent variable: csat
Independent variable: expense

Opening Files

Look at bottom left hand corner of Stata screen
- This is the directory Stata is currently reading from
Files are located in the StataStatGraph folder, inside the Stata folder on the Desktop
Start by telling Stata where to look for these

  // change directory
  cd "~/Desktop/Stata/StataStatGraph"

set more off

cd "~/Desktop/Stata/StataStatGraph"
/nfs/www/edu-harvard-iq-tutorials/Stata/StataStatGraph

Use dir to see what is in the directory:

  dir
  cd dataSets
  dir
  cd ..

dir

total 8
drwxr-sr-x. 2 izahn tutorwww 4096 Oct 22 21:59 dataSets/
drwxr-sr-x. 3 izahn tutorwww 4096 Oct 22 21:59 images/
cd dataSets
/nfs/www/edu-harvard-iq-tutorials/Stata/StataStatGraph/dataSets
dir

total 21008
-rwxr-xr-x. 1 izahn tutorwww 21103444 Oct 22 21:59 NatNeighCrimeStudy.dta*
-rwxr-xr-x. 1 izahn tutorwww     8977 Oct 22 21:59 states.dta*
-rwxr-xr-x. 1 izahn tutorwww   298191 Oct 22 21:59 TimePollPubSchools.dta*
cd ..
/nfs/www/edu-harvard-iq-tutorials/Stata/StataStatGraph

Load the data

  // use the states data set
  use dataSets/states.dta


use dataSets/states.dta
(U.S. states data 1990-91)

Simple regression

Steps for running regression

Examine descriptive statistics
Look at relationship graphically and test correlation(s)
Run and interpret regression
Test regression assumptions

Preliminaries

We want to predict csat scores from expense
First, let’s look at some descriptives

  // generate summary statistics for csat and expense
  sum csat expense


sum csat expense

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        csat |         51     944.098    66.93497        832       1093
     expense |         51    5235.961    1401.155       2960       9259

The codebook gives more detail on each variable

  // look at codebook
  codebook csat expense


codebook csat expense

-------------------------------------------------------------------------------
csat                                                   Mean composite SAT score
-------------------------------------------------------------------------------

                  type:  numeric (int)

                 range:  [832,1093]                   units:  1
         unique values:  45                       missing .:  0/51

                  mean:   944.098
              std. dev:    66.935

           percentiles:        10%       25%       50%       75%       90%
                               874       886       926       997      1024

-------------------------------------------------------------------------------
expense                                         Per pupil expenditures prim&sec
-------------------------------------------------------------------------------

                  type:  numeric (int)

                 range:  [2960,9259]                  units:  1
         unique values:  51                       missing .:  0/51

                  mean:   5235.96
              std. dev:   1401.16

           percentiles:        10%       25%       50%       75%       90%
                              3782      4351      5000      5865      6738

Next, view relationship graphically
Scatterplots work well for bivariate relationships

  // scatterplot of csat (y-axis) against expense (x-axis)
  twoway scatter csat expense

Next look at the correlation matrix

  // correlate csat and expense
  pwcorr csat expense, star(.05)


pwcorr csat expense, star(.05)

             |     csat  expense
-------------+------------------
        csat |   1.0000 
     expense |  -0.4663*  1.0000

Not very interesting with only one predictor

SAT scores & Education Expenditures

  regress csat expense

regress csat expense

      Source |       SS           df       MS      Number of obs   =        51
-------------+----------------------------------   F(1, 49)        =     13.61
       Model |  48708.3001         1  48708.3001   Prob > F        =    0.0006
    Residual |   175306.21        49  3577.67775   R-squared       =    0.2174
-------------+----------------------------------   Adj R-squared   =    0.2015
       Total |   224014.51        50   4480.2902   Root MSE        =    59.814

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0222756   .0060371    -3.69   0.001    -.0344077   -.0101436
       _cons |   1060.732    32.7009    32.44   0.000     995.0175    1126.447
------------------------------------------------------------------------------
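
To read the coefficients, it can help to turn them into a prediction. The sketch below assumes states.dta sits in dataSets/ under the current working directory (as in the cd step earlier) and picks 5000 as an arbitrary illustrative value of expense. With the rounded coefficients shown above, the prediction is 1060.732 - .0222756*5000, or about 949.

```stata
// Assumption: states.dta is in dataSets/ under the current directory
use dataSets/states.dta, clear
regress csat expense

// Predicted mean csat when expense = 5000 (illustrative value):
// intercept + slope * 5000
display _b[_cons] + _b[expense]*5000

// Same prediction, with a standard error and confidence interval
margins, at(expense=5000)
```

Put differently, a 1000-unit increase in expense is associated with a csat roughly 22 points lower in this one-predictor model.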

OLS assumptions

Assumption 1: Specification is appropriate (i.e., no relevant omitted variables)
Assumption 2: Homoscedasticity (The variance around the regression model is the same for all values of the predictor variable)
Assumption 3: Errors are independent
Assumption 4: Relationships are linear
Assumption 5: Normal Distribution of errors (only needed for inference)
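
A quick, informal look at Assumption 4 (linearity) is to overlay a lowess smoother on the straight-line fit. This is only a sketch, assuming states.dta is in dataSets/ under the working directory; if the smoother departs clearly from the line, a linear term alone may not be adequate.

```stata
// Assumption: states.dta is in dataSets/ under the current directory
use dataSets/states.dta, clear

// Scatterplot with a linear fit and a lowess smoother overlaid
twoway (scatter csat expense) ///
       (lfit csat expense) ///
       (lowess csat expense), ///
       legend(order(2 "Linear fit" 3 "Lowess"))
```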

Specification

The model specification should be informed by theory - i.e., our substantive knowledge of the subject matter. It’s important to include all relevant predictors in the model: leaving out a predictor that is related to both the outcome and the included predictors will bias our estimates.
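
Theory comes first, but Stata also offers two rough diagnostics after regress. The sketch below assumes states.dta is in dataSets/ under the working directory. Note that, despite its name, estat ovtest (Ramsey RESET) only checks whether powers of the fitted values add explanatory power; it cannot detect an omitted variable that you never measured.

```stata
// Assumption: states.dta is in dataSets/ under the current directory
use dataSets/states.dta, clear
regress csat expense

// Ramsey RESET test; H0: no omitted powers of the fitted values
estat ovtest

// Link test: a significant _hatsq term hints at misspecification
linktest
```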

Goodness of fit

Homoscedasticity

  rvfplot

rvfplot
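
The residual-versus-fitted plot can be supplemented with a formal test. This is a minimal sketch, assuming states.dta is in dataSets/ under the working directory; estat hettest runs the Breusch-Pagan / Cook-Weisberg test, whose null hypothesis is constant variance.

```stata
// Assumption: states.dta is in dataSets/ under the current directory
use dataSets/states.dta, clear
regress csat expense

// Residuals vs. fitted values, with a reference line at zero
rvfplot, yline(0)

// Breusch-Pagan / Cook-Weisberg test; H0: constant error variance
estat hettest

// One common response to heteroscedasticity: robust standard errors
regress csat expense, vce(robust)
```

Robust standard errors leave the coefficients unchanged and only adjust the standard errors, so they are a cheap sensitivity check.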

Normality

A simple histogram of the residuals can be informative

  // graph the residual values of csat
  predict resid, residual
  histogram resid, normal


predict resid, residual
histogram resid, normal
(bin=7, start=-131.81111, width=38.329487)
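
Besides the histogram, a normal quantile plot and a Shapiro-Wilk test are common checks. The sketch below assumes states.dta is in dataSets/ under the working directory; res_check is a hypothetical variable name chosen so it does not clash with resid above.

```stata
// Assumption: states.dta is in dataSets/ under the current directory
use dataSets/states.dta, clear
regress csat expense

// Store residuals under a new (hypothetical) name
predict double res_check, residuals

// Quantile-normal plot: points near the diagonal suggest normality
qnorm res_check

// Shapiro-Wilk test; H0: residuals are normally distributed
swilk res_check
```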

Multiple Regression

Just keep adding predictors
Let’s try adding some predictors to the model of SAT scores
income :: median household income (in $1,000s)
percent :: % students taking SATs
high :: % adults with HS diploma

Preliminaries

As before, start with descriptive statistics and correlations

  // descriptive statistics and correlations
  sum income percent high
  pwcorr csat expense income percent high


sum income percent high

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |         51    33.95657    6.423134     23.465     48.618
     percent |         51    35.76471    26.19281          4         81
        high |         51    76.26078    5.588741       64.3       86.6
pwcorr csat expense income percent high

             |     csat  expense   income  percent     high
-------------+---------------------------------------------
        csat |   1.0000 
     expense |  -0.4663   1.0000 
      income |  -0.4713   0.6784   1.0000 
     percent |  -0.8758   0.6509   0.6733   1.0000 
        high |   0.0858   0.3133   0.5099   0.1413   1.0000

Regress csat on expense, income, percent, and high

  regress csat expense income percent high

regress csat expense income percent high

      Source |       SS           df       MS      Number of obs   =        51
-------------+----------------------------------   F(4, 46)        =     51.86
       Model |  183354.603         4  45838.6508   Prob > F        =    0.0000
    Residual |  40659.9067        46  883.911016   R-squared       =    0.8185
-------------+----------------------------------   Adj R-squared   =    0.8027
       Total |   224014.51        50   4480.2902   Root MSE        =    29.731

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0045604    .004384     1.04   0.304    -.0042641     .013385
      income |   .4437858   1.138947     0.39   0.699    -1.848795    2.736367
     percent |  -2.533084   .2454477   -10.32   0.000    -3.027145   -2.039024
        high |   2.086599   .9246023     2.26   0.029     .2254712    3.947727
       _cons |   836.6197   58.33238    14.34   0.000     719.2027    954.0366
------------------------------------------------------------------------------

Exercise 0

Multiple Regression

Open the datafile, states.dta.

Select a few variables to use in a multiple regression of your own. Before running the regression, examine descriptive statistics for the variables and generate a few scatterplots.
Run your regression
Examine the plausibility of the assumptions of normality and homoscedasticity

Interactions

What if we wanted to test an interaction between percent & high?
Option 1: generate product terms by hand

  // generate product of percent and high
  gen percenthigh = percent*high 
  regress csat expense income percent high percenthigh


gen percenthigh = percent*high
regress csat expense income percent high percenthigh

      Source |       SS           df       MS      Number of obs   =        51
-------------+----------------------------------   F(5, 45)        =     46.11
       Model |  187430.401         5  37486.0801   Prob > F        =    0.0000
    Residual |  36584.1091        45  812.980201   R-squared       =    0.8367
-------------+----------------------------------   Adj R-squared   =    0.8185
       Total |   224014.51        50   4480.2902   Root MSE        =    28.513

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0045575   .0042044     1.08   0.284    -.0039107    .0130256
      income |   .0887856    1.10374     0.08   0.936    -2.134261    2.311832
     percent |  -8.143002   2.516509    -3.24   0.002    -13.21151   -3.074493
        high |   .4240906   1.156545     0.37   0.716    -1.905311    2.753492
 percenthigh |   .0740926   .0330909     2.24   0.030     .0074441    .1407411
       _cons |    972.525    82.5457    11.78   0.000     806.2695    1138.781
------------------------------------------------------------------------------

Option 2: Let Stata do your dirty work

  // use the # sign to represent interactions 
  regress csat percent high c.percent#c.high
  // same as: regress csat c.percent##c.high


regress csat percent high c.percent#c.high

      Source |       SS           df       MS      Number of obs   =        51
-------------+----------------------------------   F(3, 47)        =     77.39
       Model |  186302.091         3  62100.6971   Prob > F        =    0.0000
    Residual |  37712.4186        47  802.391885   R-squared       =    0.8317
-------------+----------------------------------   Adj R-squared   =    0.8209
       Total |   224014.51        50   4480.2902   Root MSE        =    28.327

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |   -8.15717   2.488388    -3.28   0.002    -13.16316   -3.151179
        high |   .6674578   1.082615     0.62   0.541    -1.510482    2.845398
             |
   c.percent#|
      c.high |   .0764271   .0324919     2.35   0.023     .0110619    .1417924
             |
       _cons |   974.9354   81.98078    11.89   0.000     810.0113    1139.859
------------------------------------------------------------------------------
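
With an interaction in the model, the slope of percent is no longer a single number: it equals the percent coefficient plus the interaction coefficient times high. Using the rounded output above, at high = 75 that is -8.15717 + .0764271*75, or about -2.4. The sketch below lets margins do this arithmetic; it assumes states.dta is in dataSets/ under the working directory, and the values 65, 75 and 85 are illustrative picks within the observed range of high.

```stata
// Assumption: states.dta is in dataSets/ under the current directory
use dataSets/states.dta, clear

// ## includes both main effects and their interaction
regress csat c.percent##c.high

// Slope of percent at selected (illustrative) values of high
margins, dydx(percent) at(high=(65 75 85))

// Plot those slopes with confidence intervals
marginsplot
```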

Categorical Predictors

For categorical variables, we first need to dummy code
Use region as example
- Option 1: create dummy codes before fitting regression model

  // create region dummy codes using tab 
  tab region, gen(region)

  // regress csat on the region dummies (region4 is the omitted reference category)
  regress csat region1 region2 region3


tab region, gen(region)

Geographica |
   l region |      Freq.     Percent        Cum.
------------+-----------------------------------
       West |         13       26.00       26.00
    N. East |          9       18.00       44.00
      South |         16       32.00       76.00
    Midwest |         12       24.00      100.00
------------+-----------------------------------
      Total |         50      100.00


regress csat region1 region2 region3

      Source |       SS           df       MS      Number of obs   =        50
-------------+----------------------------------   F(3, 46)        =      9.61
       Model |  82049.4719         3   27349.824   Prob > F        =    0.0000
    Residual |  130911.908        46  2845.91105   R-squared       =    0.3853
-------------+----------------------------------   Adj R-squared   =    0.3452
       Total |   212961.38        49  4346.15061   Root MSE        =    53.347

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     region1 |  -63.77564   21.35592    -2.99   0.005    -106.7629    -20.7884
     region2 |  -120.5278   23.52385    -5.12   0.000    -167.8788   -73.17672
     region3 |  -80.08333   20.37225    -3.93   0.000    -121.0906   -39.07611
       _cons |   1010.083   15.39998    65.59   0.000     979.0848    1041.082
------------------------------------------------------------------------------

- Option 2: Let Stata do it for you

  // regress csat on region using fvvarlist syntax
  // see help fvvarlist for details
  regress csat i.region



regress csat i.region

      Source |       SS           df       MS      Number of obs   =        50
-------------+----------------------------------   F(3, 46)        =      9.61
       Model |  82049.4719         3   27349.824   Prob > F        =    0.0000
    Residual |  130911.908        46  2845.91105   R-squared       =    0.3853
-------------+----------------------------------   Adj R-squared   =    0.3452
       Total |   212961.38        49  4346.15061   Root MSE        =    53.347

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      region |
    N. East  |  -56.75214   23.13285    -2.45   0.018    -103.3161   -10.18813
      South  |  -16.30769   19.91948    -0.82   0.417    -56.40353    23.78814
    Midwest  |   63.77564   21.35592     2.99   0.005      20.7884    106.7629
             |
       _cons |   946.3077   14.79582    63.96   0.000     916.5253    976.0901
------------------------------------------------------------------------------
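
The two options give different coefficients only because they use different reference categories. As a sketch (assuming states.dta is in dataSets/ under the working directory), the base level can be changed with the ib prefix, and testparm tests all region indicators jointly.

```stata
// Assumption: states.dta is in dataSets/ under the current directory
use dataSets/states.dta, clear

// Default: the lowest-valued category is the reference
regress csat i.region

// Joint test that all region coefficients are zero
testparm i.region

// Use the last category as the reference instead;
// this should reproduce the dummy-variable results from Option 1
regress csat ib(last).region
```

See help fvvarlist for other ways to pick the base level, such as ib#. with a specific category value.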

Exercise 1

Regression, Categorical Predictors, & Interactions

Open the datafile, states.dta.

Add on to the regression equation that you created in exercise 0 by generating an interaction term and testing the interaction.
Try adding a categorical variable to your regression (remember, it will need to be dummy coded). You could use region or generate a new categorical variable from one of the continuous variables in the dataset.

Exporting & saving results

Regression tables

Usually when we’re running regression, we’ll be testing multiple models at a time
Can be difficult to compare results
Stata offers several user-friendly options for storing and viewing regression output from multiple models
First, download the necessary packages:

  // install the user-written outreg2 package from SSC
  ssc install outreg2

Saving & replaying

You can store regression model results in Stata

  // fit two regression models and store the results
  regress csat expense income percent high
  estimates store Model1
  regress csat expense income percent high i.region
  estimates store Model2


regress csat expense income percent high

      Source |       SS           df       MS      Number of obs   =        51
-------------+----------------------------------   F(4, 46)        =     51.86
       Model |  183354.603         4  45838.6508   Prob > F        =    0.0000
    Residual |  40659.9067        46  883.911016   R-squared       =    0.8185
-------------+----------------------------------   Adj R-squared   =    0.8027
       Total |   224014.51        50   4480.2902   Root MSE        =    29.731

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0045604    .004384     1.04   0.304    -.0042641     .013385
      income |   .4437858   1.138947     0.39   0.699    -1.848795    2.736367
     percent |  -2.533084   .2454477   -10.32   0.000    -3.027145   -2.039024
        high |   2.086599   .9246023     2.26   0.029     .2254712    3.947727
       _cons |   836.6197   58.33238    14.34   0.000     719.2027    954.0366
------------------------------------------------------------------------------
estimates store Model1
regress csat expense income percent high i.region

      Source |       SS           df       MS      Number of obs   =        50
-------------+----------------------------------   F(7, 42)        =     51.07
       Model |  190570.293         7  27224.3275   Prob > F        =    0.0000
    Residual |  22391.0874        42  533.121128   R-squared       =    0.8949
-------------+----------------------------------   Adj R-squared   =    0.8773
       Total |   212961.38        49  4346.15061   Root MSE        =    23.089

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   -.004375   .0044603    -0.98   0.332    -.0133763    .0046263
      income |   1.306164    .950279     1.37   0.177    -.6115765    3.223905
     percent |  -2.965514   .2496481   -11.88   0.000    -3.469325   -2.461704
        high |   3.544804   1.075863     3.29   0.002     1.373625    5.715983
             |
      region |
    N. East  |   80.81334    15.4341     5.24   0.000     49.66607    111.9606
      South  |   33.61225   13.94521     2.41   0.020     5.469676    61.75483
    Midwest  |   32.15421   10.20145     3.15   0.003     11.56686    52.74157
             |
       _cons |   724.8289   79.25065     9.15   0.000     564.8946    884.7631
------------------------------------------------------------------------------
estimates store Model2

Stored models can be recalled

  // Display Model1
  estimates replay Model1


estimates replay Model1

-------------------------------------------------------------------------------
Model Model1
-------------------------------------------------------------------------------

      Source |       SS           df       MS      Number of obs   =        51
-------------+----------------------------------   F(4, 46)        =     51.86
       Model |  183354.603         4  45838.6508   Prob > F        =    0.0000
    Residual |  40659.9067        46  883.911016   R-squared       =    0.8185
-------------+----------------------------------   Adj R-squared   =    0.8027
       Total |   224014.51        50   4480.2902   Root MSE        =    29.731

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0045604    .004384     1.04   0.304    -.0042641     .013385
      income |   .4437858   1.138947     0.39   0.699    -1.848795    2.736367
     percent |  -2.533084   .2454477   -10.32   0.000    -3.027145   -2.039024
        high |   2.086599   .9246023     2.26   0.029     .2254712    3.947727
       _cons |   836.6197   58.33238    14.34   0.000     719.2027    954.0366
------------------------------------------------------------------------------

Stored models can be compared

  // Compare Model1 and Model2 coefficients
  estimates table Model1 Model2


estimates table Model1 Model2

----------------------------------------
    Variable |   Model1       Model2    
-------------+--------------------------
     expense |  .00456044   -.00437502  
      income |  .44378583    1.3061642  
     percent | -2.5330843   -2.9655142  
        high |  2.0865991    3.5448038  
             |
      region |
    N. East  |               80.813342  
      South  |               33.612251  
    Midwest  |               32.154215  
             |
       _cons |  836.61966    724.82886  
----------------------------------------
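
The comparison table can carry more than coefficients. The sketch below assumes states.dta is in dataSets/ under the working directory and refits both models quietly so that it runs on its own; the display formats are arbitrary choices.

```stata
// Assumption: states.dta is in dataSets/ under the current directory
use dataSets/states.dta, clear

quietly regress csat expense income percent high
estimates store Model1
quietly regress csat expense income percent high i.region
estimates store Model2

// Coefficients with standard errors, plus N and (adjusted) R-squared
estimates table Model1 Model2, b(%9.3f) se(%9.3f) stats(N r2 r2_a)

// Coefficients with significance stars (cannot be combined with se)
estimates table Model1 Model2, b(%9.3f) star(0.05 0.01 0.001)
```

Including N in the table makes it visible that the two models are fit to different samples (51 versus 50 observations), because region is missing for one observation.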

Exporting to Excel

Avoid human error when transferring coefficients into tables
Excel can be used to format publication-ready tables

  outreg2 [Model1 Model2] using csatprediction.xls, replace

outreg2 [Model1 Model2] using csatprediction.xls, replace
~/ado/plus/o/outreg2.ado
csatprediction.xls
dir : seeout

Graphing in Stata

Graphing Strategies

Keep it simple
Labels, labels, labels!!
Avoid cluttered graphs
Every part of the graph should be meaningful
Avoid:
- Shading
- Distracting colors
- Decoration
Always know what you’re working with before you get started
- Recognize scale of data
- If you’re using multiple variables – how do their scales align?
Before any graphing procedure review variables with codebook, sum, tab, etc.
HELPFUL STATA HINT: If you want your command to go on multiple lines use /// at end of each line
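
A short sketch of line continuation, using the auto dataset that ships with Stata as a stand-in (an assumption made so the example runs anywhere). Both forms are meant for do-files; /// does not work when typed in the Command window.

```stata
// sysuse replaces the data in memory; reload your own data afterwards
sysuse auto, clear

// Option A: end each continued line with ///
histogram mpg, percent ///
    title("Mileage") ///
    xtitle("Miles per gallon")

// Option B: switch the command delimiter to a semicolon
#delimit ;
histogram mpg, percent
    title("Mileage")
    xtitle("Miles per gallon") ;
#delimit cr
```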

Terrible Graph

Much Better Graph

Univariate Graphics

Our First Dataset

Time Magazine Public School Poll
- Based on survey of 1,000 adults in U.S.
- Conducted in August 2010
- Questions regarding feelings about parental involvement, teachers union, current potential for reform
Open Stata and call up the datafile for today

  // Step 1: tell Stata where to find data:
  cd "~/StataGraphics/dataSets"
  // Step 2: call up our dataset:
  use TimePollPubSchools.dta

Single Continuous Variables

Example: Histograms

Stata assumes you’re working with continuous data
Very simple syntax:
- hist varname
Put a comma after your varname and start adding options
- bin(#) : change the number of bars that the graph displays
- normal : overlay normal curve
- addlabels : add actual values to bars

Histogram Options

To change the numeric depiction of your data add these options after the comma
- Choose one: density fraction frequency percent
Be sure to properly describe your histogram:
- title(insert name of graph)
- subtitle(insert subtitle of graph)
- note(insert note to appear at bottom of graph)
- caption(insert caption to appear below notes)

Histogram Example

  hist F1, bin(10) percent title(TITLE) ///
    subtitle(SUBTITLE) caption(CAPTION) note(NOTES)

Axis Titles & Labels

Axis title options (default is variable label):
- xtitle(insert x axis name)
- ytitle(insert y axis name)
Don’t want axis titles?
- xtitle("")
- ytitle("")
Add labels to X or Y axis:
- xlabel(insert x axis label)
- ylabel(insert y axis label)
Tell Stata how to scale each axis
- xlabel(start#(increment)end#)
- xlabel(0(5)100)
This would label x-axis from 0-100 in increments of 5
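
Here is a minimal sketch of the start#(increment)end# rule, using the auto dataset that ships with Stata as a stand-in (an assumption, so the example does not depend on the workshop files). The axis ranges are illustrative.

```stata
// sysuse replaces the data in memory; reload your own data afterwards
sysuse auto, clear

// Label the x-axis from 10 to 45 in steps of 5,
// and the y-axis from 0 to 40 in steps of 10
histogram mpg, percent ///
    xlabel(10(5)45) ylabel(0(10)40) ///
    xtitle("Mileage (mpg)") ytitle("Percent of cars")
```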

Axis Labels Example

  hist F1, bin(10) percent title(TITLE) subtitle(SUBTITLE) ///
      caption(CAPTION) note(NOTES) ///
      xtitle("Here's your x-axis title") ///
      ytitle("Here's your y-axis title")

Single Categorical Variables

We can also use the hist command for bar graphs
- Simply specify “discrete” with options
Stata will produce one bar for each level (i.e. category) of variable
Use the xlabel() option to insert names of individual categories

  hist F4, title(Racial breakdown of Time Poll Sample) xtitle(Race) ///
  ytitle(Percent) xlabel(1 "White" 2 "Black" 3 "Asian" 4 "Hispanic" ///
   5 "Other") discrete percent addlabels

Exercise 2

Histograms & Bar Graphs

Open the datafile, NatNeighCrimeStudy.dta.
Create a histogram of the tract-level poverty rate (variable name: T_POVRTY).
Insert the normal curve over the histogram
Change the numeric representation on the Y-axis to “percent”
Add appropriate titles to the overall graph and the x axis and y axis. Also, add a note that states the source of this data.
Open the datafile, TimePollPubSchools.dta
Create a histogram of the question, “What grade would you give your child’s school” (variable name: Q11). Be sure to tell Stata that this is a categorical variable.
Format this graph so that the axes have proper titles and labels. Also, add an appropriate title to the overall graph that goes onto two lines. Add a note stating the source of the data.

Bivariate Graphics

Next Dataset:

National Neighborhood Crime Study (NNCS)
- N=9,593 census tracts in 2000
- Explore sources of variation in crime for communities in the United States
- Tract-level data: crime, social disorganization, disadvantage, socioeconomic inequality
- City-level data: labor market, socioeconomic inequality, population change

The Twoway Family

twoway is the basic Stata command for all twoway graphs
Use twoway anytime you want to make comparisons among variables
Can be used to combine graphs (i.e., overlay one graph with another)
- e.g., insert line of best fit over a scatter plot
Some basic examples:

  use NatNeighCrimeStudy.dta
  twoway scatter T_PERCAP T_VIOLNT
  twoway dropline T_PERCAP T_VIOLNT
  twoway lfitci T_PERCAP T_VIOLNT

Twoway & the by Statement

  twoway scatter T_PERCAP T_VIOLNT, by(DIVISION)

Twoway Title Options

Same title options as with histogram
- title(insert name of graph)
- subtitle(insert subtitle of graph)
- note(insert note to appear at bottom of graph)
- caption(insert caption to appear below notes)

Twoway Title Options Example

  twoway scatter T_PERCAP T_VIOLNT, ///
      title(Comparison of Per Capita Income ///
            and Violent Crime Rate at Tract level) ///
  xtitle(Violent Crime Rate) ytitle(Per Capita Income) ///
      note(Source: National Neighborhood Crime Study 2000)

The title is a bit cramped–let’s fix that:

  twoway scatter T_PERCAP T_VIOLNT, ///
      title("Comparison of Per Capita Income" ///
  "and Violent Crime Rate at Tract level") ///
  xtitle(Violent Crime Rate) ytitle(Per Capita Income) ///
  note(Source: National Neighborhood Crime Study 2000)

Twoway Symbol Options

A variety of symbol shapes are available: use palette symbolpalette to see them and msymbol() to set them

Twoway Symbol Options Example

  twoway scatter T_PERCAP T_VIOLNT, ///
      title("Comparison of Per Capita Income" ///
  "and Violent Crime Rate at Tract level") ///
  xtitle(Violent Crime Rate) ytitle(Per Capita Income) ///
  note(Source: National Neighborhood Crime Study 2000) ///
  msymbol(Sh) mcolor("red")

Overlaying Twoway Graphs

Very simple to combine multiple graphs…just put each graph command in parentheses
- twoway (scatter var1 var2) (lfit var1 var2)
Add individual options to each graph within the parentheses
Add overall graph options as usual following the comma
- twoway (scatter var1 var2) (lfit var1 var2), options
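
To make the distinction between per-plot and overall options concrete, here is a sketch using the auto dataset that ships with Stata (a stand-in chosen so the example runs without the workshop files). Options inside a pair of parentheses affect only that plot; options after the final comma apply to the whole graph.

```stata
// sysuse replaces the data in memory; reload your own data afterwards
sysuse auto, clear

twoway (scatter mpg weight, msymbol(Oh) mcolor(navy)) ///
       (lfit mpg weight, lcolor(red) lpattern(dash)), ///
       title("Mileage and weight") ///
       ytitle("Mileage (mpg)") xtitle("Weight (lbs.)") ///
       legend(order(1 "Observed" 2 "Linear fit"))
```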

Overlaying Points & Lines

  twoway (scatter T_PERCAP T_VIOLNT) ///
      (lfit T_PERCAP T_VIOLNT), ///
      title("Comparison of Per Capita Income" ///
            "and Violent Crime Rate at Tract level") ///
      xtitle(Violent Crime Rate) ytitle(Per Capita Income) ///
      note(Source: National Neighborhood Crime Study 2000)

Overlaying Points & Labels

  twoway (scatter T_PERCAP T_VIOLNT if T_VIOLNT==1976, ///
          mlabel(CITY)) (scatter T_PERCAP T_VIOLNT), ///
      title("Comparison of Per Capita Income" ///
            "and Violent Crime Rate at Tract level") ///
      xlabel(0(200)2400) note(Source: National Neighborhood ///
                              Crime Study 2000) legend(off)

Exercise 3

The TwoWay Family

Open the datafile, NatNeighCrimeStudy.dta.

Create a basic twoway scatterplot that compares the city unemployment rate (C_UNEMP) to the percent secondary sector low-wage jobs (C_SSLOW)
Generate the same scatterplot, but this time, divide the plot by the dummy variable indicating whether the city is located in the south or not (C_SOUTH)
Change the color of the symbol that you use in this scatter plot
Change the type of symbol you use to a marker of your choice
Notice in your scatterplot that is broken down by C_SOUTH that there is an outlier in the upper right hand corner of the “Not South” graph. Add the city name label to this marker.
Review the options available under “help twoway_options” and change one aspect of your graph using an option that we haven’t already reviewed

Twoway Line Graphs

Line graphs are helpful for a variety of data
- Especially any type of time series data
We’ll use data on US life expectancy from 1900-1999
- webuse uslifeexp, clear

  webuse uslifeexp, clear
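  // line color is set with lcolor(); mcolor() applies to markers
  twoway (line le_wm year, lcolor("red")) ///
      (line le_bm year, lcolor("green"))

  // several y variables can share one line plot; list them before the x variable
  twoway line le_wfemale le_wmale le_bf le_bm year, ///
      lpattern(dot solid dot solid)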

Stata Graphing Lines

  palette linepalette

Exporting Graphs

From Stata, right click on image and select “save as” or try syntax:
- graph export myfig.eps, replace
In Microsoft Word: insert -> picture -> from file
- Or, right click on graph in Stata and copy and paste into MS Word
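
A short sketch of exporting from a do-file, using the auto dataset that ships with Stata as a stand-in. The file names are hypothetical, and the formats available can depend on your platform and Stata flavor (see help graph export). graph export picks the format from the file suffix, while graph save keeps an editable Stata .gph file.

```stata
// sysuse replaces the data in memory; reload your own data afterwards
sysuse auto, clear
twoway scatter mpg weight

// Save an editable Stata graph file (hypothetical file name)
graph save myfig.gph, replace

// Export a bitmap; width() sets the size in pixels
graph export myfig.png, replace width(1600)
```

A saved .gph file can be reopened later with graph use and exported again in another format.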

Exercise Solutions

Ex 0: prototype

Ex 1: prototype

Ex 2: prototype

Ex 3: prototype

Wrap-up

Feedback

These workshops are a work in progress; please send any feedback to: help@iq.harvard.edu

Resources

IQSS
- Workshops: https://dss.iq.harvard.edu/workshop-materials
- Data Science Services: https://dss.iq.harvard.edu/
- Research Computing Environment: https://iqss.github.io/dss-rce/
HBS
- Research Computing Services workshops: https://training.rcs.hbs.org/workshops
- Other HBS RCS resources: https://training.rcs.hbs.org/workshop-materials
- RCS consulting email: mailto:research@hbs.edu
Stata
- UCLA website: http://www.ats.ucla.edu/stat/Stata/
- Stata website: http://www.stata.com/help.cgi?contents
- Email list: http://www.stata.com/statalist/