Andrew Marder

Stata is a commonly used tool for empirical research. It comes with an extensive library of statistical methods, and user-written packages extend its functionality even further.

Stata stores data in memory as a single matrix. If you are familiar with Microsoft Excel Workbooks, Stata stores a single Worksheet in memory where each column has a name and each row is numbered from 1 to the total number of rows in the dataset.

This tutorial aims to introduce you to the key features of Stata and its documentation so you can start your own empirical work.

Typing Commands

The display command is useful for showing values at the command line.

. display 1 + 2
3
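Beyond simple arithmetic, display also evaluates functions and prints text. A small sketch (the particular expressions are illustrative and not part of the original example):

```stata
* display works as a calculator and as a way to print text
display 7 / 2
display sqrt(16)
display "The answer is " 1 + 2
```

The last line mixes a quoted string with an expression; display prints them one after the other.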

Use the Page Up key to recall the previous command evaluated. This is particularly useful if you need to fix a typo.

Commands can be abbreviated; for example, di is equivalent to display. I prefer to use the whole command name because it makes code explicit.

Getting Help

Use the help command if you know the name of a command and want more details. Use the findit command if you want to find a command by keyword. I end up using Google more than findit, but this may be a mistake.

Unfortunately, the help command opens a new window each time you use it. Use the nonew option to prevent this behavior, for example help help, nonew.
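As a rough illustration of the three ways of getting help described above (the topics looked up are arbitrary choices):

```stata
* Open the help file for a command whose name you already know
help summarize

* Search help files, FAQs, and user-written packages by keyword
findit regression

* Reuse the current Viewer window instead of opening a new one
help regress, nonew
```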

Reading Data Into Stata

There are many different ways to read data into Stata. For a good overview, type help import in Stata’s Command window. The commands I use most are import excel and insheet. import excel is great if you are working with an Excel workbook, while insheet is great if you have a comma-separated values (csv) file (in Stata 13 and later, import delimited supersedes insheet).

Stata datasets are generally stored in files with a .dta extension. To read a Stata dataset, use the use command. For the purpose of this tutorial we will use a dataset about automobiles that ships with Stata. Type sysuse auto to load the dataset into memory; the clear option discards any data already in memory.

. sysuse auto, clear
(1978 Automobile Data)
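To try the import commands mentioned above without an external file, one option is a round trip through a csv file. This is only a sketch: it assumes Stata 13 or later (for export delimited and import delimited), and auto_copy.csv is a hypothetical file name written to the current working directory.

```stata
* Load the example data, write it out as a csv file, then read it back in.
* auto_copy.csv is a hypothetical file name, created in the working directory.
sysuse auto, clear
export delimited using "auto_copy.csv", replace
import delimited using "auto_copy.csv", clear
describe
```

If you continue with the tutorial afterwards, run sysuse auto, clear again so that you are working with the original dataset.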

Descriptive Statistics

The describe command gives useful information about the variables in the dataset and the number of rows in the dataset.

. describe
Contains data from /Applications/Stata/ado/base/a/auto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          13 Apr 2011 17:45
 size:         3,182                          (_dta has notes)
--------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair Record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn Circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear Ratio
foreign         byte    %8.0g      origin     Car type
--------------------------------------------------------------------------------------------------------
Sorted by: foreign

The summarize command gives some useful summary statistics for each variable.

. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        make |         0
       price |        74    6165.257    2949.496       3291      15906
         mpg |        74     21.2973    5.785503         12         41
       rep78 |        69    3.405797    .9899323          1          5
    headroom |        74    2.993243    .8459948        1.5          5
-------------+--------------------------------------------------------
       trunk |        74    13.75676    4.277404          5         23
      weight |        74    3019.459    777.1936       1760       4840
      length |        74    187.9324    22.26634        142        233
        turn |        74    39.64865    4.399354         31         51
displacement |        74    197.2973    91.83722         79        425
-------------+--------------------------------------------------------
  gear_ratio |        74    3.014865    .4562871       2.19       3.89
     foreign |        74    .2972973    .4601885          0          1
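summarize also accepts a variable list, an if condition, and a detail option, and it stores its results for later use. A minimal sketch (the choice of variables here is illustrative):

```stata
sysuse auto, clear

* Restrict summarize to chosen variables and a subset of rows
summarize price mpg if foreign == 1

* The detail option adds percentiles, skewness, and kurtosis
summarize price, detail

* summarize leaves its results behind in r(); r(mean) is the mean of price
display r(mean)
```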

You’ll notice that 11 of 12 variables in the auto dataset are numeric and the make variable is a string. To see what the make variable looks like, we can list the first few observations.

. list make if _n <= 5
     +---------------+
     | make          |
     |---------------|
  1. | AMC Concord   |
  2. | AMC Pacer     |
  3. | AMC Spirit    |
  4. | Buick Century |
  5. | Buick Electra |
     +---------------+
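The same subset can be selected with an in range instead of a condition on _n. A short sketch:

```stata
sysuse auto, clear

* in 1/5 selects observations 1 through 5, like if _n <= 5
list make in 1/5

* A negative start counts from the end: the last five observations
list make in -5/l
```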

To see if make uniquely identifies each row in the dataset, we can use the isid command.

. isid make

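When isid says nothing, the variable list does uniquely identify each row. Are cars also uniquely identified by their weight and length? The duplicates report command counts how many rows share the same values; it shows that make has no duplicates, while four cars (two pairs) share the same weight and length.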

. duplicates report make
Duplicates in terms of make
--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |           74             0
--------------------------------------

. duplicates report weight length
Duplicates in terms of weight length
--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |           70             0
        2 |            4             2
--------------------------------------
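To find out which cars share a weight and length, one possible approach is to tag the duplicated rows and list them. In this sketch, dup is a hypothetical variable name, and capture is used so that a failing isid does not stop a do-file.

```stata
sysuse auto, clear

* capture suppresses the error so a do-file keeps running;
* _rc is zero only when the variables uniquely identify the rows
capture isid weight length
display _rc

* Flag rows that share weight and length with another row, then list them
duplicates tag weight length, generate(dup)
list make weight length if dup > 0
```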

Imagine we are interested in how foreign and domestic cars differ. As a first step, it would be good to examine some summary statistics for each group. The tabstat command makes this fairly easy.

. tabstat price mpg weight length, by(foreign) stat(mean sd)
Summary statistics: mean, sd
  by categories of: foreign (Car type)
 foreign |     price       mpg    weight    length
---------+----------------------------------------
Domestic |  6072.423  19.82692  3317.115  196.1346
         |  3097.104  4.743297  695.3637  20.04605
---------+----------------------------------------
 Foreign |  6384.682  24.77273  2315.909  168.5455
         |  2621.915  6.611187  433.0035  13.68255
---------+----------------------------------------
   Total |  6165.257   21.2973  3019.459  187.9324
         |  2949.496  5.785503  777.1936  22.26634
--------------------------------------------------
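tabstat is not the only way to get group-wise statistics. Two alternatives, offered as a sketch rather than a recommendation:

```stata
sysuse auto, clear

* Run summarize separately for domestic and foreign cars
bysort foreign: summarize price mpg

* One-way table of means, standard deviations, and counts of price by car type
tabulate foreign, summarize(price)
```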

You may have noticed from the output of the summarize command that rep78 has 5 missing values. We can look at those observations using the list command:

. list if missing(rep78)
+---------------------------------------------------------------------------------------------+
3. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| AMC Spirit | 3,799 | 22 | . | 3.0 | 12 | 2,640 | 168 | 35 | 121 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 3.08 | Domestic |
+---------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------+
7. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| Buick Opel | 4,453 | 26 | . | 3.0 | 10 | 2,230 | 170 | 34 | 304 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 2.87 | Domestic |
+---------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------+
45. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| Plym. Sapporo | 6,486 | 26 | . | 1.5 | 8 | 2,520 | 182 | 38 | 119 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 3.54 | Domestic |
+---------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------+
51. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| Pont. Phoenix | 4,424 | 19 | . | 3.5 | 13 | 3,420 | 203 | 43 | 231 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 3.08 | Domestic |
+---------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------+
64. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| Peugeot 604 | 12,990 | 14 | . | 3.5 | 14 | 3,420 | 192 | 38 | 163 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 3.58 | Foreign |
+---------------------------------------------------------------------------------------------+
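Missing values deserve care in conditions, because Stata treats a numeric missing value as larger than any number. The sketch below counts the missing repair records and shows how to keep them out of a comparison (the threshold of 3 is an arbitrary choice):

```stata
sysuse auto, clear

* Number of observations with a missing repair record
count if missing(rep78)

* Missing values compare as larger than any number, so this also counts them
count if rep78 > 3

* Exclude missing values explicitly to count only observed records above 3
count if rep78 > 3 & !missing(rep78)
```

The second count exceeds the third by exactly the number of missing values.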

Graphs

There are good graph galleries provided by StataCorp, UCLA, and Survey Design and Analysis Services. Below is a simple scatter plot of weight versus length:

. graph twoway scatter weight length
. graph export scatter.png, replace
(file scatter.png written in PNG format)
[Figure: scatter plot of weight versus length]
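Graph commands take options for titles, axis labels, and grouping. A minimal sketch, with title text chosen for illustration:

```stata
sysuse auto, clear

* Add a title and axis labels to the scatter plot
graph twoway scatter weight length, title("Weight versus length") xtitle("Length (in.)") ytitle("Weight (lbs.)")

* Draw separate panels for domestic and foreign cars
graph twoway scatter weight length, by(foreign)
```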

Creating New Variables

There are a number of ways to create new variables or modify existing variables. The most important command in this section is the generate command. Imagine we are curious about cars that are heavy for their length. We could create a new variable:

. generate weight_per_length = weight / length

This creates a new column in the dataset: for each car, we have calculated the ratio of its weight to its length. Let’s take a look at the five cars with the highest weight per inch of length.

. gsort -weight_per_length
. list make weight_per_length if _n <= 5
     +------------------------------+
     | make                weight~h |
     |------------------------------|
  1. | Cad. Seville        21.02941 |
  2. | Linc. Continental   20.77253 |
  3. | Linc. Mark V        20.52174 |
  4. | Cad. Deville        19.59276 |
  5. | Olds Toronado       19.56311 |
     +------------------------------+
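Modifying an existing variable is done with replace rather than generate. The following sketch uses a hypothetical variable, weight_kg, and an approximate conversion factor of 0.4536 kilograms per pound:

```stata
sysuse auto, clear

* generate creates a variable; weight_kg is a hypothetical name
generate weight_kg = weight * 0.4536

* replace modifies an existing variable, here rounding to whole kilograms
replace weight_kg = round(weight_kg)

* Attach a descriptive label that describe will show
label variable weight_kg "Weight (kg)"
describe weight_kg
```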

Another very useful command for generating new variables is the egen command. This is particularly useful if you want to merge summary statistics for groups of cars back into the larger dataset. For instance, we might be curious to see how a car’s price compares to the average price among foreign or domestic cars. We can find the average price for foreign and domestic cars using tabstat, but how do we make a column in the dataset with these values?

. tabstat price, by(foreign)
Summary for variables: price
     by categories of: foreign (Car type)
 foreign |      mean
---------+----------
Domestic |  6072.423
 Foreign |  6384.682
---------+----------
   Total |  6165.257
--------------------

. egen ave_price = mean(price), by(foreign)
. list foreign ave_price
+---------------------+
| foreign ave_pr~e |
|---------------------|
1. | Domestic 6072.423 |
2. | Domestic 6072.423 |
3. | Domestic 6072.423 |
4. | Domestic 6072.423 |
5. | Domestic 6072.423 |
|---------------------|
6. | Domestic 6072.423 |
7. | Domestic 6072.423 |
8. | Domestic 6072.423 |
9. | Domestic 6072.423 |
10. | Domestic 6072.423 |
|---------------------|
11. | Domestic 6072.423 |
12. | Domestic 6072.423 |
13. | Domestic 6072.423 |
14. | Domestic 6072.423 |
15. | Foreign 6384.682 |
|---------------------|
16. | Domestic 6072.423 |
17. | Domestic 6072.423 |
18. | Domestic 6072.423 |
19. | Domestic 6072.423 |
20. | Domestic 6072.423 |
|---------------------|
21. | Domestic 6072.423 |
22. | Domestic 6072.423 |
23. | Domestic 6072.423 |
24. | Domestic 6072.423 |
25. | Domestic 6072.423 |
|---------------------|
26. | Domestic 6072.423 |
27. | Domestic 6072.423 |
28. | Domestic 6072.423 |
29. | Domestic 6072.423 |
30. | Domestic 6072.423 |
|---------------------|
31. | Domestic 6072.423 |
32. | Domestic 6072.423 |
33. | Domestic 6072.423 |
34. | Domestic 6072.423 |
35. | Foreign 6384.682 |
|---------------------|
36. | Domestic 6072.423 |
37. | Domestic 6072.423 |
38. | Domestic 6072.423 |
39. | Domestic 6072.423 |
40. | Domestic 6072.423 |
|---------------------|
41. | Domestic 6072.423 |
42. | Domestic 6072.423 |
43. | Domestic 6072.423 |
44. | Foreign 6384.682 |
45. | Domestic 6072.423 |
|---------------------|
46. | Domestic 6072.423 |
47. | Foreign 6384.682 |
48. | Foreign 6384.682 |
49. | Foreign 6384.682 |
50. | Domestic 6072.423 |
|---------------------|
51. | Domestic 6072.423 |
52. | Foreign 6384.682 |
53. | Foreign 6384.682 |
54. | Domestic 6072.423 |
55. | Foreign 6384.682 |
|---------------------|
56. | Domestic 6072.423 |
57. | Foreign 6384.682 |
58. | Foreign 6384.682 |
59. | Foreign 6384.682 |
60. | Domestic 6072.423 |
|---------------------|
61. | Foreign 6384.682 |
62. | Domestic 6072.423 |
63. | Domestic 6072.423 |
64. | Foreign 6384.682 |
65. | Foreign 6384.682 |
|---------------------|
66. | Foreign 6384.682 |
67. | Foreign 6384.682 |
68. | Foreign 6384.682 |
69. | Foreign 6384.682 |
70. | Domestic 6072.423 |
|---------------------|
71. | Foreign 6384.682 |
72. | Foreign 6384.682 |
73. | Foreign 6384.682 |
74. | Domestic 6072.423 |
+---------------------+
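With the group average stored as a column, comparing each car's price to its group is a single generate. A sketch, where price_diff is a hypothetical variable name:

```stata
sysuse auto, clear

* Average price within each car type, repeated on every row of that type
egen ave_price = mean(price), by(foreign)

* How far each car's price sits above or below its group average
generate price_diff = price - ave_price

* The five cars priced furthest above their group average
gsort -price_diff
list make foreign price ave_price price_diff in 1/5
```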

Regressions

To further explore the relationship between weight and length we can run a regression.

. regress weight length
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =  613.27
       Model |  39461306.8     1  39461306.8           Prob > F      =  0.0000
    Residual |  4632871.55    72  64345.4382           R-squared     =  0.8949
-------------+------------------------------           Adj R-squared =  0.8935
       Total |  44094178.4    73  604029.841           Root MSE      =  253.66
------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      length |   33.01988   1.333364    24.76   0.000     30.36187    35.67789
       _cons |  -3186.047   252.3113   -12.63   0.000     -3689.02   -2683.073
------------------------------------------------------------------------------

We see that, on average, each additional inch of length is associated with about 33 additional pounds of weight. We can plot the predicted values from the regression on the scatter plot from above.

. graph twoway (scatter weight length) (lfit weight length)
. graph export scatter_lfit.png, replace
(file scatter_lfit.png written in PNG format)
[Figure: scatter plot of weight versus length with the fitted regression line]
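The estimates from the regression can also be used directly after it runs. The sketch below reads a coefficient, computes a prediction by hand, and stores fitted values and residuals; weight_hat and weight_resid are hypothetical variable names, and the 200-inch car is an arbitrary example.

```stata
sysuse auto, clear
regress weight length

* Coefficients are available as _b[varname] after estimation
display _b[length]

* Predicted weight of a 200-inch car, roughly 3,418 pounds
display _b[_cons] + _b[length] * 200

* Store fitted values and residuals as new variables (hypothetical names)
predict weight_hat, xb
predict weight_resid, residuals
list make weight weight_hat weight_resid in 1/5
```

The hand calculation follows from the table above: -3186.047 + 33.01988 × 200 ≈ 3417.9.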

Further Reading

Germán Rodríguez’s Stata Tutorial is an excellent introduction to Stata.

These notes on writing code by Matthew Gentzkow and Jesse Shapiro have excellent suggestions on how to program with Stata.