An Introduction to Stata

F. PERACCHI
Faculty of Economics, Tor Vergata University, Rome, Italy
Contents

Introduction

1 Getting Started
1.1 STARTING AND STOPPING STATA
1.1.1 THE STATA WINDOWS
1.1.2 THE STATA TOOLBAR
1.1.3 ALLOCATING MEMORY TO STATA
1.2 STATA DOCUMENTATION AND UPDATES
1.2.1 THE HELP SYSTEM
1.2.2 THE REFERENCE MANUAL
1.2.3 THE STATA TECHNICAL BULLETIN
1.2.4 TUTORIALS
1.2.5 STATA UPDATES
1.3 VARIABLES AND OBSERVATIONS
1.3.1 VARIABLES
1.3.2 OBSERVATIONS
1.4 INPUTTING DATA
1.4.1 DIRECT TYPING
1.4.2 THE DATA EDITOR
1.4.3 LOADING AN ASCII (TEXT) DATA FILE
1.4.4 LOADING A STATA DATA FILE
1.5 BASIC DATA MANIPULATION
1.5.1 DISPLAYING DATA
1.5.2 LABELING DATA
1.5.3 SUMMARIZING DATA
1.5.4 CREATING NEW VARIABLES
1.5.5 CHANGING AND RENAMING VARIABLES
1.5.6 ELIMINATING VARIABLES OR OBSERVATIONS
1.5.7 INCREASING THE NUMBER OF OBSERVATIONS IN A DATASET
1.6 OUTPUTTING DATA
1.7 LOG FILES

2 Stata Commands
2.1 GENERAL SYNTAX
2.1.1 BY
2.1.2 WEIGHTS
2.1.3 IF AND IN
2.1.4 QUIETLY AND NOISILY
2.2 BASIC DATA COMMANDS
2.2.1 DESCRIBE
2.2.2 LIST
2.2.3 DROP AND KEEP
2.2.4 GENERATE AND EGEN
2.2.5 REPLACE
2.2.6 SORT AND GSORT
2.3 COMBINING DATA
2.3.1 APPEND
2.3.2 MERGE
2.4 RESHAPING DATA
2.4.1 COLLAPSE
2.4.2 CONTRACT
2.4.3 EXPAND
2.4.4 FILLIN
2.4.5 RESHAPE
2.5 BASIC SAMPLE STATISTICS
2.5.1 COUNT
2.5.2 SUMMARIZE
2.5.3 MEANS
2.5.4 CENTILE
2.5.5 CUMUL
2.5.6 CORRELATE
2.5.7 REGRESS
2.6 TABLES
2.6.1 TABLE
2.6.2 TABULATE
2.6.3 TABSUM

3 Graphics
3.1 BASIC SYNTAX AND GRAPHIC STYLES
3.2 COMMON GRAPH OPTIONS
3.3 HISTOGRAMS
3.4 TWO-WAY SCATTERPLOTS
3.5 TWO-WAY SCATTERPLOT MATRICES
3.6 BOX PLOTS

4 Programming and Matrix Commands
4.1 PROGRAMMING STATA
4.1.1 MACROS
4.1.2 SYSTEM MACROS
4.1.3 LOOPING
4.1.4 BRANCHING
4.1.5 PROGRAM ARGUMENTS
4.1.6 TEMPORARY OBJECTS
4.1.7 EXCHANGING RESULTS BETWEEN PROGRAMS
4.2 DO FILES
4.3 ADO FILES
4.4 MATRIX COMMANDS
4.4.1 ROW AND COLUMN NAMES
4.4.2 SUBSCRIPTING AND SUBMATRICES
4.4.3 MATRIX OPERATORS AND FUNCTIONS
4.4.4 CROSS-PRODUCT MATRICES
4.4.5 DATA TO MATRIX CONVERSION
4.4.6 GETTING SYSTEM MATRICES
4.4.7 MATRIX DECOMPOSITION

5 Statistical Inference Using Stata
5.1 ESTIMATION
5.1.1 GENERAL SYNTAX OF ESTIMATION COMMANDS
5.1.2 WEIGHTED ESTIMATION
5.1.3 CONSTRAINED ESTIMATION
5.1.4 ROBUST VARIANCE ESTIMATES
5.2 POST-ESTIMATION COMMANDS
5.2.1 ACCESSING COEFFICIENTS AND STANDARD ERRORS
5.2.2 DISPLAYING THE VARIANCE ESTIMATES
5.2.3 PREDICTIONS AND RESIDUALS
5.2.4 HYPOTHESIS TESTING
5.3 BOOTSTRAPPING AND MONTE CARLO SIMULATIONS
5.3.1 BOOTSTRAP
5.3.2 MONTE CARLO SIMULATION

6 Statistical Models in Stata
6.1 LINEAR MODELS
6.1.1 ORDINARY LEAST SQUARES
6.1.2 CONSTRAINED LINEAR REGRESSION
6.1.3 LINEAR INSTRUMENTAL VARIABLES
6.2 GENERALIZED LINEAR MODELS
6.2.1 GLM
6.2.2 LOGIT AND PROBIT
6.2.3 POISSON AND NBREG
6.3 OTHER LIMITED DEPENDENT VARIABLES MODELS
6.3.1 GROUPED BINARY RESPONSES
6.3.2 ORDERED CATEGORICAL RESPONSES
6.3.3 NESTED LOGIT
6.3.4 MULTINOMIAL LOGIT
6.3.5 BIPROBIT
6.3.6 CENSORED AND TRUNCATED REGRESSION
6.4 DURATION DATA
6.4.1 PARAMETRIC DURATION MODELS
6.4.2 COX PROPORTIONAL HAZARD MODEL
6.5 TIME SERIES
6.5.1 LINEAR MODELS WITH AUTOCORRELATED ERRORS
6.5.2 ARIMA MODELS
6.5.3 ARCH-TYPE MODELS
6.6 PANEL DATA
6.6.1 LINEAR PANEL DATA MODELS
6.6.2 DYNAMIC PANEL DATA MODELS
6.6.3 SEEMINGLY UNRELATED REGRESSION EQUATIONS
6.6.4 GEE FOR PANEL DATA
6.6.5 LOGIT AND PROBIT FOR PANEL DATA
6.6.6 POISSON AND NEGATIVE BINOMIAL MODELS
6.7 NONPARAMETRIC ESTIMATION
6.7.1 DENSITY ESTIMATION
6.7.2 REGRESSION SMOOTHERS
6.8 ROBUST AND QUANTILE REGRESSION
6.8.1 ROBUST REGRESSION
6.8.2 QUANTILE REGRESSION
6.9 GENERAL NONLINEAR METHODS

References

Introduction

Why use Stata? In my view, it has three main advantages over other statistical
packages.

The first is portability: Stata runs on several platforms (Macintosh, Unix, Windows),
and Stata programs written for one of them run with (almost) no change on any other
one. The latest release is Stata 7.0. In what follows I focus on Stata 7.0 for Windows
98/95/NT.

The second is speed: Stata is fast because all data manipulations are carried out in
the RAM. The only limit is the amount of RAM available. With 100MB of RAM, one
can work with a dataset containing 5 million observations on 4 real-valued variables
or, equivalently, with one million observations on 20 real-valued variables.

The third advantage is that Stata contains “state-of-the-art” statistical procedures, is
programmable, and is fully integrated with a matrix language.

This introduction to Stata is organized as follows. Chapter 1 describes the main
features of the program. Chapter 2 introduces the syntax of a Stata command
and presents some of the most frequently used commands. Chapter 3 describes Stata's
graphics capabilities. Chapter 4 introduces the elements of Stata programming and the Stata
matrix language. Chapter 5 shows how to carry out statistical inference (estimation,
prediction, hypothesis testing) using Stata. Finally, Chapter 6 reviews the main classes
of statistical models implemented in Stata.

1 Getting Started
This chapter introduces the main aspects of Stata, namely how to start and stop the
program (Section 1.1), where to look for documentation and updates (Section 1.2), the
definition of variables and observations (Section 1.3), how to input data (Section 1.4),
basic data manipulation (Section 1.5), how to output data (Section 1.6), and how to
open and close log files (Section 1.7).
I adopt the following typographic conventions: a typewriter-style typeface is used
for Stata commands or options that have to be typed in (e.g. describe or generate),
italics are used for placeholders that must be replaced by an actual name (e.g. varname
or varlist), small caps are used for keyboard keys (e.g. Enter or Ctrl+Break) and
boldface is used for Windows commands or switches (e.g. Exit or Help).

1.1 STARTING AND STOPPING STATA


To start Stata, click on the Stata icon. To exit Stata, type exit or choose Exit from
the File menu. To exit Stata when there are data in memory which have not been
saved, type exit, clear.
To test the installation of Stata, type verinst. To test that the supplied ado files (see
Section 4.3) are correctly installed, type crc.
To make Stata stop what it is doing and return to the Stata prompt, click the Break
button or press Ctrl+Break.

1.1.1 THE STATA WINDOWS


The Stata interface consists of four windows:
• the Stata Command window (where commands are typed in and then issued
by pressing Enter),

• the Stata Results window (where results are displayed),

• the Review window (which shows past commands),

• the Variables window (which shows the list of variables).


All these windows may be resized and rearranged. The windowing preferences may be
saved by choosing Prefs from the main menu bar.
The Stata Command window follows the standard Windows editing style. The keys for
editing in the Command window are Delete, Backspace, Esc, Home, End, Page
Up and Page Down. One can copy one line at a time from the Results window to
the clipboard and paste it into the Command window. Clicking once on a command in the
Review window copies it to the Command window, where it can be edited
before being entered.

1.1.2 THE STATA TOOLBAR


Going from left to right, the Stata toolbar contains the following buttons (a box with
a brief description appears when the mouse pointer is held over a button):

1. Open (opens a Stata dataset),

2. Save (saves to disk the Stata dataset currently in memory),

3. Print (prints a graph or log),

4. Begin Log (starts a new log, appends to an existing log, and stops or suspends
the current log),

5. Start Viewer (opens the Stata viewer for help on Stata),

6. Bring Dialog Window to Front (brings the Dialog window to the front of the
other Stata windows),

7. Bring Results Window to Front (brings the Results window to the front of
the other Stata windows),

8. Bring Graph Window to Front (brings the Graph window to the front of the
other Stata windows),

9. Do-file Editor (opens the Do-file editor or brings the Do-file Editor window
to the front of the other Stata windows),

10. Data Editor (opens the data editor or brings the Data Editor window to the
front of the other Stata windows),

11. Data Browser (opens the data browser or brings the Data Browser window
to the front of the other Stata windows),

12. Clear --more-- Condition (tells Stata to continue when it has paused in the
middle of a long output),

13. Break (stops the current task in Stata).

1.1.3 ALLOCATING MEMORY TO STATA


Initially, Stata allocates 1MB of memory to each session. Under Windows, to
permanently change the amount of memory used every time Stata is invoked, click on
the Stata icon, pull down File and choose Properties, click on the Shortcut tab
and put k# (in kilobytes) or m# (in megabytes) after the call to wstata.exe in the
Target line.
The Start in line specifies the initial working directory. This may be changed
by editing the line or by using the cd drive:/directory_name command from the
Command line.
Memory allocation may also be changed within a given Stata session (although not
permanently) by using the set memory #k command (in kilobytes) or the set memory
#m command (in megabytes). This command requires that no data be present in
memory.
If more memory is used than physically available on the computer, Stata slows down.
In this case, it is recommended to set virtual memory on by typing
. set virtual on
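
As a rough sketch of the session-level commands described in this section, the lines below clear the data, request more memory and switch virtual memory on. The 32m figure is an arbitrary illustration, and the commands assume the Stata 7 memory model described here; other releases may handle memory differently.

```stata
* Illustrative values only; assumes the Stata 7 memory model
* set memory requires that no data be in memory
clear
set memory 32m
* allow Stata to rely on virtual memory if physical RAM is exceeded
set virtual on
* report how memory is currently being used
memory
```

Because the change made by set memory lasts only for the current session, a permanent setting still has to go through the shortcut properties.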

1.2 STATA DOCUMENTATION AND UPDATES


The main documentation comes from the help system and the Stata reference manual.
Additional documentation on Stata developments and updates is available through the
Stata Technical Bulletin and the Stata Web site http://www.stata.com.

1.2.1 THE HELP SYSTEM


On-line help can be accessed by opening the Stata viewer from the toolbar or by
choosing Help from the main menu bar. In either case, selecting Contents opens the
table of contents for on-line help. Selecting Search . . . and then entering keyword
searches for keyword in the list of help entries.
On-line help can also be accessed from the Command line by typing help keyword or
lookup keyword. Try help, or help contents, or help list.

1.2.2 THE REFERENCE MANUAL


The printed documentation consists of the introductory booklet Getting Started with Stata,
plus seven volumes: the User’s Guide, the Graphics Manual, the Programming Manual and the
Reference Manual in four volumes.

1.2.3 THE STATA TECHNICAL BULLETIN


The Stata Technical Bulletin (STB) is a printed and electronic journal with
corresponding software. It contains articles written by Stata Corp., Stata users,
and others. Articles have included enhancements to Stata (ado-files), tutorials on
programming strategies, illustrations of data analysis techniques, discussions on
teaching statistics, debates on appropriate statistical techniques, reports on other
programs, along with interesting datasets, questions, and suggestions.
The STB is published every two months (in January, March, May, July, September
and November). Every year, the 6 issues are bound into a volume.
1.2.4 TUTORIALS
Stata provides tutorials on a variety of aspects: introduction to Stata (intro.tut),
data input, graphics, tables, and procedures for statistical modeling.
To run a tutorial, type tutorial tutname, where tutname is any of the following:

• contents (lists the available official Stata tutorials),

• intro (introductory tutorial),

• yourdata (how to input data),

• graphics (how to make graphs),

• tables (how to make tables),

• regress (estimating regression models, including 2SLS),

• anova (estimating one-, two- and N-way ANOVA and ANCOVA models),

• factor (estimating factor and principal component models),

• logit (estimating maximum-likelihood logit and probit models),

• survival (estimating maximum-likelihood survival models),

• ourdata (description of the data provided by Stata).

1.2.5 STATA UPDATES


The Web site http://www.stata.com contains, among other things, answers to
frequently asked questions (FAQs), free additions to Stata (“Cool ado-files”) and the
latest official updates to Stata, which are released fairly frequently (every 3 to 4 weeks).
The latter can be downloaded directly using the update command.
Another useful command is net, which fetches and installs additions to Stata obtained
from the Internet or from media. The additions can be ado-files (new commands), help
files, or even datasets. Collections of files are bound together into packages. The net
search keywords command searches the Internet for user-written additions to Stata
that contain the specified keywords.
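
A minimal sketch of how these two commands might be used from the Command window follows. It assumes a working Internet connection, and the search keywords are purely illustrative.

```stata
* Requires an Internet connection; the keywords are hypothetical
* compare the installed files with the latest official update
update query
* look for user-written additions that mention the keywords
net search kernel density
```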

1.3 VARIABLES AND OBSERVATIONS


In Stata, variables are associated with the columns of a data matrix, observations with
its rows.

1.3.1 VARIABLES
Variables come in two types: alphabetic (strings) or numeric (real or integer valued).
Variables are referred to by their name, which must be 1 to 32 characters long in Stata 7 (earlier releases allowed at most 8 characters). The first
character must be a letter or an underscore, the other characters can be letters, digits
or underscores (spaces or other characters are not allowed). It is better to avoid
naming a variable e and to avoid names that begin with an underscore (all Stata
built-in variables begin with an underscore). Stata is case sensitive (xx, xX, Xx and XX
are all different names). A few reserved names cannot be used, e.g. if,
in, int, with.
Variables can be renamed using the rename command. For example: rename x y
renames the variable x as y.
Associated with each type of variable is a storage type. The available storage types
are:
• String variables: str#, where # is an integer between 1 and 80 specifying
the number of characters in the string. The maximum length of a string is 80
characters.

• Real valued variables: double (double precision or about 16 digits of accuracy)
and float (single precision or about 7 digits of accuracy). Notice that Stata
uses ‘.’ (a period) to denote both the decimal symbol and missing numerical
values.

• Integer valued variables: long (integers between -2,147,483,648 and
2,147,483,646), int (integers between -32,768 and 32,766) and byte (integers
between -127 and 126).

A double occupies twice as much space as a float, a long occupies the same space
as a float, an int occupies half the space of a float, and a byte occupies half the
space of an int. Thus, a byte occupies 1/4 of the space of a float and 1/8 of the
space of a double, which allows categorical and indicator variables to be stored very
efficiently.
The default for storing numeric variables is float, but Stata performs all internal
calculations in double.
The compress command may be used to automatically optimize the storage type of
the data in memory.
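
The following sketch illustrates how storage types may be chosen explicitly and then optimized. The variable names and values are made up for the example; only generate, describe and compress are taken from the text.

```stata
clear
set obs 100
* indicator stored explicitly as a byte
generate byte female = mod(_n, 2)
* no type given, so the default (float) is used
generate age = 20 + mod(_n, 40)
* double precision requested explicitly
generate double income = 1000.123456789*_n
describe
* compress demotes variables when no information is lost
compress
describe
```

Since age only takes small integer values, the second describe should report it as a byte, whereas income should remain a double.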

1.3.2 OBSERVATIONS
Observations correspond to the rows of a data matrix. Stata automatically creates and
updates the built-in system variable _n, which is a counter containing the number of
the current observation, and the system macro _N, which contains the total number
of observations in the dataset.
Notice that when the sorting of the data changes, so does the counter _n.
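
A small sketch, using made-up data, of how _n follows the current sort order while _N stays fixed:

```stata
clear
set obs 5
generate x = 6 - _n
* record the observation number under the original ordering
generate before = _n
sort x
* _n now reflects the new ordering
generate after = _n
generate total = _N
list
```

After the sort, the observation that was last (x equal to 1) comes first, so before and after differ while total equals 5 everywhere.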

1.4 INPUTTING DATA


Data can be entered into Stata by direct typing, with the data editor or from a file
(an ASCII file or a Stata data file).
1.4.1 DIRECT TYPING


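The input command allows typing data directly into the dataset in memory. String
variables must be declared with their storage type, and data entry is terminated by
typing end.
Example:
. input x1 x2 str11 x3
4 3.5 "J. Neyman"
7 .01 "R.A. Fisher"
2 11.5 "J. Tukey"
3 21.1 "D.R. Cox"
end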

1.4.2 THE DATA EDITOR


The data editor corresponds to the edit command. It may be accessed by clicking
the Data Editor button on the Stata toolbar. The data editor is like a standard
spreadsheet with columns corresponding to variables and rows to observations. Data
may be entered or modified by choosing the cell, typing the value and then pressing
Enter or Tab.
With the data editor, quotes around strings are unnecessary. Missing numeric values
are recorded as ‘.’ (a period), missing string values are just empty strings.
The data editor initially names variables var1, var2, . . . . Variables may be renamed
by double-clicking anywhere in the variable’s column, which brings up the Variable
information dialog.
The data editor allows copying and pasting data created by other spreadsheet or
database programs.
It is important to always check the numeric format of a spreadsheet before copying
data to Stata. Unless otherwise instructed, Stata will interpret a number formatted
with a thousands separator, such as 1,314, as a string.

1.4.3 LOADING AN ASCII (TEXT) DATA FILE


Stata offers three basic commands for loading an ASCII (text) data file.
The first command, infile, is a very flexible way of reading an ASCII (text) datafile
from disk into memory. The data can be in either free- or fixed-format, and a single
observation may span any number of input lines.
The basic syntax for data in free-format (data may be separated by spaces, tabs or
commas) is:
infile varlist using filename [, clear]
where varlist is a list of variable names with blanks in between (that is, varname1
varname2 . . . ), filename is the name of the disk datafile (including the path, if
necessary) and clear is an option that clears data loaded in memory without saving
them (I follow the convention of denoting items that are optional by enclosing them in
square brackets). If the file name is specified without an extension, .raw is assumed.
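
As an illustration of the free-format syntax, the sketch below first writes a small text file and then reads it back. The outfile command is used here only to create the test file, and the file name mydata and the variables are hypothetical.

```stata
clear
set obs 3
generate id = _n
generate wage = 10*_n
* write a free-format text file; the extension .raw is assumed
outfile id wage using mydata, replace
* read it back, discarding the data currently in memory
infile id wage using mydata, clear
list
```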
If the data are in fixed-format, a dictionary (.dct) file is necessary. A dictionary is an
ASCII (text) file which describes the contents of a datafile. The data may be in the
same file as the dictionary or in another file. The basic syntax is:
infile using filename [, using(filename2) clear]
where filename is the name of the dictionary file and filename2 is the name of the
file containing the data. If the using() option is not specified, the data is assumed
to follow the dictionary in filename or, if the dictionary specifies the name of some
other file, that file is assumed to contain the data. Notice that if using(filename2)
is specified, filename2 is used to obtain the data even if the dictionary itself says
otherwise.
The basic syntax of a dictionary file is the following:
[infile] dictionary [using filename] {
* comments may be included freely
*
[type] varname
}
(data might appear here)
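
To make the template concrete, here is a sketch of a small dictionary file in which the data follow the dictionary itself. The variable names, labels and values are invented for the illustration.

```stata
dictionary {
* hypothetical layout: one observation per line, two variables
    int   id     "Identifier"
    float wage   "Hourly wage"
}
1 12.5
2 9.75
3 15
```

If these lines are saved as mydict.dct (a hypothetical name), typing infile using mydict, clear should load three observations on id and wage.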
The second is the infix command, which reads ASCII files in fixed-column format.
Again, a single observation may span any number of input lines. It is somewhat easier
but less flexible than infile. Its basic syntax is:
infix using filename [, using(filename2) clear]
where filename is the name of a dictionary file and filename2 is the name of the file
containing the data.
The third is the insheet command, which reads ASCII files created by a spreadsheet
or database program. Regardless of the creator, this command reads ASCII files where
there is one observation per line and the values are separated by tabs or commas. The
first line of the file may contain the variable names. The basic syntax is:
insheet [varlist] using filename [, {comma|tab|delimiter("char")} clear]
The {comma|tab|delimiter("char")} option tells Stata how values are separated in
the file (I follow the convention of denoting the available alternatives by enclosing
them in curly brackets, with a vertical bar as a separator). Specifying tab or comma
is not necessary because insheet can determine the separation character for itself
when the character is a tab or comma.
The insheet command can also determine for itself whether the file includes variable
names.
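
The sketch below shows insheet at work on a small comma-separated file. The file is created with outsheet only to keep the example self-contained; the file name survey.csv and the variables are assumptions.

```stata
clear
set obs 3
generate id = _n
generate wage = 10*_n
* create a comma-separated file whose first line holds the variable names
outsheet id wage using survey.csv, comma replace
* insheet detects the separator and the header line by itself
insheet using survey.csv, clear
describe
list
```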

1.4.4 LOADING A STATA DATA FILE


Stata data files have the default extension .dta.
Stata data files on disk may be loaded using the use command. The basic syntax is:
use filename [, clear]
where filename contains the full path to the data.
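
A minimal round trip, assuming a file called mydata.dta in the current working directory (so that no path is needed); the save command is used here only to create the file:

```stata
clear
set obs 3
generate id = _n
* write the data in memory to mydata.dta in the working directory
save mydata, replace
* reload the file, discarding whatever is currently in memory
use mydata, clear
describe
```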
1.5 BASIC DATA MANIPULATION


I now introduce some basic facilities for manipulating data.

1.5.1 DISPLAYING DATA


Strings and values of scalar expressions may be displayed using the display command.
This command may also be used interactively as a substitute for a hand calculator.
Examples:
. display "this is a string"
. display 5+exp(ln(10))
. display "the value of f(x) is " 5+exp(ln(10))
The content of the dataset in memory may be displayed by using the describe
command. The content of a Stata data file on disk may be described without actually
loading it by using the describe using filename command.
The complementary ds command lists variable names in a compact format, whereas
lookfor string helps in finding variables by searching for string among all variable
names and labels.
The list [varlist] command displays the values of variables. If no varlist is specified,
the values of all the variables are displayed.
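
The commands of this section can be tried on a small made-up dataset, as in the following sketch (the variable names and the label are hypothetical):

```stata
clear
set obs 4
generate id = _n
generate wage = 8 + 2*_n
label variable wage "hourly wage"
* overview of the dataset: variable names, storage types, labels
describe
* compact list of variable names
ds
* search names and labels for the string "wage"
lookfor wage
* values of one variable, then of all variables
list wage
list
```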
The display format of a variable may be specified using the command
format varlist %fmt
where %fmt is the chosen format for varlist. For example, format x %9.0g displays
the variable x in g (general numeric) format, whereas format x %8.3f displays x in
f (fixed numeric) format with three decimals.
The set dp comma command may be used to display numerical values using comma
as the decimal character. To switch the decimal character back to a period, type set dp
period.
Changing the display format does not affect the internal precision with which variables
are stored and manipulated.
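
A short sketch, with an invented variable x, of how the display format changes what is listed but not what is stored:

```stata
clear
set obs 3
generate x = _n/3
* fixed format with three decimals
format x %8.3f
list
* back to the general format
format x %9.0g
list
* the stored values are unchanged by either format
display x[1]
```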

1.5.2 LABELING DATA


Stata contains a number of commands for manipulating labels.
Variables are labeled by using the command
label variable varname "string"
where string (typed in quotes) is up to 80 characters long. If no label is specified, any
existing variable label is removed.
The values of a variable are labeled by using the command
label value varname [lblname]
where lblname is the name of a value label defined through the command
label define lblname # "string" [# "string" ...]
Example:
label define sexlbl 0 "male" 1 "female"
label value sex sexlbl
The label dir command lists the names of value labels stored in memory, label
drop lblnames eliminates the value labels lblnames, label drop _all eliminates all
value labels, whereas label list lists the names and contents of value labels stored
in memory.
Data files are labeled by using the command
label data "string"
where string (typed in quotes) is up to 80 characters long. Data labels are displayed
when the data are used or described. If no label is specified, any existing label is removed.
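
Putting the labeling commands together, a sketch on a made-up indicator variable might look as follows (all names and label texts are illustrative):

```stata
clear
set obs 4
generate byte sex = mod(_n, 2)
label variable sex "sex of respondent"
label define sexlbl 0 "male" 1 "female"
label values sex sexlbl
label data "illustrative dataset"
* show the definitions of all value labels in memory
label list
* list now displays the labels instead of the numeric codes
list
describe
```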

1.5.3 SUMMARIZING DATA


The command
summarize [varlist]
calculates and displays a variety of univariate summary statistics (number of
nonmissing observations, mean, standard deviation, minimum and maximum value).
If no varlist is specified, summary statistics are calculated for all the variables in the
data.
The command
summarize [varlist], detail
produces additional statistics including skewness, kurtosis, the four smallest and four
largest values, along with various percentiles.
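
The sketch below applies summarize to simulated data. The seed and the sample size are arbitrary, and the use of the saved results r(mean) and r(p50) goes slightly beyond what this section describes.

```stata
clear
set obs 100
* fix the seed so that the draws can be reproduced
set seed 12345
generate x = uniform()
summarize x
* after summarize, results such as the mean are left behind in r()
display r(mean)
summarize x, detail
* the detail option also stores percentiles, e.g. the median
display r(p50)
```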

1.5.4 CREATING NEW VARIABLES


New variables are created using the generate command. The basic syntax of this
command is:
generate newvar = exp [options]
where exp is an expression and options are optional instructions that may restrict the
application of exp.
For example, to generate the new variable y using positive values of an existing variable
x, type:
. generate y = x*x + log(x) if x>0
where *, + and > are examples of Stata operators, log(x) is an example of a Stata
function, and if x>0 is a qualifier that restricts the scope of the command to the
observations for which x > 0.
To generate the new variable y containing lagged values of x, type:
. generate y = x[_n-1]
Typing generate y = x[1] sets every observation of y equal to the first observation
in x, whereas generate y = x[_n] and generate y = x are equivalent.
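
The three kinds of subscript can be compared side by side on a toy variable, as in this sketch (the variable names are invented):

```stata
clear
set obs 5
generate x = _n^2
* lagged value: missing for the first observation
generate lagx = x[_n-1]
* every observation receives the first value of x
generate firstx = x[1]
* x[_n] is simply x
generate samex = x[_n]
list
```

Note that the lag refers to the preceding observation in the current sort order, so the data should be sorted appropriately before lags are computed.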
The Stata operators are:
• arithmetic operators: + (addition), - (subtraction), * (multiplication), /
(division), ^ (power);

• string operators: + (string concatenation), for example the expression "abc"
+ "def" produces the string "abcdef";

• relational operators: < (less than), > (greater than), <= (less or equal), >=
(greater or equal), == (equal), ~= (not equal);

• logical operators: & (and), | (or), ~ (not).


The order of evaluation follows the standard rules. Parentheses may be used to force
a different order of evaluation.
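
A few expressions typed at the prompt illustrate the operators and the effect of parentheses. This is only a sketch; the numbers are arbitrary.

```stata
* power is evaluated before multiplication, multiplication before addition
display 2 + 3*4^2
* parentheses force the addition to be carried out first
display (2 + 3)*4^2
* string concatenation
display "abc" + "def"
* relational and logical operators return 1 (true) or 0 (false)
display (3 > 2) & (1 == 2)
display (3 > 2) | (1 == 2)
display ~(3 > 2)
```

The first two lines should display 50 and 80 respectively.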
Functions are used in expressions. The argument(s) of a function may be any
expression, including other functions. The arguments of a function are enclosed in
parentheses. If there are multiple arguments, they are separated by commas. Functions
return missing when the value of the function is undefined.
Stata has a number of built-in functions:
• Mathematical functions, for example exp(x), log(x) or ln(x), sqrt(x),
abs(x) and the main trigonometric functions.

• Statistical functions: density, distribution and quantile functions
of various probability distributions, both discrete and continuous. If X
denotes the name of a continuous distribution, then Stata usually provides
X() (cumulative distribution function), Xtail() (upper tail cumulative
distribution function), invX() (quantile function) and invXtail() (upper
quantile function). The currently available distributions include chi2(df,x)
(chi-square distribution with df degrees of freedom), F(df1,df2,x) (F
distribution with df1 and df2 degrees of freedom), norm(x) (standard
Gaussian), nchi2(df,L,x) (noncentral chi-square distribution with df
degrees of freedom and noncentrality parameter L) and t(df,x) (t
distribution with df degrees of freedom).
Examples:
. display Binomial(5,x,.25) (probability of x or more successes for a binomial with parameters n = 5 and π = .25)
. display normden(x) (standard normal density)
. display norm(x) (cumulative standard normal)
. display invnorm(p) (standard normal quantile function)
. display binorm(x,y,.5) (cumulative bivariate Gaussian with zero means, unit variances and correlation ρ = .5)

• Pseudo-random number generator: uniform(), which generates uniformly
distributed pseudo-random numbers on the interval [0,1). It takes no
arguments and is based on George Marsaglia’s KISS (Keep It Simple Stupid)
generator. Pseudo-random numbers from any other continuous distribution may
be generated through the inverse probability integral transform.
For example, pseudo-random numbers from the standard normal
distribution may be generated with the expression invnorm(uniform()),
where the function invnorm evaluates the quantile function of the standard
normal.

• String functions (which apply to string variables), for example lower(s)
(returns the lowercased variant of s), real(s) (converts s into a numeric
value), string(n) (converts n into a string), substr(s,n1,n2) (returns the
substring of s starting at n1 for a length of n2; if n2 = ., the remaining portion
of the string is returned), and upper(s) (returns the uppercased variant of
s).

• Special functions, for example float(x) (returns the value of x rounded
to float storage type), int(x) (returns the integer part of x),
max(x1,x2,...,xn) and min(x1,x2,...,xn) (return respectively the maximum
and the minimum of the arguments, ignoring missing values),
round(x,y) (returns x rounded into units of y), sign(x) (returns -1 if x < 0,
0 if x = 0, 1 if x > 0, and . if x = .), and sum(x) (returns the running sum of
x, treating missing values as zero).
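As an illustration of the functions listed above, here is a hedged sketch that combines the pseudo-random number generator with the inverse probability integral transform, a running sum and a few string functions. The seed, the sample size and the variable names (u, z, e, cumu) are arbitrary assumptions. The exponential example relies on the fact that the unit exponential distribution function F(e) = 1 - exp(-e) has inverse -ln(1 - u).

```stata
clear
set obs 1000
set seed 12345

* Uniform draws on [0,1)
generate u = uniform()

* Standard normal draws through the normal quantile function
generate z = invnorm(uniform())

* Unit exponential draws: invert F(e) = 1 - exp(-e) at u
generate e = -ln(1 - u)

* Running sum of u
generate cumu = sum(u)

summarize u z e

* String functions: first two characters, uppercased, plus a converted number
display upper(substr("stata",1,2)) + string(7)
```

The last command displays ST7.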

A variety of date and time-series functions are also available, as well as matrix functions
returning scalars (see Section 4.4.3).

1.5.5 CHANGING AND RENAMING VARIABLES


The content of an existing variable may be changed by using the replace command,
whereas the name of an existing variable may be changed (its contents remain
unchanged) by using the rename command.
The recode varname command changes the values of varname according to the rules
specified. For example,
. recode x 1=2 3=4
changes 1 in x to 2 and 3 to 4, whereas
. recode x 1 3/5 = 6
changes 1, 3, 4 and 5 to 6.
Given a string variable named varname, the command
encode varname, generate(newvar)
generates a new numeric variable named newvar based on varname, creating at the
same time (or just using as necessary) the value label newvar. Do not use encode if
varname contains numbers that merely happen to be stored as strings (e.g. the number
‘1,314’). In this case use instead
generate newvar = real(varname)
The decode command creates a new string variable named newvar based on the
“encoded” numeric variable varname and its value label.
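The commands of this section can be tried on a tiny made-up dataset. This is only a sketch: the variables sex and income, their values and the new variable names are assumptions introduced for illustration.

```stata
clear
input str6 sex str4 income
"male"   "1314"
"female" "2500"
"female" "900"
end

* sex holds genuine categories: encode it (codes follow alphabetical order)
encode sex, generate(sexcode)

* income holds numbers stored as strings: convert with real(), not encode
generate inc = real(income)

* decode recovers a string variable from the numeric codes and value label
decode sexcode, generate(sex2)

* change the value 900 of inc to 1000
recode inc 900=1000

list
```

Because the codes are assigned in alphabetical order, "female" becomes 1 and "male" becomes 2.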

1.5.6 ELIMINATING VARIABLES OR OBSERVATIONS


The drop command eliminates variables or observations from the data in memory. To
eliminate variables use
drop varlist
To eliminate observations use
drop in range [if exp]
The drop _all command eliminates all variables and observations in memory.
The keep command works the same as drop except that we specify the variables or
observations to be kept rather than those to be deleted.
The clear command essentially resets Stata and is equivalent to the set of commands:
. version 7.0
. drop _all
. label drop _all (drop all labels in memory)
. scalar drop _all (drop all scalar variables in memory)
. matrix drop _all (drop all matrices in memory)
. eq drop _all (drop all equations in memory)
. constraint drop _all (drop all constraints in memory)
. discard (drop all programs in memory)

1.5.7 INCREASING THE NUMBER OF OBSERVATIONS IN A DATASET


The set obs # command changes the number of observations in the current dataset
to #, where # is an integer at least as large as the current number _N of observations.
If there are variables in memory, the values of all new observations are set to missing.
For example,
. drop _all
. set obs 100
. gen x = _n
clears memory, makes 100 observations and assigns the variable x the values from 1
to 100.
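A short sketch of what happens when observations are added to a dataset that already contains a variable. The sizes 3 and 5 are arbitrary.

```stata
drop _all
set obs 3
generate x = _n

* Increase the number of observations from 3 to 5
set obs 5

* The two new observations have x equal to missing
list
count if x == .
```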

1.6 OUTPUTTING DATA


Stata offers three basic commands for outputting data, corresponding to the use,
infile and insheet commands discussed in Section 1.4.
The first command


save [filename] [, options]
stores the dataset currently in memory on disk in Stata format under the name
filename. If filename is not specified, the name under which the data was last known
to Stata is used. If filename is specified without an extension, .dta is assumed.
The available options are nolabel, old, replace and all. The old option makes the
dataset readable by someone with Stata 6.0, whereas the replace option permits save
to overwrite an existing dataset.
The second command
outfile [varlist] using filename [, options]
writes data to a disk file in ASCII (text) format. The data saved by outfile can be
read back by infile. If filename is specified without an extension, .raw is assumed
unless the dictionary option is specified, in which case .dct is assumed.
The third command
outsheet [varlist] using filename [, options]
writes data in tab- or comma-separated ASCII format into a file. This is the format
that most spreadsheet programs prefer. If filename is specified without an extension,
.out is assumed.
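The three output commands can be compared on a small made-up dataset. In this sketch the file name demo_out is hypothetical, and the files are written to the current working directory.

```stata
drop _all
set obs 3
generate id = _n
generate x = id^2

* Stata format: writes demo_out.dta
save demo_out, replace

* Plain ASCII: writes demo_out.raw, readable by infile
outfile id x using demo_out, replace

* Tab-separated ASCII with variable names in the first line: writes demo_out.out
outsheet id x using demo_out, replace

* Read the tab-separated file back in
insheet using demo_out.out, clear
list
```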

1.7 LOG FILES


The log command echoes a copy of a Stata session to a file or a device. More precisely:
log using filename [, options]
opens the file filename and echoes a copy of the Stata session to the file. If filename is
specified without an extension, .smcl is assumed (SMCL is Stata’s output language).
The available options are noproc, append and replace.
The log close command stops logging the session and closes the file, log off
temporarily stops logging the session leaving the file open, while log on resumes
logging to the file.
The set log command controls the dimensions of output sent to the log. Its format
is:
set {display|log} {linesize|pagesize} #
where # is the line or page length, for example set linesize 120 or set pagesize
40.
2

Stata Commands
In this chapter I describe the syntax of some frequently used Stata commands. My
selection is of course subjective.

2.1 GENERAL SYNTAX


The general syntax of a Stata command is:
command [varlist] [ = exp] [weight] [if exp] [in range] [, options]
If no varlist appears, the command assumes a varlist of _all, that is, the command
is applied to all the variables in the data.
The option = exp specifies the value to be assigned to a variable. It is most often used
with generate and replace. For example:
. replace newvar = oldvar+2
Many commands take command-specific options. A single comma separates a
command’s options from the rest of the command.
Most commands can be abbreviated. For example, one may type gen or simply g
instead of generate, summ or simply su instead of summarize, des or simply d instead
of describe, l instead of list, etc. See the on-line help or the Reference Manual for
the shortest allowable abbreviation of a command.
The F-keys may be used to create shortcuts to some commands. For example, the
F3 key comes defined as describe followed by Enter.

2.1.1 BY
Most Stata commands allow the by varlist: prefix. This causes command to be
repeated for each subset of the data for which the values of the variables in varlist are
equal. The use of by requires the data to be preliminarily sorted by varlist.
Example:
. sort x
. by x: summarize y
Not all commands allow the by varlist: prefix. Some replace it with by(groupvar) in
the options. For example, the syntax of the ttest command is:
ttest varname [if exp] [in range], by(groupvar) [unequal welch level(#)]
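A minimal sketch of the by prefix on made-up data. The variable names g, y, n and first are illustrative assumptions. Within a by-group, _n and _N refer to the position within the group and to the size of the group.

```stata
clear
input g y
1 10
1 12
2 20
2 26
2 23
end

* by requires the data to be sorted by the grouping variable
sort g
by g: summarize y

* Group size, and an indicator for the first observation of each group
by g: generate n = _N
by g: generate first = (_n == 1)
list
```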

2.1.2 WEIGHTS
The option weight indicates the weight to be attached to each observation. The syntax
of weight is [weightword = exp], where weightword is either weight (the default
treatment of weights) or one of fweight, pweight, aweight and iweight, which
correspond to the four kinds of weights that Stata understands (although not every
command supports all four of them):

1. frequency weights (fweight) are integer-valued and indicate multiple
observations,

2. probability or sampling weights (pweight) are inversely proportional to the
sample inclusion probabilities,

3. analytic weights (aweight) are inversely proportional to the variance of an
observation,

4. importance weights (iweight) indicate the relative “importance” of an
observation.

The default treatment (weight) is each command’s idea of what the “natural” weights
are and is one of the above weight types.
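To make the meaning of frequency weights concrete, the following sketch uses a made-up dataset in which the hypothetical variable freq says how many times each value of x occurs. Summarizing with fweight gives the same answer as physically duplicating the observations.

```stata
clear
input x freq
1 2
2 3
5 1
end

* Weighted summary: 6 observations with mean (2*1 + 3*2 + 1*5)/6 = 13/6
summarize x [fweight = freq]
display r(N)
display r(mean)

* Same result after replacing each observation with freq copies of itself
expand freq
summarize x
```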

2.1.3 IF AND IN
The if exp qualifier restricts the scope of the command to those observations for
which the value of the expression is true. For example:
. replace y = x+2 if x>0
The in range qualifier restricts the scope of the command to a specific observation
range, where range is any of #, #/#, #/l or f/#.
For example, to summarize the values of x and y for the first 10 observations:
. summarize x y in 1/10
. summarize x y in f/10
whereas
. summarize x y in 10
uses only the tenth observation.
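The effect of the if and in qualifiers is easy to verify with count on a made-up dataset of 20 observations (the variable x below is an assumption for illustration). A negative number in a range counts from the end of the data.

```stata
clear
set obs 20
generate x = _n - 10

* x > 0 holds for observations 11 to 20
count if x > 0

* The first 10 observations
count in 1/10

* The tenth observation only
count in 10

* Both qualifiers together: observations 11 to 15
count if x > 0 in f/15

* The last 5 observations
count in -5/l
```

The five counts are 10, 10, 1, 5 and 5.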

2.1.4 QUIETLY AND NOISILY


Typing quietly command suppresses all terminal output for the duration of
command, whereas noisily command turns terminal output back on, if appropriate, for
the duration of command.
Example:
. quietly by x: generate y = sum(z)
2.2 BASIC DATA COMMANDS


I discuss nine basic commands: describe, list, drop and keep, generate and its
extension egen, replace, sort and gsort.

2.2.1 DESCRIBE
This command displays a summary of the contents of either the data in memory or
the data stored in a Stata-format dataset. Its syntax is
describe [varlist] [, short detail fullnames numbers]
in the first case, and
describe using filename [, short detail]
in the second case, where short suppresses the specific information about each
variable, detail includes more detailed information (the width of a single observation,
the maximum number of observations holding the number of variables constant, the
maximum number of variables holding the numbers of observations constant, the
maximum width for an observation, and the maximum size of the dataset), fullnames
displays the full names of the variables (the default is to present an abbreviation when
the variable name is longer than 15 characters), and numbers presents the variable
number along with the variable name. The numbers and fullnames options may not
both be specified together.

2.2.2 LIST
This command displays the values of variables. Its syntax is:
list [varlist] [if exp] [in range] [, [no]display nolabel noobs]
where [no]display forces the format into display or tabular (nodisplay) format (if one
of these two options is not specified, then Stata chooses one based on its judgment),
nolabel causes the numeric codes rather than label values to be displayed, and noobs
suppresses printing of the observation numbers.
Examples:
. list in 1/10
. list x y
. list x y in 1/10
. list if x>20
. list x y if z>20
. list x y z if z>20 in 1/10

2.2.3 DROP AND KEEP


The drop command eliminates variables or observations from the data in memory. Its
syntax is:
drop varlist
drop if exp
drop in range [if exp]
The keep command works exactly the same as drop except that one specifies the
variables or observations to be kept.
Examples:
. drop in 1/33
. keep in 34/l (drop first 33 observations)
. drop in -10/l (drop last 10 observations)
. drop if x<21
. keep if x>=21
. sort y
. by y: keep if _n==_N (keep last observation within each value of y)
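The last pair of commands above is a common idiom for keeping one observation per group. A hedged sketch on made-up panel-style data, where id and t are hypothetical variable names:

```stata
clear
input id t
1 1
1 2
2 1
2 2
2 3
end

* Sort by id and, within id, by t, then keep the last observation of each id
sort id t
by id: keep if _n == _N
list
```

Two observations remain, namely the one with the largest t for each id.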

2.2.4 GENERATE AND EGEN


The generate command creates a new variable. Its syntax is:
generate [type] newvar[:lblname] = exp [if exp] [in range]
If type is not specified, float is the default (the default type may be changed using the
set type command). If missing values are generated, the number of missing values in
newvar is always reported.
To prevent Stata from returning an error when string variables are generated, type
must be set to str#.
Examples:
. generate x2 = x*x
. generate bigz = z>100000 & z~=.
. gen double w = x/y
. gen xlag = x[_n-1]
. gen u = uniform() (U(0,1) pseudo-random numbers)
. gen z = invnorm(uniform()) (N(0,1) pseudo-random numbers)
The egen command provides an extension to generate. Its syntax is:
egen [type] newvar = fcn(stuff) [if exp] [in range] [, options]
egen creates newvar equal to fcn(stuff). Depending on fcn(), stuff refers to an
expression, a list of variables, or a list of numbers. The options are similarly function
dependent. Note that egen may change the sort order of the data.
Important examples of egen functions include:

• count(exp) [, by(varlist)] creates a constant (within varlist) containing
the number of nonmissing observations of exp.

• diff(varlist) creates an indicator variable equal to 1 where the variables in
varlist are not equal and 0 otherwise. It may not be combined with by.
• group(varlist) [, missing label truncate(num)] creates a single variable
taking on values 1,2,... for the groups formed by varlist. It may not
be combined with by. The label option returns integers from 1 up according
to the distinct groups of varlist in sorted order. The integers are labeled with
the values of varlist, or the value labels if they exist. The truncate() option
truncates the values contributed to the label from each variable in varlist to
the length specified by the integer argument num.

• iqr(exp) [, by(varlist)] creates a constant (within varlist) containing the
interquartile range of exp. The same syntax holds for a number of other
functions with argument exp, such as kurt (coefficient of kurtosis), mad
(median absolute deviation from the median), max (maximum value), mean
(mean), median (median), mdev (mean absolute deviation from the mean), min
(minimum value), sd (standard deviation), skew (coefficient of skewness), sum
(sum).

• ma(exp) [, t(#) nomiss] creates a #-period moving average of exp. If t()
is not specified, t(3) is assumed. Notice that # must be odd and exp must
not produce missing values.

• pctile(exp) [, p(#) by(varlist)] creates a constant (within varlist)
containing the #-th percentile of exp. If p() is not specified, 50 is assumed,
meaning medians.

• rmax(varlist) gives the maximum value in varlist for each observation (row).
It may not be combined with by. The same syntax holds for a number of
other functions with argument varlist, such as rmean (row mean), rmin (row
minimum), rmiss (row number of missing values).

• std(exp) [, mean(#) std(#)] creates the standardized value of exp using
the specified mean and standard deviation. The default is mean(0) and std(1),
producing a variable with zero mean and unit variance.

Examples:
. egen avgx = mean(x)
. gen dev = x-avgx
. egen x = median(x2-x1) (expression, - means subtraction)
. egen y = rmean(x1 x2 x3)
. egen y = rmean(x1-x3) (varlist, - means through)
. egen sdx = sd(x)
. egen stdx = std(x), mean(100) std(10)
. egen sumx = sum(x), by(y)
. egen xy = group(x y)
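A small worked sketch of some egen functions on made-up data. The variables g and x and the names of the new variables are assumptions for illustration.

```stata
clear
input g x
1 1
1 3
2 10
2 20
2 30
end

* Group means and group counts, constant within each value of g
egen gmean = mean(x), by(g)
egen n = count(x), by(g)

* Consecutive integer codes 1,2,... for the groups formed by g
egen id = group(g)

* Standardized x (zero mean and unit variance by default)
egen zx = std(x)

list
summarize zx
```

Here gmean equals 2 in the first group and 20 in the second, while n equals 2 and 3.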

2.2.5 REPLACE
This command changes the contents of an existing variable. Its syntax is:
replace oldvar = exp [if exp] [in range] [, nopromote]


where nopromote prevents replace from promoting the variable type to accommodate
the change.
Examples:
. replace z=. if z<=0
. replace y = 25 in 1007
. sort z
. by z: gen avgx = sum(x)/sum(x~=.)
. by z: replace avgx = avgx[_N]

2.2.6 SORT AND GSORT


The sort command arranges the observations of the current data in ascending order
of the values of the variables in varlist. Its syntax is
sort varlist [in range]
There is no limit to the number of variables in varlist and each variable can be numeric
or string. Missing values are interpreted as being larger than any other number and
are thus placed last (there is an exception: When sorting on a string variable, null
strings are placed first).
The dataset is marked as being sorted by varlist unless in range is specified.
Examples:
. sort personid
. sort lstname frstname midinitl
Unlike sort, which can produce only ascending-order arrangements, gsort may arrange
the observations in either ascending or descending order. Its syntax is
gsort [+|-]varname [[+|-]varname [...]] [, generate(newvar) mfirst]
The observations are placed in ascending order of varname if + or nothing is typed in
front of the name and in descending order if - is typed.
The generate(newvar) option creates newvar containing 1,2,3,. . . , for each of the
groups denoted by the ordered varnames. This is useful when one wishes to use the
ordering with a subsequent by. The mfirst option specifies that missing values are to
be placed first in descending orderings rather than last.
Examples:
. gsort x (same as sort x)
. gsort +x (same as gsort x)
. gsort -x (reverse sort)
. gsort -name (reverse alphabetical)
. gsort x y (ascending x, ascending y)
. gsort x -y (ascending x, descending y)
. gsort -x, gen(revx)


. quietly by revx: gen rcum = _N if _n==1
. replace rcum = sum(rcum)
. replace rcum = rcum/rcum[_N]

2.3 COMBINING DATA


I discuss two commands: append and merge.

2.3.1 APPEND
This command appends a Stata-format dataset stored on disk to the end of the dataset
in memory. Its syntax is:
append using filename [, nolabel]
where nolabel prevents copying the value label definitions from the disk dataset. Even
if this option is not specified, label definitions from the disk dataset never replace
definitions already in memory.
If filename is specified without an extension, .dta is assumed.

2.3.2 MERGE
This command joins corresponding observations from the dataset currently in memory
(called the master dataset) with those from the Stata-format dataset stored as filename
(called the using dataset) into single observations (if filename is specified without an
extension, .dta is assumed). It can perform both one-to-one and match merges. Its
syntax is:
merge [varlist] using filename [, nolabel update replace
nokeep _merge(varname)]
where nokeep causes merge to ignore observations in the using data that have no
corresponding observation in the master (the default is to add these observations to
the merged result and mark them with _merge==2) and _merge(varname) specifies
the name of the variable that will mark the source of the resulting observation. The
default is _merge(_merge), which adds a new variable _merge to the data whose
values are:
_merge==1 (obs. from master data)
_merge==2 (obs. from using data)
_merge==3 (obs. from both master and using data)
Examples:
. use data1 (one-to-one merge)
. merge using data2
. tab _merge
. use data2 (match merge)
. sort x
. save data2, replace


. use data1
. sort x
. merge x using data2
. tab _merge
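A self-contained sketch of a match merge on two made-up datasets. The file names master_demo and using_demo and the variables id, x and y are hypothetical. The example uses the merge syntax described in this section; later Stata releases introduced a different merge syntax, so this form should be regarded as version-specific.

```stata
* Master data: id = 1, 2, 3
clear
input id x
1 10
2 20
3 30
end
sort id
save master_demo, replace

* Using data: id = 2, 3, 4
clear
input id y
2 200
3 300
4 400
end
sort id
save using_demo, replace

* Match merge on id; both datasets are sorted by id
use master_demo, clear
merge id using using_demo
tabulate _merge
list
```

One observation (id 1) comes from the master data only, one (id 4) from the using data only, and two (id 2 and 3) from both.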

2.4 RESHAPING DATA


I discuss five commands: collapse, contract, expand, fillin and reshape.

2.4.1 COLLAPSE
This command replaces the data in memory with a new dataset consisting of the
means, medians, etc. of the specified variables. Its syntax is:

collapse clist [weight] [if exp] [in range] [, by(varlist) cw fast]

where clist is either

[(stat)] varlist [[(stat)] ...]

[(stat)] target_var=varname [target_var=varname ...] [[(stat)] ...]

or any combination of the varlist or target_var forms, and stat is one of the following:
mean (means), sd (standard deviations), sum (sums), rawsum (sums ignoring optionally
specified weights), count (number of nonmissing observations), max (maxima), min
(minima), median (medians), p# (#th percentile), iqr (interquartile range). If stat is
not specified, mean is assumed.

The by(varlist) option specifies the groups over which the means, etc., are to be
calculated, cw specifies casewise deletion (if not specified, all observations possible are
used for each calculated statistic) and fast specifies that collapse not go to extra work
so that it can restore the original data should the user press Break.
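As an illustration, the following sketch collapses a made-up dataset to one observation per group. The variable names g, x, mx and n are assumptions.

```stata
clear
input g x
1 1
1 3
2 10
2 20
2 30
end

* One observation per value of g, holding the mean and the count of x
collapse (mean) mx=x (count) n=x, by(g)
list
```

The resulting dataset has two observations: mx is 2 and n is 2 for g = 1, and mx is 20 and n is 3 for g = 2.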

2.4.2 CONTRACT
This command makes datasets of frequencies. It replaces the data in memory with a
new dataset consisting of all combinations of varlist that exist in the data together
with a new variable that contains the frequency of each combination. Its syntax is:

contract varlist [weight] [if exp] [in range] [, freq(varname) zero nomiss]

where freq(varname) specifies a name for the frequency variable (if not specified,
_freq is used, the name must be new), zero specifies that combinations with frequency
zero are wanted, and nomiss specifies that observations with missing values on any of
the variables in varlist will be dropped (if not specified, all observations possible are
used).
2.4.3 EXPAND
This command replaces each observation in the current dataset with n copies of the
observation, where n is equal to the integer part of the required expression (if the
expression is less than one or equal to missing, then it is interpreted as if it were one,
and the observation is retained but not duplicated). Its syntax is:
expand [=]exp [if exp] [in range]
Example:
. expand 2
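The commands contract and expand are in a sense inverses of each other, as this sketch on made-up data suggests (the variables a and b are hypothetical).

```stata
clear
input a b
1 1
1 1
1 2
2 2
end

* Three distinct combinations of a and b, with frequencies in _freq
contract a b
list

* Replace each combination with _freq copies: back to four observations
expand _freq
count
```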

2.4.4 FILLIN
This command rectangularizes a dataset by adding observations with missing data
so that all interactions of the variables in varlist exist. It also adds the variable
_fillin to the data (with value 1 for created observations and 0 for previously existing
observations). Its syntax is:
fillin varlist

2.4.5 RESHAPE
This command converts data from wide to long form and vice versa. Its basic syntax
is:
reshape wide varnames, i(varlist) [j(varname) string]
reshape long varnames, i(varlist) [j(varname) string]
where i(varlist) specifies the variable(s) whose unique values denote a logical
observation, j(varname) specifies the variable whose unique values denote a
subobservation, and string specifies that the j() may contain string values.
Examples:
. reshape long x1 x2, i(y) j(z) (converts from wide to long)
. reshape wide (converts back to wide)
. reshape ..., i(z) (single i() variable)
. reshape ..., i(z1 z2) (two i() variables)
. reshape long x, i(y) j(z 1-3 5) (specifying j() values)
. reshape long x, i(y) j(z) string (allow string variables in j())
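A minimal worked sketch of reshape on made-up data with two logical observations. The names id, x1, x2 and year are assumptions.

```stata
clear
input id x1 x2
1 10 11
2 20 21
end

* Wide to long: x1 and x2 become a single variable x indexed by year = 1, 2
reshape long x, i(id) j(year)
list

* Long back to wide
reshape wide x, i(id) j(year)
list
```

In long form the data have four observations and the variables id, year and x.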

2.5 BASIC SAMPLE STATISTICS


I discuss seven commands: count, summarize, means, centile, cumul, correlate and
regress.

2.5.1 COUNT
This command counts observations satisfying the specified conditions. Its syntax is:
count [if exp] [in range]


If no condition is specified, count displays the number of observations in the dataset.
Examples:
. count if y<0
. by x: count if y<0

2.5.2 SUMMARIZE
This command reports a variety of univariate summary statistics. Its syntax is:
summarize [varlist] [weight] [if exp] [in range] [,
{detail|meanonly} format]
where detail produces additional statistics (including skewness, kurtosis, the four
smallest and four largest values, along with various percentiles), meanonly suppresses
display of the results and calculation of the variance (it is allowed only when detail
is not specified) and format requests that the summary statistics be displayed using
the display format associated with the variables rather than the default g format.
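The statistics computed by summarize are left behind as saved results, which is the main use of the meanonly option. A hedged sketch on made-up data (the variable names x and xc are assumptions):

```stata
clear
set obs 5
generate x = _n

* Detailed summary; the median is saved in r(p50)
summarize x, detail
display r(p50)

* Compute the mean without displaying anything, then center x around it
summarize x, meanonly
generate xc = x - r(mean)
list
```

Since x takes the values 1 to 5, the median and the mean are both 3.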

2.5.3 MEANS
This command reports the arithmetic, geometric, and harmonic means, along with
their respective confidence intervals, for the specified variables. Its syntax is:
means [varlist] [if exp] [in range] [, add(#) only level(#)]
where add(#) adds the value # to each variable in varlist before computing the means
and confidence intervals (this may be useful when analyzing variables with nonpositive
values), only modifies the action of the add() option (if specified, the add() option
only adds # to variables with at least one nonpositive value) and level(#) specifies
the percentage confidence level for confidence intervals.
The ci command may be used if one simply wants arithmetic means and corresponding
confidence intervals.

2.5.4 CENTILE
This command reports the (per)centiles of the specified variables and their confidence
intervals. By default, confidence intervals are obtained using a binomial method that
makes no assumptions as to the underlying distribution of the variable. The syntax is:
centile [varlist] [if exp] [in range] [, centile(numlist) cci
normal meansd level(#)]
where centile(numlist) specifies the centiles to be reported, for example
centile(25 50 75) (if not specified, medians are reported), cci (conservative confidence interval)
prevents centile from interpolating when calculating the distribution-free (binomial-
based) confidence limits, normal specifies that confidence intervals are to be obtained
assuming that both the data and the centiles are normally distributed, meansd
calculates confidence intervals assuming that the estimated centiles themselves are
normally distributed.
The related command
pctile newvar = exp
creates a new variable containing the percentiles of exp, where exp is typically just
another variable.

2.5.5 CUMUL
This command creates a new variable containing the empirical distribution function
(edf) of a variable. Its syntax is:
cumul varname [weight] [if exp] [in range] , gen(newvar) [freq
by(varlist)]
where gen(newvar) specifies the name of the new variable to be created (it is not
optional), freq requests the edf to be in frequency units; otherwise it is normalized so
that newvar is 1 for the largest value of varname, and by(varlist) specifies that edf’s
be generated separately for each by-group.
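A small sketch of cumul on four made-up values, with the hypothetical name F for the new variable.

```stata
clear
input x
3
1
2
4
end

* Empirical distribution function of x
cumul x, gen(F)

* After sorting, F rises in steps of 1/4 up to 1
sort x
list
```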

2.5.6 CORRELATE
This command reports the covariance or correlation matrix of the specified variables.
Observations are excluded from the calculation due to missing values on a casewise
basis. The syntax is:
correlate [varlist] [weight] [if exp] [in range] [, means
noformat covariance wrap]
where means causes summary statistics (means, standard deviations, minima and
maxima) to be displayed along with the matrix, noformat displays the summary
statistic requested by the means option in g format regardless of the display formats
associated with the variables, covariance displays the covariances rather than the
correlation coefficients, and wrap requests that no action be taken on wide matrices
to make them readable.

2.5.7 REGRESS
This command estimates linear regression models with a single response or dependent
variable. Estimation is carried out by least squares (either ordinary least squares or
weighted least squares). Its basic syntax is:
regress yvar [xvars] [weight] [if exp] [in range] [, level(#)
noconstant regress_options]
where level(#) specifies the confidence level (in percent) for the regression
parameters (the default is 95%), noconstant suppresses the constant term (intercept)
in the regression, and the additional regress_options are described in more detail in
Section 6.1.1.
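The following sketch runs regress on simulated data. The seed, the sample size and the true coefficients 1 and 2 are arbitrary assumptions chosen only for illustration.

```stata
clear
set obs 100
set seed 2001

* Simulate y = 1 + 2x + standard normal noise
generate x = uniform()
generate y = 1 + 2*x + invnorm(uniform())

* Ordinary least squares with a constant and 95% confidence intervals
regress y x
display _b[x]

* Same regressors, 90% confidence intervals and no constant term
regress y x, level(90) noconstant
```

After estimation, the coefficient on x is available as _b[x].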
2.6 TABLES
Stata offers three basic commands for producing tables: table, tabulate and
tabulate, summarize.

2.6.1 TABLE
This command provides tables of summary statistics. Its syntax is a little involved:
table rowvar [colvar [supercolvar]] [weight] [if exp] [in range]
[, contents(clist) by(superrow_varlist) cw row col scol
format(%fmt) center left concise missing replace name(string)
cellwidth(#) csepwidth(#) scsepwidth(#) stubwidth(#)]
where contents(clist) specifies the content of the table’s cells (up to 5 statistics may
be specified, if contents() is not specified it is assumed to be contents(freq)), clist
is as in collapse, row specifies a row is to be added to the table reflecting the total
across rows, col specifies a column is to be added to the table reflecting the total
across columns, format(%fmt) specifies the display format for presenting numbers in
the table’s cells, center specifies results are to be centered in the table’s cells (the
default is to right align), left specifies that column labels are to be left aligned (the
default is to right align), missing specifies that missing statistics are to be shown in
the table as periods (the default is to leave them blank). See the on-line help or the
Reference Manual for a description of the other options.

2.6.2 TABULATE
This command provides one- and two-way tables of frequency counts along with
various measures of association, including the common Pearson chi-squared, the
likelihood ratio chi-squared, Cramer’s V, Fisher’s exact test, Goodman and Kruskal’s
gamma, and Kendall’s tau-b.
The syntax for one-way tables is:
tabulate varname [weight] [if exp] [in range] [, generate(varname)
matcell(matname) matrow(matname) missing nofreq nolabel
plot subpop(varname)]
The syntax for two-way tables is:
tabulate varname1 varname2 [weight] [if exp] [in range] [,
all cell chi2 column exact gamma lrchi2 matcell(matname)
matcol(matname) matrow(matname) missing nofreq nolabel
row taub V wrap]
where all is equivalent to specifying chi2 lrchi2 V gamma taub, cell displays
the relative frequency of each cell in a two-way table, chi2 calculates and displays
Pearson’s chi-squared for the hypothesis that the rows and columns in a two-way
table are independent, column displays in each cell of a two-way table the relative
frequency of that cell within its column, exact displays the significance calculated
by Fisher’s exact test, gamma displays Goodman and Kruskal’s gamma along with
its asymptotic standard error, generate(varname) creates a set of indicator variables
reflecting the observed values of the tabulated variable, lrchi2 displays the likelihood-
ratio chi-squared statistic (the request is ignored if any cell of the table contains
no observations), matcell(matname) saves the reported frequencies in the matrix
matname, matcol(matname) saves the numeric values of the column stub in the
vector matname, matrow(matname) saves the numeric values of the row stub in
the vector matname, missing requests that missing values be treated like other
values in calculations of counts, percentages, and other statistics, nofreq suppresses
printing the frequencies, nolabel causes the numeric codes to be displayed rather
than the value labels, plot produces a bar chart of the relative frequencies in a one-
way table, replace indicates that the immediate data specified as arguments to the
command are to be left as the current data in place of whatever data was there, row
displays in each cell of a two-way table the relative frequency of that cell within its
row, subpop(varname) excludes observations for which varname = 0 in tabulating
frequencies, taub displays Kendall’s tau-b along with its asymptotic standard error,
and V (note capitalization) displays Cramer’s V.

2.6.3 TABSUM
The tabulate, summarize() command produces one- and two-way tables of
summary statistics. Although table is better, tabulate, summarize() is faster. Its
syntax is:
tabulate varname1 [varname2] [weight] [if exp] [in range] ,
summarize(varname3) [[no]means [no]standard [no]freq [no]obs
wrap nolabel missing]
where summarize(varname3) identifies the name of the variable for which summary
statistics are to be reported (if this option is not specified, then a table of frequencies
is produced), [no]means includes only or suppresses only the means from the table
(the summarize() table normally includes the mean, standard deviation, frequency,
and, if the data is weighted, the number of observations), [no]standard includes only
or suppresses only the standard deviations from the table, [no]freq includes only or
suppresses only the frequencies from the table, [no]obs includes only or suppresses
only the reported number of observations from the table, and missing requests that
missing values of varname1 and varname2 be treated as categories rather than as
observations to be omitted from analysis.
Examples:
. tabulate y, summarize(z) (one-way table)
. tabulate y1 y2, summarize(z) (two-way table)
. sort x (n-way table)
. by x: tabulate y1 y2, summarize(z) means nofreq
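The tabulate commands of this section can be tried on a small made-up dataset. The variables g1, g2 and z are assumptions for illustration.

```stata
clear
input g1 g2 z
1 1 10
1 2 20
2 1 30
2 2 40
1 1 14
end

* Two-way table of frequencies
tabulate g1 g2

* One-way table of summary statistics of z by g1
tabulate g1, summarize(z)

* The same table separately for each value of g2, showing means only
sort g2
by g2: tabulate g1, summarize(z) means
```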
3

Graphics
Stata graphics are not very fancy, but allow considerable flexibility and are relatively
simple to use.

3.1 BASIC SYNTAX AND GRAPHIC STYLES


The basic syntax of the graph command is:
graph [varlist] [weight] [if exp] [in range] [, options]
Typed without arguments, graph recalculates and redisplays the last graph. The
syntax to review a saved Stata graph is:
graph using filename [filename] [, options]
An existing Stata graph can be translated to another format (e.g. PostScript) using
the translate command and printed using the print command.
Examples:
. translate mygraph.gph mygraph.eps (converts to Encapsulated PostScript)
. translate mygraph.gph mygraph.wmf (converts to Windows metafile)
. translate mygraph.gph mygraph.prn (converts to printer format)
. print mygraph.gph (prints mygraph)
. print @Graph (prints the graph in the Graph window)
Notice that graphs may also be produced by other Stata commands such as kdensity
(nonparametric density estimation), ksm (regression smoothers) and logistic (logistic
regression diagnostic plot).
Stata offers eight basic graph styles:
1. histogram,

2. twoway (two-way scatterplots),

3. matrix (two-way scatterplot matrices),

4. box (box plots),

5. oneway (one-way scatterplots),



6. star (star charts),

7. bar (bar charts),

8. pie (pie charts).

After discussing some options of the graph command that are common across all styles
(Section 3.2), I shall focus on the first four styles.

3.2 COMMON GRAPH OPTIONS


In this section I briefly discuss some of the general options of the graph command.
I will henceforth refer to these options as common_options.

• Saving a graph to disk: saving(filename [, replace]). If an extension is
not specified, .gph is assumed.

• Printing a graph: after the graph command, use the Print button in the Stata
toolbar.

• Multiple-imaging options: by(varname) is allowed for all styles except matrix
and star. It requests that graphs be drawn separately for the groups defined
by varname and be combined into a single image.

• Specifying titles: graph allows up to two titles on every side of the graph (top,
bottom, left and right), denoted by the options t1, t2, b1 (same as title or
ti), b2, l1, l2, r1, and r2. The first title (e.g. t1) is always the one farther from
the figure, the second (e.g. t2) the one closer to it. The argument of each option
is some text enclosed in quotes. Quotes can be omitted if the text contains no
special characters.
Example:
. graph y x, l1(y) b2(x) title("Figure 1: x-y scatterplot")

• Setting the gap: gap(#) sets the amount of space between the left title and
the values along the y-axis. The default is gap(8).

• Labeling axes: by default, graph labels just the minimum and maximum
of each variable. More aesthetically pleasing results may be obtained
with the options {x|y|r|t}label[(#,...,#)]. Typed without arguments,
{x|y|r|t}label chooses “round” values to be labelled.

• Adding ticks: graph automatically places tick marks on axes anywhere
they are labelled. Additional ticking may be obtained with the options
{x|y|r|t}tick[(#,...,#)].

• Adding lines: lines across the graph may be drawn with the options
{x|y|r|t}line[(#,...,#)]. The yline and rline options draw horizontal
lines, xline and tline draw vertical lines.

Example:
. graph y x, gap(4) xlabel ylabel(0,1) ytick(.5) yline(0,.5,1)

• Setting the scale: by default, graph scales each axis according to the minimum
and maximum of all things that go on the axis (data, labeling or ticking). The
options {x|y|r}scale(#,#) may be used to widen (but never to narrow)
the scale used for drawing a graph on any style that has an axis.

• Setting the axes rendition: by default, graph draws an axis on any style that
has an axis. The border option replaces axes with borders. The noaxis option
suppresses both axes and borders.

• Creating log scales: the log option is used with the histogram style, the
{x|y|r}log options with the twoway style.

• Plotting symbols: graph uses the following plotting symbols to specify the
location of a point on a scatterplot: O (large circle, default for twoway), S
(large square), T (large triangle), o (small circle, default for twoway with by, or
matrix), d (small diamond), p (small plus), . (dot), i (invisible), [varname]
(variable to be used as text), _n (observation number).
The sequence of plotting symbols for the variables in varlist is specified with
the option symbol(s . . . s), where s is any of the above symbols. If symbol is
not specified, twoway uses O, T and S for the first three variables and . for the
remainder. Combined with by(), it uses instead o, p and d, and . for the
remainder.

• Connecting points: graph offers the following alternatives to connect points on
a scatterplot: . (do not connect, default), l (straight lines between points), L
(straight line between ascending x-points), m (connect median bands using
straight lines), s (connect median bands using cubic splines), J (connect
rectilinearly making steps), || (connect two variables vertically (high-low)),
II (same as || but cap bottom and top of line).
How the variables in varlist are connected is specified by the option connect(s
. . . s), where s is any of the above alternatives. If connect is not specified,
points are not connected.
The connect option connects points in the order of the data, not the order
of the x-axis. graph includes a sort option that automatically sorts the data
according to the x-axis before graphing.

• Line patterns: For each line type, one can specify the pattern of the line by
adding a [pattern] after the line type, where pattern is any combination of
the following: l (a solid line, the default), _ (a long dash), - (a medium dash),
. (a short dash, almost a dot) and # (a space).
Example:
. graph y1 y2 y3 x, symbol(...) connect(ll[_]l[-])

The set textsize # command controls the size of the text used in a graph. This is
not an option but a separate command that must be issued before graph.

I refer to the Graphics Manual for other common options and to the remainder of this
chapter for options specific to the various graph styles.
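As a minimal sketch of how several of the common options combine, the fragment below builds a small artificial data set and draws a connected line plot. It assumes the Stata 7 graph syntax described in this chapter (hence the version statement); the variable names x and y, the titles and the file name mygraph are purely illustrative.

```stata
version 7.0
* artificial data: y is the square of x
drop _all
set obs 20
generate x = _n
generate y = x^2
* connect the points with straight lines, hide the markers,
* label both axes with round values, add titles and save to mygraph.gph
graph y x, connect(l) symbol(i) sort xlabel ylabel /*
    */ l1("y") b2("x") title("Figure 1: y against x") /*
    */ saving(mygraph, replace)
```

The sort option matters here because connect(l) joins points in the order of the data rather than in the order of the x-axis.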

3.3 HISTOGRAMS
This is the default for graph when only one variable is specified. The basic syntax is:

graph [variable] [weight] [if exp] [in range], histogram
[common_options bin(#) {freq|percent} normal[(#,#)]
density(#)]

where bin(#) specifies the number of (equally spaced) bins to use for constructing
the histogram (the default is bin(5)), freq and percent affect how the vertical
axis is labeled (respectively, in frequency units and in percent), normal[(#,#)]
overlays a normal density with specified mean and standard deviation (normal by
itself uses the observed mean and standard deviation), and density(#) (only used
with normal) specifies the number of points along the density to be calculated (the
default is density(100)).

Examples:

. graph x (draws a histogram of x)


. graph x, bin(15) (uses 15 bins for histogram)
. graph x, normal(10,3) (overdraws a normal density with mean 10 and
std. dev. 3)

3.4 TWO-WAY SCATTERPLOTS


This is the default for graph when more than one variable is specified. twoway may be
combined with oneway or box, but in that case, one must specify twoway explicitly.

The basic syntax is:

graph [yvars xvar] [weight] [if exp] [in range], twoway
[common_options rescale rbox {y|x|r}reverse]

where rescale scales each y-variable independently (if there are two y-variables, the
scale of the first is presented on the left axis and the scale for the second on the right
axis; if there are more than two y-variables, no vertical scale is labeled), rbox places a
rangefinder box plot on the graph, and {y|x|r}reverse reverses the indicated scale
to run from high-to-low.

Examples:

. graph y x (graph of y against x)


. graph z y x, rescale (graph of z and y against x)

3.5 TWO-WAY SCATTERPLOT MATRICES


A two-way scatterplot matrix is a set of two-way scatterplots arranged in a matrix.
The basic syntax is:
graph [varlist] [weight] [if exp] [in range], matrix
[common_options half]
where half draws only the lower half of the matrix.
Example:
. graph x y z if z>0, matrix

3.6 BOX PLOTS


A box plot is a graphical procedure with the following features: (i) it combines a
measure of location (the median) and a measure of spread (the interquartile range),
(ii) it shows the presence of possible outliers, and (iii) it provides some indication about
the shape of the distribution of the data in terms of their symmetry or skewness. The
basic syntax is:
graph [varlist] [weight] [if exp] [in range], box
[common_options [no]alt vwidth root]
where [no]alt forces the labeling of the groups to be on a single line (noalt) or on multiple
lines, vwidth makes the width of the box proportional to the number of observations,
and root (only used with vwidth) makes the width of the box proportional to the
square root of the number of observations.
Examples:
. graph y x, box (graphs box-and-whiskers for y and x)
. graph y, box by(z) (graphs box-and-whiskers for y by z groups)
. graph y x, box by(z) (graphs box-and-whiskers for y and x by z groups)
4

Programming and Matrix Commands
In this chapter I discuss the elements of Stata programming and of the Stata matrix language.

4.1 PROGRAMMING STATA


The capabilities of Stata may be extended considerably by using programs. A Stata
program is just a sequence of Stata commands enclosed between the commands
program define progname and end. Thus, the general structure of a Stata program
is
program define progname
Stata commands
end
Programs must be defined (loaded in memory) before they can be used. The simplest
way to do so is to type the commands directly from the keyboard. This is not
recommended, however, unless the program is short. Alternative ways of defining
programs are described in Sections 4.2 and 4.3.
Programs are executed by typing progname. The display of the underlying commands
is suppressed.
Programs may call other programs. Stata allows programs to be nested 32 deep.

4.1.1 MACROS
A macro is a user-defined string of characters, called the macro name, that stands for
another string of characters, called the macro content.
Stata has two types of macros, local and global. Local macros are private, that is,
specific to the program where they are defined; global macros are public. Their content
is set respectively by the local and global commands. Their general syntax is
{local|global} mname [[`]"[string]"[´]|= exp|: extended_fcn]
where the macro name mname can be up to 7 characters long for local macros and
up to 8 characters for global macros, and exp may be either a numeric or a string
expression. For the use of an extended macro function see the Stata manual.

To copy string to mname (the maximum length of string is 18,623 characters) use:
{local|global} mname "string"
To evaluate exp and store the result in mname (the maximum length of exp is 80
characters) use:
{local|global} mname = exp
Macros can be used everywhere in programs, do-files and ado-files (see below). The
content of a local macro is accessed by enclosing the macro name in `´, that of a global
macro by prefixing it with a $. This simply replaces the name of the macro with its
content.
Examples:
local options "gap(4) sy(i) xlab(10) ylin"
graph x y, `options´
sort z
global options "by(z) gap(4) sy(.) c(l) xlab ylab"
graph x y, $options
If the content of a macro itself contains double quotes, compound double quotes `""´
may be used to define it.
Typing macro drop mname eliminates the global macro mname. Typing macro drop
_all eliminates all global macros.
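The difference between copying a string into a macro and storing the result of an expression can be sketched as follows. The macro names a, b and vars are illustrative; in actual Stata input the local macro quotes are typed as a left single quote followed by a plain apostrophe.

```stata
* copy the string 2+2 into a, but evaluate 2+2 and store the result in b
local a "2+2"
local b = 2+2
display "`a'"
display `a'
display "`b'"
* a global macro is referenced by prefixing its name with $
global vars "x y z"
display "$vars"
macro drop vars
```

The first display shows the string 2+2, the second substitutes the content of a and evaluates it, showing 4, and the third shows the stored result 4.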

4.1.2 SYSTEM MACROS


In addition to user-defined macros, Stata has a number of built-in global system macros
that begin with the characters S_.
Examples:
$S_DATE : contains the current date in the format dd mon yyyy
$S_TIME : contains the current time in the format hh:mm:ss
$S_FN : contains the filename last specified with use or save
User-written programs may examine the content of system macros and, in most cases,
change it. However, the content of system macros such as S_DATE and S_TIME cannot be
changed, and typing macro drop _all does not eliminate system macros.

4.1.3 LOOPING
Stata provides two commands for looping, while and forvalues. The syntax of while
is simpler:
while exp {
Stata commands
}

This command evaluates exp and, if it is true (nonzero), executes the commands
enclosed in the braces. It then repeats the process until exp evaluates to false (zero).
while loops may be nested within one another. If exp refers to any variables, their values in
the first observation are used unless explicit subscripts are specified.
Example: The following code fragment may be used to iterate Stata commands 10
times
local i = 1
local I = 10
while `i´<=`I´ {
Stata commands
local i = `i´+1
}
The while command may also be used interactively.
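As a concrete instance of this pattern, the following sketch accumulates the sum of the integers from 1 to 10 in a local macro. The macro names i and total are illustrative.

```stata
local i = 1
local total = 0
while `i' <= 10 {
    * add the current value of the counter, then increment it
    local total = `total' + `i'
    local i = `i' + 1
}
display "The sum of 1 to 10 is `total'"
```

The loop ends when the counter reaches 11, at which point total contains 55.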

4.1.4 BRANCHING
The syntax of the if programming command is:
if exp {
Stata commands
}
else {
other Stata commands
}
This command evaluates exp. If the result is true (nonzero), the commands inside the
braces are executed. If the result is false (zero), those statements are ignored and the
statements following the else, if specified, are executed.
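Do not confuse this command with the if qualifier at the end of a command: the if
command evaluates exp only once (if exp refers to any variables, their values in the first
observation are used), whereas the if qualifier is evaluated for each observation.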
Example:
if x>0 {
replace y = log(x)
}
else if x<0 {
replace y = log(-x)
}
else {
replace y = .
}
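The distinction between the if command and the if qualifier can be sketched on a small artificial data set. The variable names are illustrative, and the condition tested by the if command is deliberately one that does not depend on a particular observation.

```stata
drop _all
set obs 3
generate x = _n - 2
generate y = .
* if qualifier: the condition is checked separately for every observation
replace y = log(x) if x > 0
replace y = log(-x) if x < 0
* if command: the condition is evaluated only once
if _N > 2 {
    display "more than two observations"
}
else {
    display "two observations or fewer"
}
```

Here x takes the values -1, 0 and 1, so the two replace commands set y to 0 in the first and third observations and leave it missing in the second.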

4.1.5 PROGRAM ARGUMENTS


Programs may take arguments, just like functions. Unlike functions, however, the
arguments of a program are not enclosed in parentheses but simply follow the program
name.

For example, if prog1 is a program and we type


prog1 x y
then x and y are the program’s arguments, respectively the first and the second
argument.
Arguments are passed to programs via local macros: `0´, `1´, `2´, . . . , where the local
macro `0´ is exactly what the user typed, `1´ is the first argument of the program,
`2´ the second argument, etc.
The positional macros `1´, `2´, etc., may be renamed to facilitate reading and
understanding of a program. Thus, for example, the following two programs both
produce a sequence of n equally spaced numbers on the interval [a, b]:
program define prog1
drop _all
set obs `1´
generate x = `2´+(_n-1)/(_N-1)*(`3´-`2´)
end
program define prog2
args n a b
drop _all
set obs `n´
generate x = `a´+(_n-1)/(_N-1)*(`b´-`a´)
end
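A program of this kind might be used as follows. The sketch repeats the args-based version under the hypothetical name mygrid, so that it can be run on its own, and calls it with n = 5, a = 0 and b = 1.

```stata
capture program drop mygrid
program define mygrid
    args n a b
    drop _all
    set obs `n'
    generate x = `a' + (_n-1)/(_N-1)*(`b'-`a')
end
* five values running from 0 to 1 in steps of 0.25
mygrid 5 0 1
list
```

The capture program drop line simply makes the fragment safe to rerun, since program define fails if a program of the same name is already in memory.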
Sometimes programs involve a variable number of arguments, with the same thing
done to each argument. An example is the summarize command, which may be applied
to one, two or more variables.
Programs with this feature may be coded by shifting through their arguments:
program define myprog
while "`1´" ~= "" {
Stata commands in terms of `1´
macro shift
}
end
where macro shift shifts `1´, `2´, `3´, . . . , one to the left: what was `1´ disappears,
what was `2´ becomes `1´, what was `3´ becomes `2´, etc. The while loop
continues the process until macro `1´ is empty.
An alternative is the following:
program define myprog
local i = 1
while "``i´´" ~= "" {
Stata commands in terms of ``i´´
local i = `i´+1
}
end
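A complete, runnable version of the shifting approach might look like the sketch below. The program name countargs is hypothetical; it lists its arguments one by one and returns their number in r(k), using the rclass option and the return scalar command discussed in Section 4.1.7.

```stata
capture program drop countargs
program define countargs, rclass
    local k = 0
    while "`1'" ~= "" {
        local k = `k' + 1
        display "argument `k' is `1'"
        * discard the current first argument and move the others up
        macro shift
    }
    return scalar k = `k'
end
countargs alpha beta gamma
display r(k)
```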

4.1.6 TEMPORARY OBJECTS


Programs often require objects (variables, scalars, matrices, data, etc.) that are
temporary, that is, can be discarded once the program completes.
Stata provides three commands to deal with this: tempvar creates names for temporary
variables, tempname creates names for temporary scalars and matrices, and tempfile
creates names for temporary files. They all have the same syntax:
{tempvar|tempname|tempfile} mname [mname . . . ]
The command creates local macros containing names one may use.
Example:
...
tempvar x y
gen `x´ = exp
gen `y´ = exp
...
The drop `x´ `y´ command is not necessary when the program completes, because
Stata automatically drops any variables with names assigned by tempvar.
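The sketch below shows temporary objects at work inside a small program. The program name meandev and the returned name mad are hypothetical; the program computes the mean absolute deviation of a variable using a temporary scalar for the mean and a temporary variable for the absolute deviations.

```stata
capture program drop meandev
program define meandev, rclass
    args v
    tempvar d
    tempname m
    quietly summarize `v'
    scalar `m' = r(mean)
    * absolute deviations from the mean, held in a temporary variable
    quietly generate `d' = abs(`v' - `m')
    quietly summarize `d'
    return scalar mad = r(mean)
end
drop _all
set obs 4
generate x = _n
meandev x
display r(mad)
```

After the call the data set still contains only x, because the temporary variable and scalar are discarded when the program ends.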

4.1.7 EXCHANGING RESULTS BETWEEN PROGRAMS


Stata commands that report results save them in places where they can be
subsequently used by other commands or programs. Most commonly, commands save
results in one of two places:

1. r-class commands (such as summarize) save their results in r(),

2. e-class (estimation) commands (such as regress) save their results in e().

Typing return list after an r-class command or estimates list after an e-class
(estimation class) command summarizes what the command saved.
Results saved in r() and e() come in three flavors: scalars, macros and matrices. For
example, the number of observations used by a command is saved in the scalar r(N)
or e(N). After an e-class command, the command name and the name of the response
(dependent) variable are saved in the macros e(cmd) and e(depvar), whereas the
estimated coefficients and their variance matrix are saved in the matrices e(b) and
e(V).
Regardless of their flavor, one may refer to saved results in two ways. One is
simply to type r(name) or e(name). The other is to use macro substitution characters
to produce `r(name)´ or `e(name)´.

Example: After regress


. display "You can refer to " e(cmd) " or to `e(cmd)´"
You can refer to regress or to regress
Notice that after running an r-class command, running another one would change the
content of r() but not the content of e(). On the other hand, running a new e-class
command may change the content of both e() and r(). Thus, if one wants to access
the results produced by a command, it is important to do so immediately.
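One way to follow this advice is to copy the results of interest into a scalar (or a macro) right after the command that produced them. In the sketch below, the data and the scalar name mean_x are illustrative.

```stata
drop _all
set obs 5
generate x = _n
generate y = 10*_n
summarize x
* store the mean of x before another r-class command overwrites r()
scalar mean_x = r(mean)
summarize y
display "mean of x = " mean_x ", mean of y = " r(mean)
```

After the second summarize, r(mean) refers to y, while the mean of x survives in the scalar mean_x.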
User-defined programs may save their results if their class is specified on the program
define line through the option rclass or eclass, depending on whether the program
is intended to be r- or e-class. The code to save results in r() is
return scalar name = exp
return local name ...
return matrix name matname
while the code to save results in e() is
estimates scalar name = exp
estimates local name ...
estimates matrix name matname
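A small r-class program that saves one result of each flavor (a scalar, a macro and a matrix) might be sketched as follows. The program name myrange and the names of the saved results are hypothetical.

```stata
capture program drop myrange
program define myrange, rclass
    args v
    tempname R
    quietly summarize `v'
    return scalar N = r(N)
    return local varname "`v'"
    * 1 x 2 matrix holding the minimum and the maximum
    matrix `R' = (r(min), r(max))
    return matrix range `R'
end
drop _all
set obs 5
generate x = _n
myrange x
return list
```

Typing return list after the call shows r(N), r(varname) and r(range), just as it would after a built-in r-class command.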

4.2 DO FILES
A do-file is a standard ASCII (text) file containing a sequence of Stata commands, a
separate command on each line. The sequence of commands is executed using the do
or run commands, whose syntax is
{do|run} filename [arguments] [, nostop]
where nostop allows the do-file to continue executing even if an error occurs. If
filename is specified without an extension, .do is assumed.
The difference between do and run is that do echoes the commands and their output,
while run is silent.
A do-file completes the execution when: (i) the end of the file is reached, (ii) an exit is
executed, or (iii) an error (nonzero return code) occurs (pressing Break while executing
a do-file causes a nonzero return code and therefore stops the do-file).
A do-file may be used to define one or more programs or may call programs already
defined. Do-files may also call other do-files. As for programs, Stata allows do-files to
be nested 32 deep.
Here are some rules and recommendations for constructing a do-file.

• Start a do-file by typing version #, where # is the Stata release under which
the file was written. This allows the do-file to run under later releases.

• Blank lines and comments may be included freely. Their proper use may
considerably enhance understanding of a program.

• Comments may be included either by beginning a line with a ‘*’ (a star), or
by placing the comment in /* */ delimiters. The /* */ delimiters can be put
anywhere, at the end of a line or even in the middle.
Example:
version 7.0 /* do-file written under Stata 7.0 */
* read in the data
use mydata, clear
* summary statistics
summarize x y z, detail

• To avoid lines wider than the screen, the end-of-line delimiter may be changed
from carriage return to, say, ‘;’ by including the #delimit ; command. The
delimiter may later be changed back to carriage return by including the
#delimit cr command.

• The output of a do-file may be sent to a log file by including the command
log using filename [, replace]. Logging stops and the log file is closed
when log close is encountered. Output to the log file is suppressed if run is
used to execute a do-file.

• To prevent Stata from pausing when the screen is full, include the set more
off command.

Do-files accept arguments, just like programs. Arguments are stored in local macros
`1´, `2´, and so on. For example, to repeat the same set of instructions for different
variables one could write the do-file try.do
use mydata, clear
drop if `1´==.
summarize `2´, detail
and then execute it by typing
. do try x y
The second command (drop if `1´==.) would be interpreted as drop if x==.
because x is the first argument typed after do try; the third command (summarize
`2´) would be interpreted as summarize y because y is the second argument typed
after do try.

4.3 ADO FILES


An ado-file defines a Stata command, although many commands (e.g. summarize or
regress) are not defined by ado-files but are built directly into Stata.
An ado-file is an ASCII (text) file that contains a Stata program which defines
(implements) a command. For example, the ci command produces confidence intervals
and is implemented as an ado-file. This means that a file called ci.ado is stored in
some directory that Stata can access.

Ado-files typically come with an associated help-file. Typing help ci (or pulling down
Help and searching for ci) prompts Stata to look for the file ci.hlp, just as it does
for the file ci.ado after the command ci is typed.
Stata looks for ado-files (and the associated help-files) in several places: the official ado-
directories (the base directory and the updates directory), the personal ado-directories,
and the current directory.

4.4 MATRIX COMMANDS


A Stata matrix is a rectangular array of double-precision numbers, none of which can
be missing, and which is bordered by a row and a column of names. A vector is a
special case of a matrix. Although Stata has scalars, they could also be handled as
special cases of a matrix. Matrices can be used interactively or in programs, do-files
and ado-files.
By default, the maximum matrix size is 40 × 40. The maximum matrix size can be
increased to 800×800 by issuing the command set matsize 800. Thus, Stata matrices
are unsuited for holding large amounts of data.

4.4.1 ROW AND COLUMN NAMES


Stata matrices always have row and column names. These names are used to produce
“pretty” output. The matrix list command displays a matrix with its row and
column names.
Row and column names have three parts: equation_name:ts_operator.subname. The
first two parts may be blank.
Row and column names may be reset using the matrix rownames and matrix
colnames commands.
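The following sketch resets the row and column names of a small matrix and then uses the names, rather than numeric positions, to pick out an element. The matrix A and the names price and weight are illustrative.

```stata
matrix A = (1, 2 \ 3, 4)
matrix rownames A = price weight
matrix colnames A = price weight
matrix list A
* element in the row named weight and the column named price
display A[rownumb(A,"weight"), colnumb(A,"price")]
```

The functions rownumb() and colnumb() translate a name into the corresponding row or column number, so the last line displays the element A[2,1], which is 3.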

4.4.2 SUBSCRIPTING AND SUBMATRICES


The basic syntax for subscripting is
matrix A = ...B[r,c] ...
where r and c are numeric or string scalar expressions.
Examples:
. matrix A = A/A[1,1]
. matrix B = A["weight","displ"]
. matrix D = G[1,"eq1:l1.gnp"]
The basic syntax for extracting submatrices is
matrix A = ...B[r0..r1, c0..c1] ...
where r0, r1, c0, and c1 are numeric or string scalar expressions.
Examples:

. matrix A = B[2..4, 3..6]


. matrix A = B[2..., 2...]
. matrix A = B[1, "price".."mpg"]
. matrix A = B["eq1:", "eq1:"]
The basic syntax for substituting submatrices is
matrix A[r,c] = ...
where r and c are numeric scalar expressions.
If the matrix expression to the right of the equal sign evaluates to a scalar or a 1 × 1 matrix,
the indicated element of A is replaced. If the matrix expression evaluates to a matrix,
the resulting matrix is placed in A with its upper left corner at (r, c).
Examples:
. matrix A[2,2] = B
. matrix A[rownumb(A,"price"), colnumb(A,"mpg")] = sqrt(2)
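Extraction and substitution of submatrices can be sketched on a 3 × 3 matrix whose contents are chosen only for illustration.

```stata
matrix B = (1, 2, 3 \ 4, 5, 6 \ 7, 8, 9)
* rows 2 to 3 and columns 2 to the last
matrix A = B[2..3, 2...]
matrix list A
* overwrite the upper left 2 x 2 block of B with zeros
matrix B[1,1] = J(2,2,0)
matrix list B
```

Here A contains the rows (5, 6) and (8, 9), while after the substitution B has zeros in its upper left 2 × 2 block and is otherwise unchanged.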

4.4.3 MATRIX OPERATORS AND FUNCTIONS


The matrix operators are:
• -B (negation),

• B' (transposition),

• B \ C (adds the rows of C below the rows of B),

• B , C (adds the columns of C to the right of the columns of B),

• B + C (addition),

• B - C (subtraction),

• B * C (multiplication, including multiplication by a scalar),

• B / z (division by a scalar z),

• B # C (Kronecker product).
Parentheses may be used to enforce a particular order of evaluation.
Example:
. matrix C = (B + B')/2
The matrix functions returning a scalar are:

• mreldif(B,C) (relative difference),

• trace(B) (trace of a square matrix),

• det(B) (determinant of a square matrix),



• diag0cnt(B) (number of zeros on diagonal),

• rowsof(B) (number of rows),

• colsof(B) (number of columns),

• rownumb(A,s) (the first row number named s, where s is a string or string
expression),

• colnumb(A,s) (the first column number named s, where s is a string or string
expression),

• el(A,i,j) (the (i, j) element of A; this function is the same as A[i,j]).

Matrix functions returning a scalar may be used in any expression context, not just
matrix expression contexts.
The matrix functions returning a matrix are:

• I(n) (n × n identity matrix),

• J(n,m,z) (n × m matrix containing the constant element z),

• get(mname) (returns the system matrix mname),

• syminv(B) (inverse of a symmetric matrix; if B is not positive definite,
returns a generalized inverse),

• inv(B) (inverse of a square matrix),

• sweep(B,j) (sweep of a square matrix; returns B with jth row/column
swept),

• cholesky(B) (Cholesky decomposition of a symmetric matrix),

• corr(B) (correlation transform),

• diag(V) (V is a row or column n-vector; it returns a diagonal n × n matrix
with diagonal elements equal to those of V),

• vecdiag(B) (returns a row vector containing the diagonal of a square matrix).

Examples:
. matrix beta = syminv(X'*X)*X'*y
. display trace(X)
. matrix L = cholesky(0.1*I(rowsof(X)) + 0.9*X)
There are matrix utilities to list the currently defined matrices (matrix dir), display
the contents of a matrix (matrix list), rename a matrix (matrix rename) and drop
a matrix (matrix drop). The matrix drop _all command drops all matrices.
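Several of these operators and functions can be tried on a small symmetric matrix. In the sketch below the matrix names are illustrative.

```stata
matrix B = (2, 1 \ 1, 3)
matrix C = (B + B')/2
* Kronecker product with the 2 x 2 identity gives a 4 x 4 matrix
matrix K = B # I(2)
* B times its inverse should reproduce the identity matrix
matrix P = B*syminv(B)
matrix Id = I(2)
display trace(B)
display det(B)
display mreldif(P, Id)
matrix dir
```

For this matrix the trace is 2 + 3 = 5 and the determinant is 2·3 − 1·1 = 5, while the relative difference between P and the identity matrix should be zero up to rounding error.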

4.4.4 CROSS-PRODUCT MATRICES


Statistical computations often involve matrix operations such as X'X or X'WX. In
these cases, X usually has a large number of rows and a small to moderate number
of columns, whereas W takes on a restricted form (diagonal, block diagonal, or is
known in some functional form and need not be stored). Computing X'X or X'WX
by storing the matrices and then directly performing the matrix multiplications is
inefficient and wasteful. Stata has a number of commands to compute these results
efficiently.
The matrix accum command accumulates cross-product matrices from the data to
form A = X'X.
The matrix glsaccum command accumulates cross-product matrices from the data
using a specified inner weight matrix to form A = X'BX, where B is a block diagonal
matrix.
The matrix vecaccum command accumulates the first variable against the remaining
variables to form the row vector a = X1'X, where X = (X2, X3, . . .).
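As an illustration, the least squares coefficients of a regression of y on x can be obtained from accumulated cross-products without ever storing the data as a matrix. The sketch below uses artificial data in which y is an exact linear function of x, and relies on the fact that both commands add a constant term to the variable list unless the noconstant option is specified.

```stata
drop _all
set obs 5
generate x = _n
generate y = 1 + 2*x
* X'X for X = (x, constant)
matrix accum XX = x
* y'X for the same X
matrix vecaccum yX = y x
* row vector of least squares coefficients
matrix b = yX*syminv(XX)
matrix list b
```

Since y = 1 + 2x holds exactly, b contains the slope 2 in the column for x and the intercept 1 in the column for the constant, up to rounding error.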

4.4.5 DATA TO MATRIX CONVERSION


Variables can be converted into matrices and matrices into variables through the mkmat
and svmat commands. The mkmat command stores the k variables listed in varlist in
k column vectors of the same name. Optionally, if matrix() is specified, they can
be stored as a single matrix. The svmat command is the reverse of mkmat: it takes a
matrix and stores its columns as new variables.
Their syntax is
mkmat varlist [if exp] [in range] [, matrix(matname)]
svmat [type] A [, names(col|eqcol|matcol|string)]
where type is a storage type for new variables, A is the name of an existing matrix,
and the names(col|eqcol|matcol|string) option specifies how the new variables are
to be named.
The related command
matname A namelist [, rows(range) columns(range) explicit]
renames the rows and columns of a matrix.
Examples:
. mkmat mpg
. mkmat foreign weight displ, matrix(X)
. matrix b = syminv(X'*X) * X'*mpg
. mkmat bvector1 if bvector1~=.
. matrix list bvector1
. matrix d = bvector1'
. matname d wei gr for _cons, c(.)
. matrix list d
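A round trip from variables to a matrix and back can be sketched as follows. The variable names a and b, the matrix names X and Y and the stub z are illustrative.

```stata
drop _all
set obs 3
generate a = _n
generate b = _n^2
* store both variables as the columns of a single 3 x 2 matrix
mkmat a b, matrix(X)
matrix Y = 2*X
* save the columns of Y as the new variables z1 and z2
svmat Y, names(z)
list
```

With names(z), svmat names the new variables by appending the column number to the stub z.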

4.4.6 GETTING SYSTEM MATRICES


The usual way to obtain matrices after a command that produces matrices is to refer
to the returned matrix in the standard way. For example, all estimation commands
(see Section 5.1) return the coefficient vector e(b) and the variance-covariance matrix
e(V) of the estimates. These matrices can be referenced directly.
Examples:
. matrix list e(b)
. matrix S = vecdiag(e(V))
Other matrices are returned by various commands. They are obtained in the same
way. Alternatively, the matrix get command also obtains matrices after certain
commands.
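For instance, after a linear regression the coefficient vector and its variance matrix can be copied into ordinary matrices and then manipulated. The sketch below uses artificial data; the matrix names b and V are illustrative.

```stata
drop _all
set obs 6
generate x = _n
generate y = 1 + 2*x + mod(_n,2)
regress y x
* copy the system matrices into ordinary matrices
matrix b = e(b)
matrix V = e(V)
matrix list b
display "slope = " b[1,1] ", std. err. = " sqrt(V[1,1])
```

The first column of b and the first diagonal element of V refer to x, the second to the constant, so the displayed values agree with the coefficient and standard error reported by regress.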

4.4.7 MATRIX DECOMPOSITION


The matrix symeigen command returns the eigenvectors in the columns of the n × n
matrix X and the corresponding eigenvalues in the n-vector V . The eigenvalues are
sorted from largest to smallest: V[1,1] contains the largest eigenvalue and X[1...,1]
its corresponding eigenvector; V[1,2] contains the second largest eigenvalue and
X[1...,2] its corresponding eigenvector, and so on.
The singular value decomposition of an m × n matrix A (with m ≥ n) is
carried out through the matrix svd command. This command returns an m × n
matrix U, a row n-vector W and an n × n matrix V such that A = U diag(W) V'.
In addition, the columns of U are orthogonal, the elements of W are positive or zero,
and V is orthonormal.
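The eigenvalue decomposition can be checked on a small symmetric matrix whose eigenvalues are known to be 3 and 1. The matrix names in this sketch are illustrative.

```stata
matrix A = (2, 1 \ 1, 2)
matrix symeigen X V = A
* eigenvalues, sorted from largest to smallest
matrix list V
* rebuild A from its eigenvectors and eigenvalues
matrix D = X*diag(V)*X'
display mreldif(D, A)
```

The relative difference displayed in the last line should be zero up to rounding error.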
5

Statistical Inference Using Stata


5.1 ESTIMATION
5.1.1 GENERAL SYNTAX OF ESTIMATION COMMANDS
The general syntax of an estimation command is:
command varlist [weight] [if exp] [in range] [, options]
The first variable in varlist is the response or outcome variable, denoted by yvar, the
other variables are the covariates or predictors, denoted by xvars.
The general syntax for multiple-equation commands, namely commands that estimate
systems of equations, is similar:
command (varlist) (varlist) ... (varlist) [weight] [if exp]
[in range] [, options]
All estimation commands share the following common features:

• To review the last estimates, just type the estimation command without
arguments.

• In addition to the estimated parameters and their standard errors, confidence
intervals for the coefficients are displayed. The confidence level may be set
using the level(#) option, where # is the desired percentage level. The
default is level(95).

• The estimated variance matrix of the estimators is computed under the
assumption that the statistical model is correctly specified, but some
commands allow for certain forms of model misspecification with the robust
option.

• After estimation, one may obtain the estimated variance matrix of the
estimators using the vce command (Section 5.2.2).

• After estimation, one may obtain predictions, residuals and influence statistics
using the predict command (Section 5.2.3).

• After estimation, one can perform tests of hypotheses about the model
parameters (Section 5.2.4).

• The command lincom computes point estimates, standard errors, t statistics,
p-values and confidence intervals for a linear combination of coefficients after
any estimation command except anova.

5.1.2 WEIGHTED ESTIMATION


Specifying weights allows weighted estimation. For example
. regress y x1 x2 [pweight=w]
gives a weighted least squares regression of y on x1 and x2 using the probability weights
contained in the variable w.

5.1.3 CONSTRAINED ESTIMATION


Several commands (e.g. cnsreg) allow estimation subject to linear constraints on the
model parameters through the constraint(clist) option, where clist is of the form
#[-#][, #[-#] ...], with 1 ≤ # ≤ 999.
The constraint command defines, lists and drops linear constraints. Its syntax is:
constraint define # [exp=exp|coefficientlist]
constraint dir [clist|_all]
constraint drop {clist|_all}
constraint list [clist|_all]
where coefficientlist lists the variables whose coefficients are set equal to zero.
The following example estimates the linear model E(Y ) = α + β1 X1 + · · · + β6 X6
subject to the constraints that β1 = β2 = β3 = β6 and β4 = −β5 = α/10:
. constraint define 1 x1=x6
. constraint define 2 x2=x6
. constraint define 3 x3=x6
. constraint define 4 x4=-x5
. constraint define 5 x4=_cons/10
. cnsreg y x1-x6, constraint(1-3,4-5)
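A smaller case that is easy to verify is a regression with a single equality constraint between two slopes. The sketch below uses artificial data and assumes that constraint number 1 is free to be (re)defined.

```stata
drop _all
set obs 8
generate x1 = _n
generate x2 = mod(_n,3)
generate y = 1 + x1 + 2*x2 + mod(_n,2)
* impose equality of the coefficients on x1 and x2
constraint define 1 x1=x2
cnsreg y x1 x2, constraint(1)
display _b[x1] - _b[x2]
```

After cnsreg, the difference between the two estimated coefficients should be zero up to rounding error, confirming that the constraint has been imposed.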

5.1.4 ROBUST VARIANCE ESTIMATES


Some commands (e.g. regress or glm) offer the option of estimating the variance
matrix of the parameter estimators by relaxing the assumption that the statistical
model is correctly specified and allowing for certain forms of model misspecification.
The robust option relaxes the assumption that the observations are identically
distributed, thus allowing for heteroskedasticity of unknown form.
The following example estimates the linear model E(Y) = α + β1 X1 + β2 X2, first
without and then with heteroskedasticity-robust variance estimates:
. regress y x1 x2
. regress y x1 x2, robust

The robust cluster(varname) option only requires observations to be independent
across the clusters defined by the variable varname, thus relaxing the assumption of
independence within clusters.
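As a minimal sketch, assume a hypothetical variable id that identifies the clusters (for example households or firms; the name is not taken from the text). Cluster-robust variance estimates for the model above could then be requested as follows:
. regress y x1 x2, robust cluster(id) /* id is an assumed cluster identifier */
The coefficient estimates are the same as those from regress y x1 x2; only the estimated standard errors change.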

5.2 POST-ESTIMATION COMMANDS


5.2.1 ACCESSING COEFFICIENTS AND STANDARD ERRORS
After a (single-equation) estimation command, _b[varname] (or _coef[varname]) and
_se[varname] contain the coefficient on varname and its standard error, both recorded
to machine precision. After a multiple-equation estimation command, use
[eqno]_b[varname] (or simply [eqno][varname] or [eqno]varname) and
[eqno]_se[varname], where eqno is the equation number.
The command mfx produces tables displaying the marginal effects or the elasticities
(and their standard errors) instead of the estimated coefficients.
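A short sketch of how these stored results might be used after a regression; y, x1, x2 and the new variable yhat are placeholder names:
. regress y x1 x2
. display _b[x1] /* coefficient on x1 */
. display _se[x1] /* its standard error */
. display _b[x1]/_se[x1] /* t statistic for a zero coefficient on x1 */
. generate yhat = _b[_cons] + _b[x1]*x1 + _b[x2]*x2 /* fitted values computed by hand */
The intercept is referred to as _cons, as in the constraint example of Section 5.1.3.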

5.2.2 DISPLAYING THE VARIANCE ESTIMATES


After model estimation, the estimated variance or correlation matrix of the estimators
is displayed using the command
vce [, corr rho]
where corr and rho are synonyms; either one displays the correlation matrix instead of
the variance matrix.
To obtain a copy of the estimated variance matrix for further manipulation, type
matrix matname = e(V)
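For instance, assuming a regression of y on x1 and x2 (placeholder names) and a matrix called V (also just an illustrative name), the following sketch displays the variance matrix, copies it, and recovers a standard error from it:
. regress y x1 x2
. vce /* estimated variance matrix */
. vce, corr /* estimated correlation matrix */
. matrix V = e(V) /* copy of the variance matrix */
. matrix list V
. display sqrt(V[1,1]) /* should agree with _se[x1], since x1 is the first regressor */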

5.2.3 PREDICTIONS AND RESIDUALS


The predict command calculates predictions, residuals and influence statistics after
estimation. What predict can do depends to some extent on the previous estimation
command. The general features of predict are:

1. The predict newvarname command creates newvarname containing the
“predicted values” of the response. For example, after linear regression,
predict newvarname creates the fitted values β̂′Xi. After probit, it creates
the estimated probit probabilities Φ(β̂′Xi).

2. The predict newvarname, xb command creates newvarname containing
the linear prediction β̂′Xi. For linear models, this command produces the
same result as predict newvarname.

3. The predict newvarname, stdp command creates newvarname containing
the standard error of the linear prediction.

4. Adding the nooffset option to any of the above causes the calculation to
ignore any offset or exposure variable specified in the estimation command.

5. predict can be used to make in-sample or out-of-sample predictions.

6. In general, predict calculates the requested statistic for all observations
possible, whether they were used in estimating the model or not.

7. One can restrict the prediction to the estimation sample by typing
. predict newvarname if e(sample), ...

8. Some statistics make sense only with respect to the estimation sample. In such
cases, the calculation is automatically restricted to the estimation sample.

9. Out-of-sample predictions may be obtained by applying predict to other
datasets.
Example:
. use data1
(model estimation commands)
. use data2 /* another dataset */
. predict hat, ... /* fill in the predictions */
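The general features listed above can be illustrated with a small sketch after a linear regression. All variable names (y, x1, x2, yhat, e, se_xb, yhat_in) are placeholders:
. regress y x1 x2
. predict yhat /* fitted values (xb is the default after regress) */
. predict e, residuals /* residuals */
. predict se_xb, stdp /* standard error of the linear prediction */
. predict yhat_in if e(sample) /* fitted values for the estimation sample only */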

The options of predict are:

• xb calculates the linear prediction from the estimated model.

• stdp calculates the standard error of the linear prediction.

• stddp is allowed only after multiple-equation estimation commands. It
computes the standard error of the difference in linear predictions between
two equations.

• equation(eqno[,eqno]) is only relevant after multiple-equation estimation
commands. It specifies to which equation one is referring.
equation(#1) means that calculations are to be made for the first
equation, equation(#2) that they are to be made for the second, and so
on. Alternatively, one could refer to the equations by their names. For
example, equation(income) would refer to the equation named income,
equation(hours) to the one named hours.
equation(#1) is the default when equation() is not specified.
Other statistics (for example stddp) refer to between-equation concepts. In
those cases, one may use equation(#1,#2) or equation(income,hours).
When two equations must be specified, equation() is not optional.

• The nooffset option may be combined with most statistics and specifies
that the calculation should be made ignoring any offset or exposure variable
specified when the model was estimated. This option is available even if not
documented for predict after a specific command. If neither the offset()
nor the exposure() option was specified at the model estimation stage,
specifying nooffset does nothing.

• other_options refers to command-specific options that are documented with
each command.

5.2.4 HYPOTHESIS TESTING


The test command performs Wald-type tests of linear hypotheses, the testnl
command performs Wald-type tests of nonlinear (or linear) hypotheses, and the lrtest
command performs likelihood-ratio tests after ML estimation.
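A minimal sketch of Wald-type tests after a regression, using placeholder variables y, x1 and x2; the particular hypotheses are chosen only for illustration:
. regress y x1 x2
. test x1 = x2 /* H0: the coefficients on x1 and x2 are equal */
. test x1 x2 /* H0: both coefficients are zero (joint test) */
. testnl _b[x1]*_b[x2] = 1 /* a nonlinear hypothesis */
. lincom x1 + x2 /* estimate and standard error of the sum of the two coefficients */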

5.3 BOOTSTRAPPING AND MONTE CARLO SIMULATIONS


Bootstrap and Monte Carlo simulations rely on Stata’s uniform() random number
generator. Reproducibility of the results requires setting the random-number seed by
typing set seed #.
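For example, the following sketch creates 100 observations and fills a placeholder variable u with uniform pseudo-random numbers; the seed value 12345 is arbitrary:
. clear
. set obs 100 /* create 100 empty observations */
. set seed 12345 /* arbitrary seed, chosen only for illustration */
. generate u = uniform() /* pseudo-random draws from the uniform distribution */
Rerunning these lines with the same seed should reproduce the same values of u.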

5.3.1 BOOTSTRAP
The command
bstrap progname [, reps(#) size(#) dots args(...) level(#)
cluster(varnames) idcluster(newvarname) saving(filename)
double every(#) replace noisily]
runs the user-defined program progname reps(#) times on bootstrap samples of size
size(#).
The command
bs "command" "exp_list" [, bstrap_options nowarn noesample]
runs the user-specified command bootstrapping the statistics specified in exp_list.
The expressions in exp_list must be separated by spaces and there must be no spaces
within each expression. Note that command and exp_list must both be enclosed in
double quotes. This command takes the same options as bstrap except for args().
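As a hedged illustration of the bs syntax just described, the following sketch bootstraps the coefficient on x1 in a regression of y on x1 and x2 (placeholder names), with an arbitrary seed and 200 replications:
. set seed 12345
. bs "regress y x1 x2" "_b[x1]", reps(200)
Both the command and the expression are enclosed in double quotes, as the syntax requires.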
The command
bstat varlist [, stat(#) level(#) title(text)]
displays bootstrap estimates of standard error and bias, and calculates confidence
intervals using three different methods: normal approximation, percentile, and bias
corrected. The bstrap and bs commands automatically run bstat after completing
all the bootstrap replications. If the user specifies the saving(filename) option with
bstrap or bs, then bstat can be run on the data in filename to view the bootstrap
estimates again.
Finally, the command
bsample [exp] [, cluster(varnames) idcluster(newvarname)]
is a low-level utility for those who prefer not to use bstrap or bs. It draws a sample
with replacement from the existing data; the sample replaces the data in memory;
exp specifies the size of the sample and must be less than or equal to _N. If exp is
not specified, a sample of size _N is drawn (or of size n_c when the cluster() option is
specified, where n_c is the number of clusters).
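Because bsample replaces the data in memory, one possible pattern (a sketch, not the only way) is to wrap a single bootstrap draw between preserve and restore. Variable names are placeholders:
. preserve /* keep a copy of the original data */
. set seed 12345
. bsample /* resample _N observations with replacement */
. regress y x1 x2 /* estimates from one bootstrap sample */
. restore /* bring back the original data */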

5.3.2 MONTE CARLO SIMULATION


The simul command is aimed at easing the programming task of performing Monte
Carlo simulations. Its syntax is:
simul progname, reps(#) [args(whatever) dots double
saving(filename) every(#) replace noisily]
where progname is the name of a program that performs a single simulation, reps(#)
(not optional) specifies the number of replications to be performed, and
args(whatever) specifies any arguments to be passed to progname.
Typing "simul progname, reps(#)" runs progname # times and collects
the results.
6

Statistical Models in Stata


A broad range of statistical models may be estimated directly using the available
Stata commands. In this note, I focus on estimation of linear models (Section 6.1),
generalized linear models (Section 6.2) and parametric models for duration data
(Section 6.4.1), and only provide a brief description for a number of other models.

6.1 LINEAR MODELS


Stata offers several commands for estimating linear models.

6.1.1 ORDINARY LEAST SQUARES


The regress command estimates a linear model by least squares (ordinary least squares
or weighted least squares). Its syntax is:
regress yvar [xvars] [weight] [if exp] [in range] [, level(#)
beta robust cluster(varname) hc2 hc3 hascons noconstant
tsscons noheader eform(string) depname(varname) mse1 plus]
where beta requests that the normalized regression coefficients be reported instead
of confidence intervals, robust and cluster(varname) have been discussed in
Section 5.1.4, hc2 and hc3 specify alternative bias corrections for robust (they may
not be specified with cluster()), hascons indicates that a user-defined constant or its
equivalent is specified among the independent variables (some caution is recommended
when using this option as resulting estimates may not be as accurate as they otherwise
would be), noconstant suppresses the constant term (intercept) in the regression,
tsscons forces the total sum of squares to be computed as though the model has a
constant (i.e., as deviations from the mean of the dependent variable), and noheader,
eform(), depname(), mse1 and plus are for ado-file writers.
The syntax of predict following regress is:
predict [type] newvarname [if exp] [in range] [, statistic]
where, in addition to xb (the default) and stdp, statistic may be: pr(a,b) (Pr(a <
Y < b)), e(a,b) (E(Y | a < Y < b)), ystar(a,b) (E[max(a, min(Y, b))]), cooksd
(Cook’s distance), leverage|hat (diagonal elements of the hat matrix), residuals
(residuals), rstandard (standardized residuals), rstudent (Studentized or jackknifed
residuals), stdf (standard error of the forecast), stdr (standard error of the residual),
covratio (COVRATIO), dfbeta(varname) (DFBETA for varname), dfits (DFITS),
or welsch (Welsch distance).

In addition to predict, the following commands can be used after regress for
regression diagnostics, including sensitivity to individual observations and
specification checks:

• avplot (graphs an added-variable or leverage plot),

• cprplot (graphs a partial residual plot),

• lvr2plot (graphs a leverage vs. squared residual or L-R plot),

• rvfplot (graphs a residual-versus-fitted plot),

• rvpplot (graphs a residual-versus-predictor plot),

• ovtest (performs Ramsey’s RESET test for omitted variables),

• hettest (performs the Cook-Weisberg test for heteroskedasticity),

• dwstat (computes the Durbin-Watson test statistic),

• dfbeta (calculates the DFBETAs),

• vif (calculates the variance inflation factors).
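A sketch of a typical diagnostic session built from the commands and predict statistics listed above; y, x1, x2, lev and d are placeholder names:
. regress y x1 x2
. predict lev, leverage /* diagonal elements of the hat matrix */
. predict d, cooksd /* Cook's distance */
. rvfplot /* residual-versus-fitted plot */
. ovtest /* Ramsey's RESET test */
. hettest /* Cook-Weisberg test for heteroskedasticity */
. vif /* variance inflation factors */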

6.1.2 CONSTRAINED LINEAR REGRESSION


The cnsreg command estimates constrained linear regression models. Its syntax is:

cnsreg yvar xvars [weight] [if exp] [in range],
constraints(numlist) [level(#)]
where constraints(numlist) (not optional) specifies the numbers of the
constraints to be applied (see Section 5.1.3). Constraints are defined using the
constraint command.

6.1.3 LINEAR INSTRUMENTAL VARIABLES


The ivreg command estimates a linear regression model using instrumental variables
(or two-stage least squares) of yvar on xvars1 and xvars2 using ivars (along with
xvars1) as instruments for xvars2. The variables in xvars1 and ivars are the exogenous
variables, those in xvars2 are the endogenous variables.

The syntax of this command is:

ivreg yvar [xvars1] (xvars2=ivars) [weight] [if exp] [in range]
[, level(#) beta hascons noconstant robust cluster(varname)
first noheader eform(string) depname(varname) mse1]
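As a minimal sketch of this syntax, suppose x1 is exogenous, x2 is endogenous, and z1 and z2 are excluded instruments (all names are hypothetical):
. ivreg y x1 (x2 = z1 z2), first /* first also reports the first-stage regression */
Here x1, z1 and z2 together serve as instruments for x2.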

6.2 GENERALIZED LINEAR MODELS


Stata offers a single and very flexible command (glm) to estimate generalized linear
models (McCullagh & Nelder 1989). It also offers, for selected models in this class,
special commands (e.g. logit, probit, poisson) with a broader and more specific set
of options, especially diagnostics and other post-estimation output.

6.2.1 GLM
The glm command fits generalized linear models. Estimation is carried out either by
iteratively reweighted least squares (IRLS) or by using the Newton-Raphson (NR)
method, which is the default.
The basic syntax is:
glm yvar [xvars] [weight] [if exp] [in range] [, max_options
var_options output_options spec_options]
The max_options are:
iterate(#) ltolerance(#) mu(varname) nolog search fisher(#) irls
where iterate(#) specifies the maximum number of iterations allowed in estimating
the model (iterate(50) is the default), ltolerance(#) specifies the convergence
criterion for the change in deviance between iterations (ltolerance(1e-6) is the
default), mu(varname) specifies varname as the initial estimate for the mean of yvar,
nolog suppresses the iteration log, search specifies that the command should search
for good starting values, fisher(#) specifies the number of NR steps that should
use the Fisher scoring Hessian or expected information matrix before switching to the
observed information matrix (both search and fisher() are only useful with NR
optimization, not with IRLS), and irls requests IRLS minimization of the deviance
instead of NR maximization of the log-likelihood.
The var_options are:
oim opg vfactor(#) robust cluster(varname) unbiased
nwest(wtname [#]) jknife jknife1 bstrap brep(#)
scale(x2|dev|#) disp(#) score(newvar) t(varname)
where oim specifies that the variance matrix should be calculated using the observed
information matrix rather than the usual expected information matrix (option ignored
if irls is not specified), opg specifies that the variance matrix be calculated using
the Berndt, Hall, Hall, and Hausman (1974) variance estimator (this option is not
allowed when cluster() is specified), vfactor(#) specifies a scalar by which to
multiply the resulting variance matrix, robust and cluster(varname) have already
been defined, unbiased specifies that the unbiased sandwich estimate of variances
be used (robust is implied when unbiased is used), nwest(wtname [#]) specifies
that a heteroskedasticity and autocorrelation consistent variance estimate be used,
jknife and jknife1 specify that jackknife estimates of variance be used, bstrap
specifies that the bootstrap estimate of variance be used, brep(#) specifies the
number of bootstrap samples to consider in forming the bootstrap estimate (the
default is brep(199)), scale(x2|dev|#) overrides the default scale parameter (by

default, scale(1) is assumed for discrete distributions and scale(x2) for continuous
distributions), scale(x2) specifies the scale parameter be set to the Pearson chi-
squared (or generalized chi-squared) statistic divided by the residual degrees of
freedom, scale(dev) sets the scale parameter to the deviance divided by the
residual degrees of freedom (this provides an alternative to scale(x2) for continuous
distributions and over- or under-dispersed discrete distributions), scale(#) sets the
scale parameter to #, disp(#) multiplies the variance of yvar by # and divides
the deviance by #, score(newvar) creates the new variable newvar containing each
observation’s contribution to the score, and t(varname) specifies the variable name
corresponding to the time index (this option is required if nwest() is specified).
The output_options are:
eform level(#) trace noheader nodisplay nodots
where eform displays the exponentiated coefficients and corresponding standard errors
and confidence intervals (for binomial models with the logit link, exponentiation results
in odds ratios; for Poisson models with the log link, exponentiated coefficients are
rate ratios), trace requests that the estimated coefficient vector be printed at each
iteration, noheader suppresses the header information from the output (the coefficient
table is still printed), nodisplay suppresses the output (the iteration log is still
displayed), and nodots specifies that a dot should not be printed for each fitted model
when calculating jackknife or bootstrap estimates (by default, a single dot character
is printed for each estimation that is performed).
The spec_options are:
family(familyname) link(linkname) noconstant [ln]offset(varname)
where family(familyname) specifies the parametric family, link(linkname) specifies
the link function, noconstant specifies that the linear predictor has no intercept term,
and [ln]offset(varname) specifies an offset to be added to the linear predictor.
familyname is either a user-written program or one of: binomial (Bernoulli/binomial),
gamma, gaussian, igaussian (inverse Gaussian), nbinomial (negative binomial),
poisson.
linkname is either a user-written program or one of: cloglog (complementary log-log),
identity, log, logit, logc (log-complement), loglog (log-log), nbinomial (negative
binomial), opower # (odds power), power #, probit.
If family() is specified but not link(), then the default link for the family is
used (the canonical link, except for family(nbinomial), whose default is the log link), namely:

• link(identity) for family(gaussian) (same as regress),

• link(power -2) for family(igaussian),

• link(logit) for family(binomial) (same as logit),

• link(log) for family(poisson) (same as poisson),

• link(log) for family(nbinomial) (same as nbreg),



• link(power -1) for family(gamma).
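The correspondence between glm and the special-purpose commands can be sketched as follows, assuming a binary response y and covariates x1 and x2 (placeholder names; mu_hat is likewise illustrative):
. glm y x1 x2, family(binomial) link(logit) /* same model as logit y x1 x2 */
. predict mu_hat, mu /* predicted mean of the response */
. glm y x1 x2, family(binomial) link(probit) /* a non-default link: the probit model */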

The syntax of predict after glm is:


predict [type] newvarname [if exp] [in range] [, statistic nooffset
standardized studentized modified adjusted]
where, in addition to xb and stdp, statistic may be: mu (predicted mean of the response,
the default), eta (same as the xb option), cooksd (Cook’s distance), deviance
(deviance residual), hat (diagonal of the hat matrix), likelihood (likelihood residual),
pearson (Pearson residual), response (response residual), score (score residual), or
working (working residual).

6.2.2 LOGIT AND PROBIT


The logit command estimates logit models by maximum-likelihood (ML). Its syntax
is:
logit yvar [xvars] [weight] [if exp] [in range] [, level(#)
nocoef noconstant or robust cluster(varname) score(newvar)
offset(varname) asis max_options]
where yvar==0 indicates a negative outcome, and yvar~=0 & yvar~=. (typically yvar==1)
indicates a positive outcome.
The logistic command is just the same as logit, except that it reports odds ratios
rather than coefficients by default.
The following commands can be used after either logit or logistic to explore the
nature of the fit: lfit (performs goodness-of-fit tests), lstat (reports summary
statistics including classification table), lroc (graphs the ROC curve), and lsens
(graphs sensitivity and specificity versus probability cutoff).
The syntax of predict following logit or logistic is:
predict [type] newvarname [if exp] [in range] [, statistic rules
asif nooffset]
where, in addition to xb and stdp, statistic may be: p (predicted probability of
a positive outcome, the default), dbeta (Delta-Beta influence statistic, Pregibon
1981), deviance (deviance residual), dx2 (Delta chi-squared influence statistic, Hosmer &
Lemeshow 1989), ddeviance (Delta-D influence statistic, Hosmer & Lemeshow 1989),
hat (leverage, Pregibon 1981), number (sequential number of the covariate pattern),
or residuals (Pearson residual).
The probit command is completely analogous and estimates probit models by ML.
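A short sketch, assuming a binary response y and covariates x1 and x2 (placeholder names), showing how coefficients, odds ratios and predicted probabilities relate:
. logit y x1 x2
. display exp(_b[x1]) /* odds ratio for x1, computed from the coefficient */
. logit y x1 x2, or /* same fit, reported as odds ratios */
. predict phat /* predicted probability of a positive outcome (the default) */
. lstat /* summary statistics including the classification table */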

6.2.3 POISSON AND NBREG


The command poisson produces ML estimates of the Poisson regression model. Its
syntax is:
poisson yvar [xvars] [weight] [if exp] [in range] [, irr level(#)
exposure(varname) offset(varname) robust cluster(varname)
score(newvarname) noconstant constraints(numlist) nolog
max_options]
where yvar is a nonnegative count variable and irr reports estimated coefficients
transformed to incidence rate ratios.
The syntax of predict after poisson is:
predict [type] newvar [if exp] [in range] [, statistic nooffset]
where, in addition to xb and stdp, statistic may be n (predicted number of events,
the default), or ir (incidence rate, equivalent to predict ..., n nooffset).
The command nbreg produces ML estimates of the negative binomial regression model
(Poisson regression with overdispersion). Its syntax is:
nbreg yvar [xvars] [weight] [if exp] [in range] [,
dispersion(mean|constant) level(#) irr exposure(varname)
offset(varname) robust cluster(varname) score(newvarnames)
noconstant constraints(numlist) nolrtest nolog max_options]
where yvar is a nonnegative count variable.
Two different parameterizations of the negative binomial model may be estimated.
The default, also given by the option dispersion(mean), has dispersion for the ith
observation equal to 1 + α exp(β′Xi + offset), that is, the dispersion is a function
of the expected mean of the counts for the ith observation: exp(β′Xi + offset).
The alternative parameterization, given by the option dispersion(constant), has
constant dispersion for all observations equal to 1 + δ.
For the default model, α = 0 (or ln(α) = −∞) corresponds to unit dispersion, and
thus it is simply a Poisson model. For the alternative parameterization, δ = 0 (or
ln(δ) = −∞) corresponds to unit dispersion, and is again simply a Poisson model.
The syntax of predict after nbreg is the same as after poisson.
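The following sketch fits the Poisson model and both negative binomial parameterizations to the same data. The count response y, the covariates x1 and x2, and the exposure variable n are all hypothetical names:
. poisson y x1 x2, exposure(n) irr /* coefficients reported as incidence rate ratios */
. predict nhat /* predicted number of events (the default) */
. nbreg y x1 x2, exposure(n) /* default parameterization, dispersion(mean) */
. nbreg y x1 x2, exposure(n) dispersion(constant) /* constant dispersion 1 + δ */
Comparing the nbreg output with the Poisson fit gives an informal check on whether overdispersion matters for the data at hand.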

6.3 OTHER LIMITED DEPENDENT VARIABLES MODELS


6.3.1 GROUPED BINARY RESPONSES
The blogit and bprobit commands produce ML estimates of the logit and probit
models for grouped data. The glogit and gprobit commands produce weighted
least-squares (minimum chi-square) estimates.

6.3.2 ORDERED CATEGORICAL RESPONSES


The ologit and oprobit commands estimate ordered logit and probit models of
ordinal variable depvar on the covariates. The actual values taken on by the response
variable are irrelevant except that larger values are assumed to correspond to “higher”
outcomes. Up to 50 outcomes are allowed.

6.3.3 NESTED LOGIT


The command nlogit estimates a nested logit model by ML. The model may contain
one or more levels. For a single-level model, nlogit estimates the same model as
clogit.

6.3.4 MULTINOMIAL LOGIT


The mlogit command estimates multinomial logit models by ML. Constraints may
be defined to perform constrained estimation.

6.3.5 BIPROBIT
The biprobit command produces ML estimates of two-equation probit models, either
a bivariate probit or a system of two seemingly unrelated probit equations.

6.3.6 CENSORED AND TRUNCATED REGRESSION


The tobit command produces ML estimates of censored Gaussian regression models
with a fixed censoring point. The cnreg command estimates the same class of models but allows
the censoring points to vary across observations.
The intreg command estimates a censored Gaussian regression model in which the
response Y is described by two variables (Y1, Y2) holding the lower and upper
endpoints, so that it can be point data, interval data (e.g. a ≤ Y ≤ b),
left-censored (e.g. Y ≤ b), or right-censored (e.g. Y ≥ a).
The heckman command estimates Gaussian linear models with sample selection
using either Heckman’s two-step estimator (Heckman 1976) or full ML. The related
command heckprob estimates probit models with sample selection by ML.
The truncreg command produces ML estimates of a truncated Gaussian regression
model.

6.4 DURATION DATA


6.4.1 PARAMETRIC DURATION MODELS
The ereg and weibull commands produce ML estimates respectively of the
exponential and Weibull (survival time) models, with and without gamma-distributed
or inverse Gaussian unobserved heterogeneity (frailty). Their syntax is:
{ereg[het]|weibull[het]} yvar [xvars] [weight] [if exp] [in range]
[, hazard hr tr dead(varname) t0(varname)
frailty(gamma|invgaussian) ancillary(varlist) strata(varname)
robust cluster(varname) score(newvars) constraints(numlist)
level(#) nocoef noheader nolog maximize_options]
The syntax of predict following ereg and weibull is:
predict [type] newvarname [if exp] [in range] [, statistic]
where, in addition to xb and stdp, statistic may be: median time (predicted median
survival time, the default), median lntime (predicted median log survival time),
mean time (predicted mean survival time), mean lntime (predicted mean log survival
time), hazard (predicted hazard), hr (predicted hazard ratio), surv (predicted survival
probability), csnell (partial Cox-Snell residuals), or mgale (partial martingale-like
residuals).

6.4.2 COX PROPORTIONAL HAZARD MODEL


The cox command estimates proportional hazards models by maximum (partial) likelihood. The covariates may
be either fixed or time-varying (fixed within intervals). The procedure allows for left
truncation (delayed entry), as well as gaps and right censoring. The failure event may
be unique or recurring.
A simplified version of cox is stcox.

6.5 TIME SERIES


The tsset timevar command declares the data to be a time series and designates
that variable timevar (which must take on integer values) represents time. The tsset
command must be used before time-series operators may be used. After tsset, the
data will be sorted on timevar.
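For example, assuming an integer-valued time variable called year and series y and x (all hypothetical names), a sketch of declaring the data and using time-series operators is:
. tsset year /* year is an assumed integer-valued time variable */
. generate dy = D.y /* first difference of y */
. regress y L.y x /* regression of y on its first lag and on x */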

6.5.1 LINEAR MODELS WITH AUTOCORRELATED ERRORS


The prais command estimates a linear model with first-order autoregressive errors
using the Prais-Winsten transformed regression estimator, the Cochrane-Orcutt
transformed regression estimator, or a version of the Hildreth-Lu search method.
The newey command produces estimated standard errors for the OLS coefficients of
linear regression models with heteroskedastic and possibly autocorrelated errors (see
Newey & West 1987).

6.5.2 ARIMA MODELS


The arima command estimates a linear model with autoregressive moving-average
(ARMA) errors. The response variable and the covariates may be differenced or
seasonally differenced to any degree. When no covariate is specified, the command
estimates autoregressive integrated moving-average (ARIMA) models for the response
variable. Missing data are allowed and are handled using the Kalman filter.

6.5.3 ARCH-TYPE MODELS


The arch command estimates models with autoregressive conditional heteroskedasticity
(ARCH) using conditional ML. In addition to ARCH terms, models may include
multiplicative heteroskedasticity. Concerning the regression equation itself, models
may also contain ARCH-in-mean and/or ARMA terms.
The following options may be used to estimate a number of models in the ARCH
family:

• arch() (ARCH),

• arch() garch() (GARCH),

• archm arch() [garch()] (ARCH-in-mean),

• arch() garch() ar() ma() (GARCH with ARMA terms),

• earch() egarch() (EGARCH),

• abarch() atarch() sdgarch() (TARCH, threshold ARCH),

• arch() tarch() [garch()] (GJR, form of threshold ARCH),

• arch() saarch() [garch()] (SAARCH, simple asymmetric ARCH),

• parch() [pgarch()] (PARCH, power ARCH),

• narch() [garch()] (NARCH, nonlinear ARCH),

• narchk() [garch()] (NARCHK, NARCH with a single shift),

• aparch() [pgarch()] (A-PARCH, asymmetric power ARCH),

• nparch() [pgarch()] (NPARCH, nonlinear power ARCH).
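As a sketch of the first two entries in this list, assume the data have been declared as a time series with a hypothetical time variable t, and that y and x are placeholder series; the lag orders are chosen only for illustration:
. tsset t
. arch y x, arch(1) /* ARCH(1) errors */
. arch y x, arch(1) garch(1) /* GARCH(1,1) errors */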

6.6 PANEL DATA


The xt series of commands provides tools for analyzing longitudinal (panel) data.
Each observation in a longitudinal dataset is indexed by a unit-specific index i
and a time-specific index t. The iis command or the i() option set the name of the
variable corresponding to index i, while the tis command or the t() option set the
name of the variable corresponding to index t.

Some of the xt commands use time-series operators in their internal calculations and
thus require the data to be tsset.

6.6.1 LINEAR PANEL DATA MODELS


The xtreg command estimates linear panel data models. This command can estimate
fixed-effects (within-group), between-group, and random-effects models as well as
population-averaged models. Which estimator is used is determined by the following
options:

• be (between-group estimator),

• fe (fixed-effects estimator),

• re (GLS random-effects estimator),

• pa (GEE population-averaged estimator),



• mle (Gaussian ML random-effects estimator).


If no option is specified, re is assumed.
The xttest0 and xthaus commands, used after xtreg, re, perform, respectively, the
Breusch and Pagan (1980) Lagrange multiplier test for random effects and the
Hausman (1978) specification test.
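A sketch of a typical sequence, assuming a unit identifier called id and variables y, x1 and x2 (all hypothetical names), with the unit index supplied through the i() option described above:
. xtreg y x1 x2, i(id) fe /* fixed-effects (within-group) estimator */
. xtreg y x1 x2, i(id) re /* GLS random-effects estimator */
. xttest0 /* Breusch-Pagan LM test for random effects */
. xthaus /* Hausman specification test */
Both tests are issued after the random-effects fit, as stated above.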

6.6.2 DYNAMIC PANEL DATA MODELS


The xtabond command estimates dynamic panel data models using Arellano and Bond
(1989) one-step, one-step robust or two-step estimators. The command can be used
with exogenously unbalanced panels and handles embedded gaps in the time series as
well as opening and closing gaps.

6.6.3 SEEMINGLY UNRELATED REGRESSION EQUATIONS


The sureg command estimates a system of seemingly unrelated linear regression
equations by feasible generalized least-squares.
This command is not part of the xt series.

6.6.4 GEE FOR PANEL DATA


The xtgee command generalizes the glm command to panel data. It is very flexible and
allows estimation of generalized linear models for panel data (see Liang & Zeger 1986)
with different choices of parametric family, link function and within-group correlation
structures.
The allowed distribution families are the same as for glm, namely Bernoulli/binomial,
gamma, Gaussian (normal), inverse Gaussian, negative binomial, Poisson and user-
supplied.
The allowed link functions are also the same, namely complementary log-log, identity,
log, log-complement, log-log, logit, negative binomial, odds power, power, probit and
user-supplied.
The allowed within-group correlation structures include independence, equicorrelation,
kth order autocorrelation, kth order moving average, unstructured (arbitrary non-
stationary) and user-supplied.

6.6.5 LOGIT AND PROBIT FOR PANEL DATA


The xtlogit command estimates a fixed-effects (fe), a random-effects (re), or a
population-averaged (pa) logit model for panel data.
The xtprobit command estimates random-effects (re) and population-averaged (pa)
probit models for panel data. The integrals in the individual terms of the log-likelihood
of the random-effects model are computed using Gauss-Hermite quadrature. After
estimating the model, the quality of the Gauss-Hermite quadrature approximation
may be checked using the quadchk command.

Notice that the xtlogit, fe command is equivalent to the clogit command. Also
notice that xtlogit, pa corresponds to
xtgee, family(binomial) link(logit) corr(exchangeable)
whereas xtprobit, pa corresponds to
xtgee, family(binomial) link(probit) corr(exchangeable)

6.6.6 POISSON AND NEGATIVE BINOMIAL MODELS


The xtpois command estimates fixed-effects (fe), random-effects (re), or population-
averaged (pa) Poisson models for panel data. For the re option, either a gamma (the
default) or a normal (Gaussian) distributed random-effects model is estimated.
As for xtprobit, re, the integrals in the individual terms of the log-likelihood of the
Gaussian random-effects model are computed using Gauss-Hermite quadrature. After
estimating the model, the quality of the Gauss-Hermite quadrature approximation
may be checked using the quadchk command.
The xtnbreg command estimates fixed-effects (fe), random-effects (re), or
population-averaged (pa) negative binomial models for panel data.
For both xtpois and xtnbreg, the population-averaged model assumes equicorrelation
as default (that is, corr(exchangeable)).

6.7 NONPARAMETRIC ESTIMATION


6.7.1 DENSITY ESTIMATION
The kdensity command produces kernel density estimates and graphs the result.
The available options for the kernel function are: biweight, cosine, epan
(Epanechnikov, the default), gauss (Gaussian), parzen, rectangle (uniform),
triangle.
The width(#) option specifies the halfwidth of the kernel. If width() is not specified,
then Stata uses the asymptotically optimal width for Gaussian data and a Gaussian
kernel.
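For example, using the kernel and width options named above with a placeholder variable y (the halfwidth 0.5 is an arbitrary value chosen for illustration):
. kdensity y /* Epanechnikov kernel, default width */
. kdensity y, gauss width(0.5) /* Gaussian kernel, halfwidth 0.5 */
Comparing the two graphs gives a quick impression of how sensitive the estimate is to the choice of kernel and width.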

6.7.2 REGRESSION SMOOTHERS


The ksm command carries out unweighted and locally weighted smoothing of a
response variable yvar on a single covariate xvar, displays the graph, and optionally
saves the smoothed variable. Among the command’s capabilities are lowess (robust
locally weighted regression, Cleveland 1979).
The smooth command applies resistant, nonlinear smoothers (running medians) to
a single variable varname and stores the new series in newvar. Missing values at the
beginning or end of the range of varname are ignored, but missing values in the middle
of the series are not allowed.

6.8 ROBUST AND QUANTILE REGRESSION


6.8.1 ROBUST REGRESSION
The rreg command estimates a linear model by iteratively reweighted least squares,
using Huber weights in the initial iterations and biweights thereafter.
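To see what the robust weights do, one can contaminate simulated data with a few gross outliers and compare rreg with ordinary least squares. This is a sketch under assumed names (x, y, w) and an arbitrary contamination scheme.

```stata
* Hypothetical data: a linear model with five contaminated observations
clear
set obs 200
set seed 97531
generate x = invnorm(uniform())
generate y = 1 + 2*x + invnorm(uniform())
replace y = y + 20 in 1/5

* Ordinary least squares, for comparison
regress y x

* Robust regression, saving the final weights in w
rreg y x, genwt(w)

* Observations with small final weights were downweighted by rreg
summarize w, detail
```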

6.8.2 QUANTILE REGRESSION


The qreg command estimates quantile (including median) regression models.
The iqreg command estimates interquantile regressions (with a limit of 336
covariates). The estimated variance matrix of the estimators is obtained by bootstrap.
The sqreg command estimates simultaneous-quantile regression and produces the
same coefficients as qreg for each quantile. The estimated variance matrix of the
estimators is obtained by bootstrap and includes between-quantiles blocks. Thus, one
can test and construct confidence intervals comparing coefficients describing different
quantiles. This command has a limit of 336/q covariates, where q is the number of
quantiles specified.
The bsqreg command is equivalent to sqreg with a single quantile. Although not as
fast, it is not limited to 336 coefficients.
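The four commands can be compared on simulated data in which the error spread depends on a covariate, so that slopes differ across quantiles. The variable names (x1, x2, y), the quantiles, and the number of bootstrap replications are hypothetical choices, and the option names follow the Stata release described in this text.

```stata
* Hypothetical data: error spread increases with x2
clear
set obs 300
set seed 2001
generate x1 = invnorm(uniform())
generate x2 = uniform()
generate y = 1 + 2*x1 - x2 + (1 + x2)*invnorm(uniform())

* Median regression (the default quantile is 0.5)
qreg y x1 x2

* First-quartile regression
qreg y x1 x2, quantile(0.25)

* Interquartile-range regression with bootstrap standard errors
iqreg y x1 x2, quantiles(0.25 0.75) reps(100)

* Simultaneous estimation at three quantiles
sqreg y x1 x2, quantiles(0.25 0.5 0.75) reps(100)

* Test equality of the x2 coefficient at the 25th and 75th percentiles
test [q25]x2 = [q75]x2

* Single-quantile regression with bootstrap standard errors
bsqreg y x1 x2, quantile(0.5) reps(100)
```

The test after sqreg is possible because its estimated variance matrix includes the between-quantiles blocks.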

6.9 GENERAL NONLINEAR METHODS


The nl command fits an arbitrary nonlinear function to a response variable yvar by
least squares. The function must be provided in a separate program.
The ml series of commands allows estimation of an arbitrary model by ML. See Gould
and Sribney (1999) for details.

References

Berndt E., Hall B., Hall R., and Hausman J.A. (1974) Estimation and Inference in Nonlinear
Structural Models. Annals of Economic and Social Measurement, 3/4: 653–665.
Breusch T.V. and Pagan A.R. (1980) The Lagrange Multiplier Test and Its Applications to
Model Specification in Econometrics. Review of Economic Studies, 47: 239–254.
Cleveland W.S. (1979) Robust Locally Weighted Regression and Smoothing Scatterplots.
Journal of the American Statistical Association, 74: 829–836.
Gould W. and Sribney W. (1999) Maximum Likelihood Estimation with Stata, Stata
Corporation, College Station, TX.
Hausman J.A. (1978) Specification Tests in Econometrics. Econometrica, 46: 1251–1272.
Heckman J.J. (1976) The Common Structure of Statistical Models of Truncation, Sample
Selection and Limited Dependent Variables and a Simple Estimator for Such Models.
Annals of Economic and Social Measurement, 5: 475–492.
Hosmer D.W. and Lemeshow S. (1989) Applied Logistic Regression, Wiley, New York.
Liang K.Y. and Zeger S.L. (1986) Longitudinal Data Analysis Using Generalized Linear
Models. Biometrika, 73: 13–22.
McCullagh P. and Nelder J.A. (1989) Generalized Linear Models (2nd ed.), Chapman and
Hall, London.
Newey W.K. and West K. (1987) A Simple, Positive Semi-Definite, Heteroskedasticity and
Autocorrelation Consistent Covariance Matrix. Econometrica, 55: 703–708.
Peracchi F. (2001) Econometrics, Wiley, Chichester, UK.
Pregibon D. (1981) Logistic Regression Diagnostics. Annals of Statistics, 9: 705–724.
