An Introduction to Stata
F. PERACCHI
Faculty of Economics, Tor Vergata University, Rome, Italy
Contents
Introduction
1 Getting Started
1.1 STARTING AND STOPPING STATA
1.1.1 THE STATA WINDOWS
1.1.2 THE STATA TOOLBAR
1.1.3 ALLOCATING MEMORY TO STATA
1.2 STATA DOCUMENTATION AND UPDATES
1.2.1 THE HELP SYSTEM
1.2.2 THE REFERENCE MANUAL
1.2.3 THE STATA TECHNICAL BULLETIN
1.2.4 TUTORIALS
1.2.5 STATA UPDATES
1.3 VARIABLES AND OBSERVATIONS
1.3.1 VARIABLES
1.3.2 OBSERVATIONS
1.4 INPUTTING DATA
1.4.1 DIRECT TYPING
1.4.2 THE DATA EDITOR
1.4.3 LOADING AN ASCII (TEXT) DATA FILE
1.4.4 LOADING A STATA DATA FILE
1.5 BASIC DATA MANIPULATION
1.5.1 DISPLAYING DATA
1.5.2 LABELING DATA
1.5.3 SUMMARIZING DATA
1.5.4 CREATING NEW VARIABLES
1.5.5 CHANGING AND RENAMING VARIABLES
1.5.6 ELIMINATING VARIABLES OR OBSERVATIONS
1.5.7 INCREASING THE NUMBER OF OBSERVATIONS IN A DATASET
1.6 OUTPUTTING DATA
1.7 LOG FILES
2 Stata Commands
2.1 GENERAL SYNTAX
2.1.1 BY
2.1.2 WEIGHTS
2.1.3 IF AND IN
2.1.4 QUIETLY AND NOISILY
2.2 BASIC DATA COMMANDS
2.2.1 DESCRIBE
2.2.2 LIST
2.2.3 DROP AND KEEP
2.2.4 GENERATE AND EGEN
2.2.5 REPLACE
2.2.6 SORT AND GSORT
2.3 COMBINING DATA
2.3.1 APPEND
2.3.2 MERGE
2.4 RESHAPING DATA
2.4.1 COLLAPSE
2.4.2 CONTRACT
2.4.3 EXPAND
2.4.4 FILLIN
2.4.5 RESHAPE
2.5 BASIC SAMPLE STATISTICS
2.5.1 COUNT
2.5.2 SUMMARIZE
2.5.3 MEANS
2.5.4 CENTILE
2.5.5 CUMUL
2.5.6 CORRELATE
2.5.7 REGRESS
2.6 TABLES
2.6.1 TABLE
2.6.2 TABULATE
2.6.3 TABSUM
3 Graphics
3.1 BASIC SYNTAX AND GRAPHIC STYLES
3.2 COMMON GRAPH OPTIONS
3.3 HISTOGRAMS
3.4 TWO-WAY SCATTERPLOTS
3.5 TWO-WAY SCATTERPLOT MATRICES
3.6 BOX PLOTS
References
Introduction
Why use Stata? In my view, it has three main advantages over other statistical
packages.
The first is portability: Stata runs on several platforms (Macintosh, Unix, Windows),
and Stata programs written for one of them run with (almost) no change on any other
one. The latest release is Stata 7.0. In what follows I focus on Stata 7.0 for Windows
98/95/NT.
The second is speed: Stata is fast because all data manipulations are carried out in
the RAM. The only limit is the amount of RAM available. With 100MB of RAM, one
can work with a dataset containing 5 million observations on 4 real-valued variables
or, equivalently, with one million observations on 20 real-valued variables.
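As a rough check of these figures, and assuming that every variable is stored as a 4-byte float (my assumption here; storage types are discussed in Section 1.3.1), both layouts contain 20 million values. The raw data size in bytes can be computed with display:

```stata
* Raw data size in bytes, assuming 4-byte float storage for every variable
display 5000000*4*4
display 1000000*20*4
```

Both expressions give the same number of bytes, which is why the two datasets are equivalent from the point of view of memory.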
Getting Started
This chapter introduces the main aspects of Stata, namely how to start and stop the
program (Section 1.1), where to look for documentation and updates (Section 1.2), the
definition of variables and observations (Section 1.3), how to input data (Section 1.4),
basic data manipulation (Section 1.5), how to output data (Section 1.6), and how to
open and close log files (Section 1.7).
I adopt the following typographic conventions: the typewriter-style typeface is used
for Stata commands or options that have to be typed in (e.g. describe or generate),
italics is used for things that must be substituted for by some other word (e.g. varname
or varlist), small caps is used for keyboard keys (e.g. Enter or Ctrl+Break) and
boldface is used for Windows commands or switches (e.g. Exit or Help).
The Stata Command window follows standard Windows editing style. The keys for
editing in the Command window are Delete, Backspace, Esc, Home, End, Page
Up and Page Down. One can copy one line at a time from the Results window into
the clipboard and paste it into the Command window. Clicking once on a command in the
Review window copies it to the Command window, where it can be edited
before being entered.
4. Begin Log (starts a new log, appends to an existing log, and stops or suspends
the current log),
6. Bring Dialog Window to Front (brings the Dialog window to the front of the
other Stata windows),
7. Bring Results Window to Front (brings the Results window to the front of
the other Stata windows),
8. Bring Graph Window to Front (brings the Graph window to the front of the
other Stata windows),
9. Do-file Editor (opens the Do-file editor or brings the Do-file Editor window
to the front of the other Stata windows),
10. Data Editor (opens the data editor or brings the Data Editor window to the
front of the other Stata windows),
11. Data Browser (opens the data browser or brings the Data Browser window
to the front of the other Stata windows),
12. Clear -more- Condition (tells Stata to continue when it has paused in the
middle of a long output),
the Stata icon, pull down File and choose Properties, click on the Shortcut tab
and put k# (in kilobytes) or m# (in megabytes) after the call to wstata.exe in the
Target line.
The Start in line specifies the initial working directory. This may be changed
by editing the line or by using the cd drive:/directory_name command from the
Command line.
Memory allocation may also be changed within a given Stata session (although not
permanently) by using the set memory #k command (in kilobytes) or the set memory
#m command (in megabytes). This command requires that no data be present in
memory.
If more memory is used than physically available on the computer, Stata slows down.
In this case, it is recommended to set virtual memory on by typing
. set virtual on
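Putting these settings together, a session under the Stata 7 setup described here might begin as in the following sketch. The value 64m is an arbitrary illustration, not a recommendation:

```stata
* Memory can only be reallocated when no data are in memory
clear
* Request 64 megabytes for the current session (illustrative value)
set memory 64m
* Only needed if the request exceeds the physical RAM of the machine
set virtual on
```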
1.2.4 TUTORIALS
Stata provides tutorials on a variety of aspects: introduction to Stata (intro.tut),
data input, graphics, tables, and procedures for statistical modeling.
To run a tutorial, type tutorial tutname, where tutname is any of the following:
• anova (estimating one-, two- and N-way ANOVA and ANCOVA models),
1.3.1 VARIABLES
Variables come in two types: alphabetic (strings) or numeric (real or integer valued).
Variables are called by their name, which must be 1 to 32 characters long. The first
character must be a letter or an underscore, the other characters can be letters, digits
or underscores (spaces or other characters are not allowed). It is better to avoid using
the name e for variables and beginning variable names with an underscore (all Stata
built-in variables begin with an underscore). Stata is case sensitive (xx, xX, Xx and XX
are all different names). There are a few reserved names that cannot be used: e.g. if,
in, int, with.
Variables can be renamed using the rename command. For example: rename x y
renames the variable x as y.
Associated with each type of variable is a storage type. The available storage types
are:
• String variables: str#, where # is an integer between 1 and 80 specifying
the number of characters in the string. The maximum length of a string is 80
characters.
A double occupies twice as much space as a float, a long occupies the same space
as a float, an int occupies half the space of a float, and a byte occupies half the
space of an int. Thus, a byte occupies 1/4 of the space of a float and 1/8 of the
space of a double, which makes it possible to store categorical and indicator variables very
efficiently.
The default for storing numeric variables is float, but Stata performs all internal
calculations in double.
The compress command may be used to automatically optimize the storage type of
the data in memory.
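The effect of compress can be seen with a small sketch. The dataset and the variable name flag are invented for illustration: a 0/1 indicator is deliberately created as a double, and the second describe should show it stored as a byte.

```stata
clear
set obs 100
* A 0/1 indicator deliberately stored in the largest numeric type
generate double flag = mod(_n, 2)
describe flag
* compress picks the smallest storage type that loses no information
compress
describe flag
```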
1.3.2 OBSERVATIONS
Observations correspond to the rows of a data matrix. Stata automatically creates and
updates the built-in system variable _n, which is a counter containing the number of
the current observation, and the built-in system variable _N, which contains the total number
of observations in the dataset.
Notice that when the sorting of the data changes, so does the counter _n.
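The following sketch, on a made-up dataset of five observations, illustrates the point: the value of _n is saved before and after sorting, and the two copies differ because the observations have moved.

```stata
clear
set obs 5
* x runs from 5 down to 1, so sorting on x reverses the order of the data
generate x = 6 - _n
generate before = _n
sort x
generate after = _n
list x before after
* _N is the total number of observations
display _N
```

If the original order of the data matters, it is therefore safer to store _n in a variable before sorting.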
where lblname is the name of a value label defined through the command
label define lblname # "string" [# "string" ...]
Example:
label define sexlbl 0 "male" 1 "female"
label value sex sexlbl
The label dir command lists the names of value labels stored in memory, label
drop lblnames eliminates the value labels lblnames, label drop _all eliminates all
value labels, whereas label list lists the names and contents of value labels stored
in memory.
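As an illustration, the commands above can be tried on a small artificial dataset (the variable sex is generated here only for the purpose of the example):

```stata
clear
set obs 4
* An artificial 0/1 variable
generate byte sex = mod(_n, 2)
label define sexlbl 0 "male" 1 "female"
label values sex sexlbl
* Names of the value labels in memory, then the contents of sexlbl
label dir
label list sexlbl
* list shows the labels by default, and the numeric codes with nolabel
list sex
list sex, nolabel
```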
Data files are labeled by using the command
label data string
where string is up to 80 characters long. Data labels are displayed when the data are
used or described. If no label is specified, any existing label is removed.
• relational operators: < (less than), > (greater than), <= (less than or equal), >=
(greater than or equal), == (equal), ~= (not equal);
A variety of date and time-series functions are also available, as well as matrix functions
returning scalars (see Section 4.4.3).
varname contains numbers that merely happen to be stored as strings (e.g. the number
‘1,314’). In this case use instead
generate newvar = real(varname)
The decode command creates a new string variable named newvar based on the
“encoded” numeric variable varname and its value label.
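A minimal sketch of both conversions follows. The data are invented, and I assume the usual form decode varname, generate(newvar) for the decode command:

```stata
clear
set obs 3
* A string variable whose contents are really numbers
generate str2 s = "10"
replace s = "25" in 2
replace s = "7" in 3
generate n = real(s)
* A labelled numeric variable turned back into a string variable
generate byte sex = mod(_n, 2)
label define sexlbl 0 "male" 1 "female"
label values sex sexlbl
decode sex, generate(sexstr)
describe
list
```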
Stata Commands
In this chapter I describe the syntax of some frequently used Stata commands. My
selection is of course subjective.
2.1.1 BY
Most Stata commands allow the by varlist: prefix. This causes command to be
repeated for each subset of the data for which the values of the variables in varlist are
equal. The use of by requires the data to be sorted by varlist beforehand.
Example:
. sort x
. by x: summarize y
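A self-contained version of this example, on an artificial dataset with two groups, could look as follows. It also shows that, under by, the counters _n and _N refer to the current by-group rather than to the whole dataset:

```stata
clear
set obs 6
* Two groups of three observations each
generate x = 1 + (_n > 3)
generate y = _n
sort x
by x: summarize y
* Flag the first observation of each group
by x: generate first = (_n == 1)
list
```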
Not all commands allow the by varlist: prefix. Some replace it with by(groupvar) in
the options. For example, the syntax of the ttest command is:
2.1.2 WEIGHTS
The option weight indicates the weight to be attached to each observation. The syntax
of weight is [weightword = exp], where weightword is either weight (the default
treatment of weights) or one of fweight, pweight, aweight and iweight, which
correspond to the four kinds of weights that Stata understands (although not every
command supports all four of them):
The default treatment (weight) is each command’s idea of what the “natural” weights
are and is one of the above weight types.
2.1.3 IF AND IN
The if exp qualifier restricts the scope of the command to those observations for
which the value of the expression is true. For example:
. replace y = x+2 if x>0
The in range qualifier restricts the scope of the command to a specific observation
range, where range is any of #, #/#, #/l or f/#.
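For example, to summarize the values of x and y for the first 10 observations:
. summarize x y in 1/10
. summarize x y in f/10
By contrast, a single number selects just that observation, so the following
summarizes observation 10 only:
. summarize x y in 10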
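The two qualifiers can be tried on an artificial dataset of 20 observations (all names and values below are invented for illustration):

```stata
clear
set obs 20
generate x = _n - 10
generate y = 0
* Change y only where the condition holds
replace y = x + 2 if x > 0
* Observation 10 only, the first 10 observations, and the last 5
summarize x y in 10
summarize x y in 1/10
summarize x y in -5/l
* if and in may be combined
list x y if x > 0 in 1/15
```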
2.2.1 DESCRIBE
This command displays a summary of the contents of either the data in memory or
the data stored in a Stata-format dataset. Its syntax is
describe [varlist] [, short detail fullnames numbers]
in the first case, and
describe using filename [, short detail]
in the second case, where short suppresses the specific information about each
variable, detail includes more detailed information (the width of a single observation,
the maximum number of observations holding the number of variables constant, the
maximum number of variables holding the numbers of observations constant, the
maximum width for an observation, and the maximum size of the dataset), fullnames
displays the full names of the variables (the default is to present an abbreviation when
the variable name is longer than 15 characters), and numbers presents the variable
number along with the variable name. The numbers and fullnames options may not
both be specified together.
2.2.2 LIST
This command displays the values of variables. Its syntax is:
list [varlist] [if exp] [in range] [, [no]display nolabel noobs]
where [no]display forces the format into display or tabular (nodisplay) format (if one
of these two options is not specified, then Stata chooses one based on its judgment),
nolabel causes the numeric codes rather than label values to be displayed, and noobs
suppresses printing of the observation numbers.
Examples:
. list in 1/10
. list x y
. list x y in 1/10
. list if x>20
. list x y if z>20
. list x y z if z>20 in 1/10
drop if exp
drop in range [if exp]
The keep command works exactly the same as drop except that one specifies the
variables or observations to be kept.
Examples:
. drop in 1/33
. keep in 34/l (drop first 33 observations)
. drop in -10/l (drop last 10 observations)
. drop if x<21
. keep if x>=21
. sort y
. by y: keep if _n==_N
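The last two lines keep the final observation within each group of y. A runnable sketch on made-up data is given below; I sort on both y and x so that the order within each group, and hence the observation that survives, is well defined:

```stata
clear
set obs 10
generate y = 1 + (_n > 4)
generate x = _n
* Sort by group, and by x within group
sort y x
* Keep the observation with the largest x in each group
by y: keep if _n == _N
list
```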
• rmax(varlist) gives the maximum value in varlist for each observation (row).
It may not be combined with by. The same syntax holds for a number of
other functions with argument varlist, such as rmean (row mean), rmin (row
minimum), rmiss (row number of missing values).
Examples:
. egen avgx = mean(x)
. gen dev = x-avgx
. egen x = median(x2-x1) (expression, - means subtraction)
. egen y = rmean(x1 x2 x3)
. egen y = rmean(x1-x3) (varlist, - means through)
. egen sdx = sd(x)
. egen stdx = std(x), mean(100) std(10)
. egen sumx = sum(x), by(y)
. egen xy = group(x y)
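Some of these examples are put together in the following sketch, which uses an artificial dataset and the function names given in this section (the variable names are mine):

```stata
clear
set obs 6
generate g = 1 + (_n > 3)
generate x1 = _n
generate x2 = 2*_n
generate x3 = 3*_n
* Overall mean of x1 and deviations from it
egen avgx = mean(x1)
generate dev = x1 - avgx
* Sum of x1 within each group defined by g
egen sumx = sum(x1), by(g)
* Row mean over the varlist x1 through x3
egen rowavg = rmean(x1-x3)
list
```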
2.2.5 REPLACE
This command changes the contents of an existing variable. Its syntax is:
2.3.1 APPEND
This command appends a Stata-format dataset stored on disk to the end of the dataset
in memory. Its syntax is:
append using filename [, nolabel]
where nolabel prevents copying the value label definitions from the disk dataset. Even
if this option is not specified, label definitions from the disk dataset never replace
definitions already in memory.
If filename is specified without an extension, .dta is assumed.
2.3.2 MERGE
This command joins corresponding observations from the dataset currently in memory
(called the master dataset) with those from the Stata-format dataset stored as filename
(called the using dataset) into single observations (if filename is specified without an
extension, .dta is assumed). It can perform both one-to-one and match merges. Its
syntax is:
merge [varlist] using filename [, nolabel update replace
nokeep _merge(varname)]
where nokeep causes merge to ignore observations in the using data that have no
corresponding observation in the master (the default is to add these observations to
the merged result and mark them with _merge==2) and _merge(varname) specifies
the name of the variable that will mark the source of the resulting observation. The
default is _merge(_merge), which adds a new variable _merge to the data whose
values are:
_merge==1 (obs. from master data)
_merge==2 (obs. from using data)
_merge==3 (obs. from both master and using data)
Examples:
. use data1 (one-to-one merge)
. merge using data2
. tab _merge
. use data2 (match merge)
. sort x
2.4.1 COLLAPSE
This command replaces the data in memory with a new dataset consisting of the
means, medians, etc. of the specified variables. Its syntax is:
or any combination of the varlist or target_var forms, and stat is one of the following:
mean (means), sd (standard deviations), sum (sums), rawsum (sums ignoring optionally
specified weights), count (number of nonmissing observations), max (maxima), min
(minima), median (medians), p# (#th percentile), iqr (interquartile range). If stat is
not specified, mean is assumed.
The by(varlist) option specifies the groups over which the means, etc., are to be
calculated, cw specifies casewise deletion (if not specified, all observations possible are
used for each calculated statistic) and fast specifies that collapse not go to extra work
so that it can restore the original data should the user press Break.
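As a sketch, assuming the target_var=varname form mentioned above, the following reduces an artificial dataset to one observation per group, holding the group mean and median of y (all names are invented):

```stata
clear
set obs 6
generate g = 1 + (_n > 3)
generate y = _n
* One observation per value of g, with the mean and the median of y
collapse (mean) avgy=y (median) medy=y, by(g)
list
```

Since collapse replaces the data in memory, the original dataset should be saved first if it is still needed.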
2.4.2 CONTRACT
This command makes datasets of frequencies. It replaces the data in memory with a
new dataset consisting of all combinations of varlist that exist in the data together
with a new variable that contains the frequency of each combination. Its syntax is:
where freq(varname) specifies a name for the frequency variable (if not specified,
_freq is used, the name must be new), zero specifies that combinations with frequency
zero are wanted, and nomiss specifies that observations with missing values on any of
the variables in varlist will be dropped (if not specified, all observations possible are
used).
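For instance, on a small artificial dataset with two categorical variables (the names a, b and nobs are mine):

```stata
clear
set obs 6
generate a = mod(_n, 2)
generate b = (_n > 4)
* One observation per observed combination of a and b, with its frequency in nobs
contract a b, freq(nobs)
list
```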
2.4.3 EXPAND
This command replaces each observation in the current dataset with n copies of the
observation, where n is equal to the integer part of the required expression (if the
expression is less than one or equal to missing, then it is interpreted as if it were one,
and the observation is retained but not duplicated). Its syntax is:
expand [=]exp [if exp] [in range]
Example:
. expand 2
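The expression may also be a variable, in which case the number of copies differs across observations. In the sketch below (artificial data), the three original observations become 1 + 2 + 3 = 6 observations:

```stata
clear
set obs 3
generate id = _n
generate n = _n
* Observation i is replaced by n copies of itself
expand n
sort id
list
```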
2.4.4 FILLIN
This command rectangularizes a dataset by adding observations with missing data
so that all interactions of the variables in varlist exist. It also adds the variable
_fillin to the data (with value 1 for created observations and 0 for previously existing
observations). Its syntax is:
fillin varlist
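Example, on artificial data in which only three of the six possible combinations of a and b are present:

```stata
clear
set obs 3
generate a = _n
generate b = 1
replace b = 2 in 3
* Only (1,1), (2,1) and (3,2) exist; fillin adds the other three combinations
fillin a b
list
```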
2.4.5 RESHAPE
This command converts data from wide to long form and vice versa. Its basic syntax
is:
reshape wide varnames, i(varlist) [j(varname) string]
reshape long varnames, i(varlist) [j(varname) string]
where i(varlist) specifies the variable(s) whose unique values denote a logical
observation, j(varname) specifies the variable whose unique values denote a
subobservation, and string specifies that the j() may contain string values.
Examples:
. reshape long x1 x2, i(y) j(z) (converts from wide to long)
. reshape wide (converts back to wide)
. reshape ..., i(z) (single i() variable)
. reshape ..., i(z1 z2) (two i() variables)
. reshape long x, i(y) j(z 1-3 5) (specifying j() values)
. reshape long x, i(y) j(z) string (allow string variables in j())
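A complete round trip on a tiny artificial dataset may help to fix ideas (the names id, x and year are invented for illustration):

```stata
clear
set obs 2
generate id = _n
generate x1 = 10*_n
generate x2 = 20*_n
* Wide to long: x1 and x2 become a single variable x indexed by year
reshape long x, i(id) j(year)
list
* Long back to wide
reshape wide x, i(id) j(year)
list
```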
2.5.1 COUNT
This command counts observations satisfying the specified conditions. Its syntax is:
2.5.2 SUMMARIZE
This command reports a variety of univariate summary statistics. Its syntax is:
summarize [varlist] [weight] [if exp] [in range] [,
{detail|meanonly} format]
where detail produces additional statistics (including skewness, kurtosis, the four
smallest and four largest values, along with various percentiles), meanonly suppresses
display of the results and calculation of the variance (it is allowed only when detail
is not specified) and format requests that the summary statistics be displayed using
the display format associated with the variables rather than the default g format.
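Example, on an artificial variable. I assume here that summarize leaves its results in r(), so that the mean can be retrieved after a meanonly call:

```stata
clear
set obs 100
generate x = _n
summarize x
summarize x, detail
* meanonly displays nothing; the mean is retrieved from the saved results
summarize x, meanonly
display r(mean)
```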
2.5.3 MEANS
This command reports the arithmetic, geometric, and harmonic means, along with
their respective confidence intervals, for the specified variables. Its syntax is:
means [varlist] [if exp] [in range] [, add(#) only level(#)]
where add(#) adds the value # to each variable in varlist before computing the means
and confidence intervals (this may be useful when analyzing variables with nonpositive
values), only modifies the action of the add() option (if specified, the add() option
only adds # to variables with at least one nonpositive value) and level(#) specifies
the percentage confidence level for confidence intervals.
The ci command may be used if one simply wants arithmetic means and corresponding
confidence intervals.
2.5.4 CENTILE
This command reports the (per)centiles of the specified variables and their confidence
intervals. By default, confidence intervals are obtained using a binomial method that
makes no assumptions as to the underlying distribution of the variable. The syntax is:
centile [varlist] [if exp] [in range] [, centile(numlist) cci
normal meansd level(#)]
where centile(numlist) specifies the centiles to be reported, for example centile(25
50 75) (if not specified, medians are reported), cci (conservative confidence interval)
prevents centile from interpolating when calculating the distribution-free (binomial-
based) confidence limits, normal specifies that confidence intervals are to be obtained
assuming that both the data and the centiles are normally distributed, and meansd
calculates confidence intervals assuming that the estimated centiles themselves are
normally distributed.
The related command
pctile newvar = exp
creates a new variable containing the percentiles of exp, where exp is typically just
another variable.
2.5.5 CUMUL
This command creates a new variable containing the empirical distribution function
(edf) of a variable. Its syntax is:
cumul varname [weight] [if exp] [in range] , gen(newvar) [freq
by(varlist)]
where gen(newvar) specifies the name of the new variable to be created (it is not
optional), freq requests the edf to be in frequency units; otherwise it is normalized so
that newvar is 1 for the largest value of varname, and by(varlist) specifies that edf’s
be generated separately for each by-group.
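Example, on artificial data (the name Fx is mine). Sorting by x before listing makes the edf easier to read:

```stata
clear
set obs 5
generate x = 6 - _n
* Fx holds the empirical distribution function of x
cumul x, gen(Fx)
sort x
list x Fx
```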
2.5.6 CORRELATE
This command reports the covariance or correlation matrix of the specified variables.
Observations are excluded from the calculation due to missing values on a casewise
basis. The syntax is:
correlate [varlist] [weight] [if exp] [in range] [, means
noformat covariance wrap]
where means causes summary statistics (means, standard deviations, minima and
maxima) to be displayed along with the matrix, noformat displays the summary
statistic requested by the means option in g format regardless of the display formats
associated with the variables, covariance displays the covariances rather than the
correlation coefficients, and wrap requests that no action be taken on wide matrices
to make them readable.
2.5.7 REGRESS
This command estimates linear regression models with a single response or dependent
variable. Estimation is carried out by least squares (either ordinary least squares or
weighted least squares). Its basic syntax is:
regress yvar [xvars] [weight] [if exp] [in range] [, level(#)
noconstant regress_options]
where level(#) specifies the confidence level (in percent) for the regression
parameters (the default is 95%), noconstant suppresses the constant term (intercept)
in the regression, and the additional regress_options are described in more detail in
Section 6.1.1.
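Example, on artificial data. The coefficients 1 and 2 and the deterministic disturbance mod(_n, 3) are arbitrary choices made only to obtain something to estimate:

```stata
clear
set obs 50
generate x = _n
generate y = 1 + 2*x + mod(_n, 3)
* Default output, with 95% confidence intervals
regress y x
* 90% confidence intervals
regress y x, level(90)
* Regression through the origin
regress y x, noconstant
```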
2.6 TABLES
Stata offers three basic commands for producing tables: table, tabulate and
tabulate, summarize.
2.6.1 TABLE
This command provides tables of summary statistics. Its syntax is a little involved:
table rowvar [colvar [supercolvar]] [weight] [if exp] [in range]
[, contents(clist) by(superrow_varlist) cw row col scol
format(%fmt) center left concise missing replace name(string)
cellwidth(#) csepwidth(#) scsepwidth(#) stubwidth(#)]
where contents(clist) specifies the content of the table’s cells (up to 5 statistics may
be specified, if contents() is not specified it is assumed to be contents(freq)), clist
is as in collapse, row specifies a row is to be added to the table reflecting the total
across rows, col specifies a column is to be added to the table reflecting the total
across columns, format(%fmt) specifies the display format for presenting numbers in
the table’s cells, center specifies results are to be centered in the table’s cells (the
default is to right align), left specifies that column labels are to be left aligned (the
default is to right align), missing specifies that missing statistics are to be shown in
the table as periods (the default is to leave them blank). See the on-line help or the
Reference Manual for a description of the other options.
2.6.2 TABULATE
This command provides one- and two-way tables of frequency counts along with
various measures of association, including the common Pearson chi-squared, the
likelihood ratio chi-squared, Cramer’s V, Fisher’s exact test, Goodman and Kruskal’s
gamma, and Kendall’s tau-b.
The syntax for one-way tables is:
tabulate varname [weight] [if exp] [in range] [, generate(varname)
matcell(matname) matrow(matname) missing nofreq nolabel
plot subpop(varname)]
The syntax for two-way tables is:
tabulate varname1 varname2 [weight] [if exp] [in range] [,
all cell chi2 column exact gamma lrchi2 matcell(matname)
matcol(matname) matrow(matname) missing nofreq nolabel
row taub V wrap]
where all is equivalent to specifying chi2 lrchi2 V gamma taub, cell displays
the relative frequency of each cell in a two-way table, chi2 calculates and displays
Pearson’s chi-squared for the hypothesis that the rows and columns in a two-way
table are independent, column displays in each cell of a two-way table the relative
frequency of that cell within its column, exact displays the significance calculated
by Fisher’s exact test, gamma displays Goodman and Kruskal’s gamma along with
2.6.3 TABSUM
The tabulate, summarize() command produces one- and two-way tables of
summary statistics. Although table is better, tabulate, summarize() is faster. Its
syntax is:
tabulate varname1 [varname2] [weight] [if exp] [in range] ,
summarize(varname3) [[no]means [no]standard [no]freq [no]obs
wrap nolabel missing]
where summarize(varname3) identifies the name of the variable for which summary
statistics are to be reported (if this option is not specified, then a table of frequencies
is produced), [no]means includes only or suppresses only the means from the table
(the summarize() table normally includes the mean, standard deviation, frequency,
and, if the data is weighted, the number of observations), [no]standard includes only
or suppresses only the standard deviations from the table, [no]freq includes only or
suppresses only the frequencies from the table, [no]obs includes only or suppresses
only the reported number of observations from the table, and missing requests that
missing values of varname1 and varname2 be treated as categories rather than as
observations to be omitted from analysis.
Examples:
. tabulate y, summarize(z) (one-way table)
. tabulate y1 y2, summarize(z) (two-way table)
. sort x (n-way table)
. by x: tabulate y1 y2, summarize(z) means nofreq
3
Graphics
Stata graphics are not very fancy, but allow considerable flexibility and are relatively
simple to use.
After discussing some options of the graph command that are common across all styles
(Section 3.2), I shall focus on the first four styles.
• Printing a graph: after the graph command, use the Print button in the Stata
toolbar.
• Specifying titles: graph allows up to two titles on every side of the graph (top,
bottom, left and right), denoted by the options t1, t2, b1 (same as title or
ti), b2, l1, l2, r1, and r2. The first title (e.g. t1) is always the farther from
the figure, the second (e.g. t2) the closer. The argument of each option
is some text enclosed in quotes. Quotes can be omitted if the text contains no
special characters.
Example:
. graph y x, l1(y) b2(x) title("Figure 1: x-y scatterplot")
• Setting the gap: gap(#) sets the amount of space between the left title and
the values along the y-axis. The default is gap(8).
• Labeling axes: by default, graph labels just the minimum and maximum
of each variable. More aesthetically pleasing results may be obtained
with the options {x|y|r|t}label[(#,...,#)]. Typed without arguments,
{x|y|r|t}label chooses “round” values to be labelled.
• Adding lines: lines across the graph may be drawn with the options
{x|y|r|t}line[(#,...,#)]. The yline and rline options draw horizontal
lines, xline and tline draw vertical lines.
Example:
. graph y x, gap(4) xlabel ylabel(0,1) ytick(.5) yline(0,.5,1)
• Setting the scale: by default, graph scales each axis according to the minimum
and maximum of all things that go on the axis (data, labeling or ticking). The
options {x|y|r}scale(#,#) may be used to widen (but never to narrow)
the scale used for drawing a graph on any style that has an axis.
• Setting the axes rendition: by default, graph draws an axis on any style that
has an axis. The border option replaces axes with borders. The noaxis option
suppresses both axes and borders.
• Creating log scales: the log option is used with the histogram style, the
{x|y|r}log options with the twoway style.
• Plotting symbols: graph uses the following plotting symbols to specify the
location of a point on a scatterplot: O (large circle, default for twoway), S
(large square), T (large triangle), o (small circle, default for twoway with by, or
matrix), d (small diamond), p (small plus), . (dot), i (invisible), [varname]
(variable to be used as text), _n (observation number).
The sequence of plotting symbols for the variables in varlist is specified with
the option symbol(s...s), where s is any of the above symbols. By default,
twoway chooses the symbols O, T, S, and the remainder . if symbol is not
specified. Combined with by(), it chooses instead o, p, d, and the remainder
. by default.
• Line patterns: For each line type, one can specify the pattern of the line by
adding a [pattern] after the line type, where pattern is any combination of
the following: l (a solid line, the default), _ (a long dash), - (a medium dash),
. (a short dash, almost a dot) and # (a space).
Example:
. graph y1 y2 y3 x, symbol(...) connect(ll[_]l[-])
The set textsize # command controls the size of the text used in a graph. This is
not an option but a separate command that must be issued before graph.
I refer to the Graphics Manual for other common options and to the remainder of this
chapter for options specific to the various graph styles.
3.3 HISTOGRAMS
This is the default for graph when only one variable is specified. The basic syntax is:
where bin(#) specifies the number of (equally spaced) bins to use for constructing
the histogram (the default is bin(5)), freq and percent affect how the vertical
axis is labeled (respectively, in frequency units and in percent), normal[(#,#)]
overlays a normal density with specified mean and standard deviation (normal by
itself uses the observed mean and standard deviation), and density(#) (only used
with normal) specifies the number of points along the density to be calculated (the
default is density(100)).
Examples:
where rescale scales each y-variable independently (if there are two y-variables, the
scale of the first is presented on the left axis and the scale for the second on the right
axis; if there are more than two y-variables, no vertical scale is labeled), rbox places a
rangefinder box plot on the graph, and {y|x|r}reverse reverses the indicated scale
to run from high-to-low.
Examples:
4.1.1 MACROS
A macro is a user-defined string of characters, called the macro name, that stands for
another string of characters, called the macro content.
Stata has two types of macros, local and global. Local macros are private, that is,
specific to the program where they are defined, whereas global macros are public. Their
content is set by the local and global commands, respectively. Their general syntax is
{local|global} mname [[`]"[string]"[']|= exp|: extended_fcn]
where the macro name mname can be up to 7 characters long for local macros and
up to 8 characters for global macros, and exp may be either a numeric or a string
expression. For the use of an extended macro function, see the Stata manual.
To copy string to mname (the maximum length of string is 18,623 characters) use:
{local|global} mname "string"
To evaluate exp and store the result in mname (the maximum length of exp is 80
characters) use:
{local|global} mname = exp
Macros can be used everywhere in programs, do-files and ado-files (see below). The
content of a local macro is accessed by enclosing the macro name in ` and ', that of a
global macro by prefixing the name with a $. Stata then simply replaces the reference
with the content of the macro.
Examples:
local options "gap(4) sy(i) xlab(10) ylin"
graph x y, `options'
sort z
global options "by(z) gap(4) sy(.) c(l) xlab ylab"
graph x y, $options
If the content of a macro includes double quotes, it may be enclosed in compound
double quotes, `" and "', when the macro is defined.
Typing macro drop mname eliminates the global macro mname. Typing macro drop
_all eliminates all global macros.
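The difference between copying a string and evaluating an expression is easiest to see side by side. The following is only a minimal sketch, meant to be run as a single do-file; the macro names a, b, q and g are hypothetical and chosen just for illustration.

```stata
* Copying a string versus evaluating an expression
local a "2+2"
local b = 2+2

* Shows the text 2+2, because the macro content is inside a quoted string
display "`a'"

* After substitution Stata sees: display 2+2, so this shows 4
display `a'

* Shows 4, because the expression was evaluated when b was defined
display `b'

* Content that itself includes double quotes needs compound double quotes
local q `"He said "hello""'
display `"`q'"'

* A global macro is referenced with a dollar sign and dropped by name
global g "gap(4)"
display "$g"
macro drop g
```

Storing the unevaluated text (as in a) is what makes macros useful for holding option lists, while storing the evaluated result (as in b) is what one usually wants for counters.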
4.1.3 LOOPING
Stata provides two commands for looping, while and forvalues. The syntax of while
is simpler:
while exp {
Stata commands
}
This command evaluates exp and, if it is true (nonzero), executes the commands
enclosed in the braces. It then repeats the process until exp evaluates to false (zero).
while loops may be nested. If exp refers to any variables, their values in
the first observation are used unless explicit subscripts are specified.
Example: The following code fragment may be used to iterate Stata commands 10
times:
local i = 1
local I = 10
while `i'<=`I' {
Stata commands
local i = `i'+1
}
The while command may also be used interactively.
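As a concrete instance of the pattern above, the sketch below sums the integers from 1 to 10, first with while and then with forvalues. The macro names i, j and total are hypothetical, and since the syntax of forvalues is not given in this section, the range form 1/10 used here is my assumption about the simplest way to call it.

```stata
* Sum the integers 1 to 10 with while
local i = 1
local total = 0
while `i' <= 10 {
    local total = `total' + `i'
    local i = `i' + 1
}
display "total = `total'"

* The same computation with forvalues, which handles the counter itself
local total = 0
forvalues j = 1/10 {
    local total = `total' + `j'
}
display "total = `total'"
```

Both loops display total = 55. With while, forgetting the line that increments the counter produces an endless loop; forvalues avoids this risk when the number of iterations is known in advance.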
4.1.4 BRANCHING
The syntax of this programming command is:
if exp {
Stata commands
}
else { other Stata commands }
This command evaluates exp. If the result is true (nonzero), the commands inside the
braces are executed. If the result is false (zero), those statements are ignored and the
statements following the else, if specified, are executed.
Do not confuse this command with the if qualifier at the end of a command.
Example:
if x>0 {
replace y = log(x)
}
else if x<0 {
replace y = log(-x)
}
else {
replace y = .
}
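The distinction between the if command and the if qualifier can be illustrated with a small sketch. The data below are made up (x takes the values -2, -1, 0, 1, 2) and the variable names are hypothetical.

```stata
* Illustrative data: x = -2, -1, 0, 1, 2
clear
set obs 5
generate x = _n - 3
generate y = .

* if qualifier: the condition is checked separately for every observation
replace y = log(x)  if x > 0
replace y = log(-x) if x < 0
list x y

* if command: x is read from the first observation only, so the
* block is either executed once or skipped altogether
if x > 0 {
    display "x is positive in the first observation"
}
else {
    display "x is not positive in the first observation"
}
```

Here the first observation has x equal to -2, so the else branch is the one executed. To transform a variable observation by observation, the if qualifier is usually what is wanted.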
local i = `i´+1
}
end
Typing return list after an r-class command or estimates list after an e-class
(estimation class) command summarizes what the command saved.
Results saved in r() and e() come in three flavors: scalars, macros and matrices. For
example, the number of observations used by a command is saved in the scalar r(N)
or e(N). After an e-class command, the command name and the name of the response
(dependent) variable are saved in the macros e(cmd) and e(depvar), whereas the
estimated coefficients and their variance matrix are saved in the matrices e(b) and
e(V).
Regardless of their flavor, one may refer to saved results in two ways. One is
simply to type r(name) or e(name). The other is to use macro substitution characters
to produce `r(name)' or `e(name)'.
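A minimal sketch of both ways of referring to saved results, using summarize as the r-class command. The data are made up (x runs from 1 to 10) and the macro names n and m are hypothetical.

```stata
* Illustrative data: x = 1, 2, ..., 10
clear
set obs 10
generate x = _n

summarize x
return list

* First way: use r(name) directly in an expression
display "mean = " r(mean)

* Second way: macro substitution inserts the value as text
display "mean = `r(mean)'"

* Copy results into local macros before another command overwrites r()
local n = r(N)
local m = r(mean)
display "N = `n', mean = `m'"
```

The last step matters in practice because the next r-class command replaces everything stored in r().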
4.2 DO FILES
A do-file is a standard ASCII (text) file containing a sequence of Stata commands, a
separate command on each line. The sequence of commands is executed using the do
or run commands, whose syntax is
{do|run} filename [arguments] [, nostop]
where nostop allows the do-file to continue executing even if an error occurs. If
filename is specified without an extension, .do is assumed.
The difference between do and run is that do echoes the commands and their output,
while run is silent.
A do-file completes the execution when: (i) the end of the file is reached, (ii) an exit is
executed, or (iii) an error (nonzero return code) occurs (pressing Break while executing
a do-file causes a nonzero return code and therefore stops the do-file).
A do-file may be used to define one or more programs or may call programs already
defined. Do-files may also call other do-files. As for programs, Stata allows do-files to
be nested 32 deep.
Here are some rules and recommendations for constructing a do-file.
• Start a do-file by typing version #, where # is the Stata release under which
the file was written. This allows the do-file to run under later releases.
• Blank lines and comments may be included freely. Their proper use may
considerably enhance understanding of a program.
• To avoid lines wider than the screen, the end-of-line delimiter may be changed
from carriage return to, say, ‘;’ by including the #delimit ; command. The
delimiter may later be changed back to carriage return by including the
#delimit cr command.
• The output of a do-file may be sent to a log file by including the command
log using filename [, replace]. Logging stops and the log file is closed
when log close is encountered. Output to the log file is suppressed if run is
used to execute a do-file.
• To prevent Stata from pausing when the screen is full, include the set more
off command.
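Putting these recommendations together, a do-file might be laid out as in the sketch below. The release number 7, the file names example and mydata, and the variables x and y are all placeholders of mine, not taken from any particular application.

```stata
* example.do: skeleton of a do-file following the rules above
* (release number, file names and variable names are placeholders)
version 7
set more off
log using example, replace

#delimit ;
use mydata, clear ;
summarize x y,
    detail ;
#delimit cr

log close
```

While the delimiter is set to ; a command may run over several lines, as the summarize command does here, and every command must end with a semicolon.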
Do-files accept arguments, just like programs. Arguments are stored in the local macros
`1', `2', and so on. For example, to repeat the same set of instructions for different
variables one could write the do-file try.do:
use mydata, clear
drop if `1'==.
summarize `2', detail
and then execute it by typing
. do try x y
The second command (drop if `1'==.) would be interpreted as drop if x==.
because x is the first argument typed after do try. The third command (summarize
`2', detail) would be interpreted as summarize y, detail because y is the second
argument typed after do try.
Ado-files typically come with an associated help-file. Typing help ci (or pulling down
Help and searching for ci) prompts Stata to look for the file ci.hlp, just as it looks
for the file ci.ado after the command ci is typed.
Stata looks for ado-files (and the associated help-files) in several places: the official
ado-directories (the base directory and the updates directory), the personal
ado-directories, and the current directory.
• B' (transposition),
• B + C (addition),
• B - C (subtraction),
• B # C (Kronecker product).
Parentheses may be used to enforce a particular order of evaluation.
Examples
. matrix C = (B + B')/2
The matrix functions returning a scalar are:
Matrix functions returning a scalar may be used in any expression context, not just
matrix expression contexts.
The matrix functions returning a matrix are:
Examples
. matrix beta = syminv(X'*X)*X'*y
. display trace(X)
. matrix L = cholesky(0.1*I(rowsof(X)) + 0.9*X)
There are matrix utilities to list the currently defined matrices (matrix dir), display
the contents of a matrix (matrix list), rename a matrix (matrix rename) and drop
a matrix (matrix drop). The matrix drop _all command drops all matrices.
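The operators, functions and utilities above can be tried on a small matrix entered by hand. The sketch below is only illustrative; the matrix names B, C and L and the numbers are hypothetical.

```stata
* Enter a 2 x 2 matrix: commas separate columns, backslashes separate rows
matrix B = (2, 1 \ 3, 4)

* Symmetric part of B, with rows (2, 2) and (2, 4)
matrix C = (B + B')/2
matrix list C

* Functions returning a scalar can be used in any expression
display trace(C)
display det(C)

* C is positive definite, so its Cholesky factor exists
matrix L = cholesky(C)
matrix list L

* List the matrices currently defined, then drop them
matrix dir
matrix drop B C L
```

For this C the trace is 6 and the determinant is 4.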
• To review the last estimates, just type the estimation command without
arguments.
• After estimation, one may obtain the estimated variance matrix of the
estimators using the vce command (Section 5.2.2).
• After estimation, one may obtain predictions, residuals and influence statistics
using the predict command (Section 5.2.3).
• After estimation, one can perform tests of hypotheses about the model
parameters (Section 5.2.4).
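The four steps above might look as follows after a linear regression. This is a sketch on simulated data: the variable names, the seed and the data-generating process are all made up, and the random-number functions uniform() and invnorm() are the older names, which I am assuming are available in the release being used.

```stata
* Illustrative data from a made-up linear model
clear
set obs 100
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate y  = 1 + 2*x1 - x2 + invnorm(uniform())

* Estimate, then replay the results by typing the command alone
regress y x1 x2
regress

* Estimated variance matrix of the estimators
* (later releases offer the same table through estat vce)
vce

* Fitted values and residuals
predict yhat
predict e, residuals

* Tests of hypotheses about the parameters
test x1 = 2
test x1 + x2 = 1
```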
4. Adding the nooffset option to any of the above causes the calculation to
ignore any offset or exposure variable specified in the estimation command.
8. Some statistics make sense only with respect to the estimation sample. In such
cases, the calculation is automatically restricted to the estimation sample.
• The nooffset option may be combined with most statistics and specifies
that the calculation should be made ignoring any offset or exposure variable
specified when the model was estimated. This option is available even if not
documented for predict after a specific command. If neither the offset()
nor the exposure() option was specified at the model estimation stage,
specifying nooffset does nothing.
5.3.1 BOOTSTRAP
The command
bstrap progname [, reps(#) size(#) dots args(...) level(#)
cluster(varnames) idcluster(newvarname) saving(filename)
double every(#) replace noisily]
runs the user-defined program progname reps(#) times on bootstrap samples of size
size(#).
The command
bs "command" "exp_list" [, bstrap_options nowarn noesample]
runs the user-specified command bootstrapping the statistics specified in exp_list.
The expressions in exp_list must be separated by spaces and there must be no spaces
within each expression. Note that command and exp_list must both be enclosed in
double quotes. This command takes the same options as bstrap except for args().
The command
bstat varlist [, stat(#) level(#) title(text)]
displays bootstrap estimates of standard error and bias, and calculates confidence
intervals using three different methods: normal approximation, percentile, and bias
corrected. The bstrap and bs commands automatically run bstat after completing
all the bootstrap replications. If the user specifies the saving(filename) option with
bstrap or bs, then bstat can be run on the data in filename to view the bootstrap
estimates again.
Finally, the command
bsample [exp] [, cluster(varnames) idcluster(newvarname)]
is a low-level utility for those who prefer not to use bstrap or bs. It draws a sample
with replacement from the existing data; the sample replaces the data in memory;
exp specifies the size of the sample and must be less than or equal to _N. If exp is
not specified, a sample of size _N is drawn (or of size n_c when the cluster() option is
specified, where n_c is the number of clusters).
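To show what the low-level route involves, the sketch below bootstraps the standard error of a sample mean by hand with bsample. The data, the seed, the number of replications and the macro names reps, r, s1, s2 and se are all hypothetical, and uniform() is the older name of the random-number function.

```stata
* Illustrative data: 50 draws of a made-up variable x
clear
set obs 50
set seed 2001
generate x = uniform()

* Accumulate the sum and the sum of squares of the bootstrap means
local reps = 200
local s1 = 0
local s2 = 0
local r = 1
while `r' <= `reps' {
    preserve
    bsample
    quietly summarize x
    local s1 = `s1' + r(mean)
    local s2 = `s2' + r(mean)^2
    restore
    local r = `r' + 1
}

* Standard deviation of the bootstrap means
local se = sqrt((`s2' - `s1'^2/`reps')/(`reps' - 1))
display "bootstrap standard error of the mean: `se'"
```

Because bsample replaces the data in memory, each replication is wrapped in preserve and restore so that the next one starts again from the original sample.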
In addition to predict, the following commands can be used after regress for
diagnosing sensitivity to individual observations:
6.2.1 GLM
The glm command fits generalized linear models. Estimation is carried out either by
iteratively reweighted least squares (IRLS) or by using the Newton-Raphson (NR)
method, which is the default.
The basic syntax is:
glm yvar [xvars] [weight] [if exp] [in range] [, max_options
var_options output_options spec_options]
The max_options are:
iterate(#) ltolerance(#) mu(varname) nolog search fisher(#) irls
where iterate(#) specifies the maximum number of iterations allowed in estimating
the model (iterate(50) is the default), ltolerance(#) specifies the convergence
criterion for the change in deviance between iterations (ltolerance(1e-6) is the
default), mu(varname) specifies varname as the initial estimate for the mean of yvar,
nolog suppresses the iteration log, search specifies that the command should search
for good starting values, fisher(#) specifies the number of NR steps that should
use the Fisher scoring Hessian or expected information matrix before switching to the
observed information matrix (both search and fisher() are only useful with NR
optimization, not with IRLS), and irls requests IRLS minimization of the deviance
instead of NR maximization of the log-likelihood.
The var_options are:
oim opg vfactor(#) robust cluster(varname) unbiased
nwest(wtname [#]) jknife jknife1 bstrap brep(#)
scale(x2|dev|#) disp(#) score(newvar) t(varname)
where oim specifies that the variance matrix should be calculated using the observed
information matrix rather than the usual expected information matrix (option ignored
if irls is not specified), opg specifies that the variance matrix be calculated using
the Berndt, Hall, Hall, and Hausman (1974) variance estimator (this option is not
allowed when cluster() is specified), vfactor(#) specifies a scalar by which to
multiply the resulting variance matrix, robust and cluster(varname) have already
been defined, unbiased specifies that the unbiased sandwich estimate of variances
be used (robust is implied when unbiased is used), nwest(wtname [#]) specifies
that a heteroskedasticity and autocorrelation consistent variance estimate be used,
jknife and jknife1 specify that jackknife estimates of variance be used, bstrap
specifies that the bootstrap estimate of variance be used, brep(#) specifies the
number of bootstrap samples to consider in forming the bootstrap estimate (the
default is brep(199)), scale(x2|dev|#) overrides the default scale parameter (by
default, scale(1) is assumed for discrete distributions and scale(x2) for continuous
distributions), scale(x2) specifies the scale parameter be set to the Pearson chi-
squared (or generalized chi-squared) statistic divided by the residual degrees of
freedom, scale(dev) sets the scale parameter to the deviance divided by the
residual degrees of freedom (this provides an alternative to scale(x2) for continuous
distributions and over- or under-dispersed discrete distributions), scale(#) sets the
scale parameter to #, disp(#) multiplies the variance of yvar by # and divides
the deviance by #, score(newvar) creates the new variable newvar containing each
observation’s contribution to the score, and t(varname) specifies the variable name
corresponding to the time index (this option is required if nwest() is specified).
The output_options are:
eform level(#) trace noheader nodisplay nodots
where eform displays the exponentiated coefficients and corresponding standard errors
and confidence intervals (for binomial models with the logit link, exponentiation results
in odds ratios; for Poisson models with the log link, exponentiated coefficients are
rate ratios), trace requests that the estimated coefficient vector be printed at each
iteration, noheader suppresses the header information from the output (the coefficient
table is still printed), nodisplay suppresses the output (the iteration log is still
displayed), and nodots specifies that a dot should not be printed for each fitted model
when calculating jackknife or bootstrap estimates (by default, a single dot character
is printed for each estimation that is performed).
The spec_options are:
family(familyname) link(linkname) noconstant [ln]offset(varname)
where family(familyname) specifies the parametric family, link(linkname) specifies
the link function, noconstant specifies that the linear predictor has no intercept term,
and [ln]offset(varname) specifies an offset to be added to the linear predictor.
familyname is either a user-written program or one of: binomial (Bernoulli/binomial),
gamma, gaussian, igaussian (inverse Gaussian), nbinomial (negative binomial),
poisson.
linkname is either a user-written program or one of: cloglog (complementary log-log),
identity, log, logit, logc (log-complement), loglog (log-log), nbinomial (negative
binomial), opower # (odds power), power #, probit.
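As an illustration of how the family and link options combine with the others, the sketch below fits a logit and a Poisson model on simulated data. The variable names, the seed and the data-generating process are made up, and uniform() is the older name of the random-number function.

```stata
* Illustrative data: binary outcome d, count outcome c, exposure time t
clear
set obs 200
set seed 2001
generate x  = uniform()
generate d  = uniform() < 0.3 + 0.4*x
generate t  = 1 + int(5*uniform())
generate c  = int(3*t*x*uniform())
generate lt = log(t)

* Binomial family with logit link; eform reports odds ratios
glm d x, family(binomial) link(logit) eform

* Poisson family with log link, log exposure entered as an offset,
* fitted by IRLS instead of Newton-Raphson
glm c x, family(poisson) link(log) offset(lt) irls
```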
If family() is specified but not link(), then the canonical link for the family is
obtained, namely:
6.3.5 BIPROBIT
The biprobit command produces ML estimates of two-equation probit models, either
a bivariate probit or a system of two seemingly unrelated probit equations.
survival time, the default), median lntime (predicted median log survival time),
mean time (predicted mean survival time), mean lntime (predicted mean log survival
time), hazard (predicted hazard), hr (predicted hazard ratio), surv (predicted survival
probability), csnell (partial Cox-Snell residuals), or mgale (partial martingale-like
residuals).
• arch() (ARCH),
Some of the xt commands use time-series operators in their internal calculations and
thus require the data to be tsset.
• be (between-group estimator),
Notice that the xtlogit, fe command is equivalent to the clogit command. Also
notice that xtlogit, pa corresponds to
xtgee, family(binomial) link(logit) corr(exchangeable)
whereas xtprobit, pa corresponds to
xtgee, family(binomial) link(probit) corr(exchangeable)
Berndt E., Hall B., Hall R., and Hausman J.A. (1974) Estimation and Inference in Nonlinear
Structural Models. Annals of Economic and Social Measurement, 3/4: 653–665.
Breusch T.V. and Pagan A.R. (1980) The Lagrange Multiplier Test and Its Applications to
Model Specification in Econometrics. Review of Economic Studies, 47: 239–254.
Cleveland W.S. (1979) Robust Locally Weighted Regression and Smoothing Scatterplots.
Journal of the American Statistical Association, 74: 829–836.
Gould W. and Sribney W. (1999) Maximum Likelihood Estimation with Stata, Stata
Corporation, College Station, TX.
Hausman J.A. (1978) Specification Tests in Econometrics. Econometrica, 46: 1251–1272.
Heckman J.J. (1976) The Common Structure of Statistical Models of Truncation, Sample
Selection and Limited Dependent Variables and a Simple Estimator for Such Models.
Annals of Economic and Social Measurement, 5: 475–492.
Hosmer D.W. and Lemeshow S. (1989) Applied Logistic Regression, Wiley, New York.
Liang K.Y. and Zeger S.L. (1986) Longitudinal Data Analysis Using Generalized Linear
Models. Biometrika, 73: 13–22.
McCullagh P. and Nelder J.A. (1989) Generalized Linear Models (2nd ed.), Chapman and
Hall, London.
Newey W.K. and West K. (1987) A Simple, Positive Semi-Definite, Heteroskedasticity and
Autocorrelation Consistent Covariance Matrix. Econometrica, 55: 703–708.
Peracchi F. (2001) Econometrics, Wiley, Chichester, UK.
Pregibon D. (1981) Logistic Regression Diagnostics. Annals of Statistics, 9: 705–724.