
Training Module 1: Using Stata for Survey Data Analysis

Project: Poverty mapping and market access in Vietnam
Funding: New Zealand Embassy with coordination by The World Bank
Implementation: International Food Policy Research Institute (IFPRI) and the Institute for Development Studies (IDS)
Lead Trainer: Nicholas Minot, IFPRI
Dates: 5-9 August 2002
Host institutions: Information Center for Agriculture and Rural Development, Ministry of Agriculture and Rural Development, with the Ministry of Labor, Invalids, and Social Affairs and the Ministry of Planning and Investment
Hanoi, Vietnam

Module 1: Using Stata to Analyze Survey Data

IFPRI-IDS Poverty Mapping Project

Background

This is the first of three one-week training modules offered as part of the project Poverty mapping and market access in Vietnam. The project is funded by the Embassy of New Zealand and implemented by the International Food Policy Research Institute (IFPRI) in Washington, D.C. and the Institute for Development Studies (IDS) in Sussex, England. The training modules will cover the following topics:

1. Using Stata for survey data analysis
2. Introduction to geographic information systems (GIS)
3. Poverty mapping methods: Combining census and survey data

Four characteristics of these modules need to be emphasized because they have implications for the role of the participants.

- The training modules are not lecture courses. They are semi-structured, hands-on workshops in which trainees use computers to learn different methods of analyzing data. Active participation of the trainees is therefore expected and necessary to maximize the benefit from the training.
- The training modules focus on how to use computer software to implement a wide range of analytical methods. In order to cover this range, the course cannot provide detailed explanations of the statistical methods themselves, so it is assumed that trainees have some familiarity with concepts such as means, frequency distributions, and regression analysis.
- The training modules are cumulative: understanding the material of one day depends on having attended the course the day before. If you cannot attend every day for the full day, it will be difficult to understand the new material. For this reason, we will ask those who cannot attend regularly to withdraw to make space for other trainees.
- The training modules will be offered in English. Trainees are not expected to understand all the technical terms used in the course, but they should have a solid understanding of conversational English in order to take full advantage of the training.

At the end of each module, we will issue Certificates of Completion to each trainee who has attended all the sessions and mastered the concepts taught in the course. We reserve the right not to issue Certificates to trainees who do not attend all sessions or who do not master the material taught.

Objectives

The objective of this training module is to improve the ability of the trainees to use Stata to generate descriptive statistics and tables from survey data, and to carry out multiple linear regression analysis of those data. In particular, the course aims to train the participants in the following methods:

- basic file management such as opening, modifying, and saving files
- advanced file management such as merging, appending, and aggregating files
- documenting data files with variable labels and value labels
- generating new variables using various functions and operations
- creating tables to describe the distribution of continuous and discrete variables
- creating tables to describe the relationships between two or more variables
- using regression analysis to study the impact of various variables on a dependent variable
- testing hypotheses using statistical methods

N. Minot


Course requirements

In order to take full advantage of the materials taught in the course, trainees must have the following background:

- Conversational English that allows them to follow the instructions of the trainer
- Basic statistics, such as familiarity with the concepts of means, variance, frequency distributions, and regression analysis
- Familiarity with computers, including the keyboard and mouse

Organization of the course

The training course is divided into ten sections. We will cover some material in all ten sections, but we may not be able to cover all the material, depending on the background of the trainees.

Section 1: Introduction to survey data files
Section 2: Introduction to Stata
Section 3: Exploring data files with Stata
Section 4: Saving and using Stata output
Section 5: Creating new variables
Section 6: Making tables to describe data
Section 7: Making graphs
Section 8: Modifying data files
Section 9: Introduction to programming with Stata
Section 10: Regression analysis with Stata

Each section will include some training in the use of Stata commands and a practical application of these commands to the analysis of the 1998 Vietnam Living Standards Survey (VLSS). The VLSS contains over one hundred files, but we will focus our attention on the following files:
Table 1. Sample data programs from the 1998 VLSS

Questionnaire section    Topic                            Level        File name
Extracted from various   Household characteristics        Household    hhexp98n.dta
Section 1A               List of household members        Individual   scr01a2.dta
Section 2                Education                        Individual   scr02a.dta
Section 6A               Type of housing                  Household    scr06a.dta
Section 6B p1            Housing expenses                 Household    scr06b1.dta
Section 6B p2            Housing expenses                 Household    scr06b2.dta
Section 6C               Housing characteristics          Household    scr06c.dta
Section 9B1              Rice production                  Crop         scr09b1.dta
Section 9B2              Other food crop production       Crop         scr09b2.dta
Section 9B4              Perennial cash crop production   Crop         scr09b4.dta


SECTION 1: INTRODUCTION TO SURVEY DATA FILES

1. List of useful terms

The following are some key concepts that will be used throughout this training module. Most of you will be familiar with them, but it is worth reviewing the terms for those who may not know all of them.

Records (or cases or observations) are individual observations such as individuals, farm plots, households, villages, or provinces. They are usually considered to be the rows of the data file. For example, data set A (below) has 5 records and data set B has 6 records. The VLSS files usually have between 6000 and 120,000 records.

Variables are the characteristics, location, or dimensions of each record. They are considered the columns of the data file. In data set A (below), there are four variables: the household identification number, the region where the household lives, the size of the household, and the distance from the house to the nearest source of water. In data set B, there are six variables: the region, province, household, plot number, whether or not the plot is irrigated, and the size of the plot. The VLSS files usually have between 10 and 30 variables.

The level of the data set describes what each record represents. For example, in data set A (below), each record is a different household, so it is a household-level data set. In data set B (below), each record is a farm plot, so it is a plot-level data set. Note that in data set B more than one record can have the same household identification number.

Data set A

HHID   REG   HHSIZE   DISTWAT
3456     1        5       1.5
3457     1        5       0.4
3458     1        4       0.6
3459     2        2       5.1
3460     3        8       1.2

Data set B

REG   PROV   HH   PLOT   IRRIG   AREA
  1      4    1      1       1    1.5
  1      4    1      2       0    1.0
  1      5    3      1       1    0.5
  2     26    2      1       0    0.4
  2     26    2      2       1    1.0
  3     45    1      1       1    1.2

Key variables are the variables that are needed to identify a record in the data. In data set A, the variable HHID is enough to uniquely identify the record so HHID is the only key variable. In data set B, the key variables are REG, PROV, HH, and PLOT because all four variables are needed to uniquely identify the record. The first two records have the same region, province, and household, so these three variables are not enough to uniquely identify a record.
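The idea of key variables can be checked directly in Stata. The following is a minimal sketch, not part of the original course material: it re-enters data sets A and B by hand with input (variable names are written in lowercase here) and uses isid, which is available in recent versions of Stata, to test whether a list of variables uniquely identifies the records.

```stata
* Sketch only: data sets A and B typed in by hand, names in lowercase.
clear
input hhid reg hhsize distwat
3456 1 5 1.5
3457 1 5 0.4
3458 1 4 0.6
3459 2 2 5.1
3460 3 8 1.2
end
* No error: hhid alone identifies each record in data set A.
isid hhid

clear
input reg prov hh plot irrig area
1  4 1 1 1 1.5
1  4 1 2 0 1.0
1  5 3 1 1 0.5
2 26 2 1 0 0.4
2 26 2 2 1 1.0
3 45 1 1 1 1.2
end
* No error: the four variables together identify each record in data set B.
isid reg prov hh plot
* Three variables are not enough; capture traps the error so the run continues.
capture isid reg prov hh
local rc = _rc
display "return code without plot: `rc'"
assert `rc' != 0
```

If isid finds duplicates it stops with an error, which is why the last check is wrapped in capture and the return code is inspected instead.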


Discrete variables (or categorical variables) are variables that have only a limited number of different values. Examples include region, sex, income category, type of roof, and education level. Yes/no variables, such as whether a household has electricity, are also discrete variables.

Binary variables (or dummy variables) are a type of discrete variable that takes only two values. They may represent yes/no, male/female, have/don't have, or other variables with only two values.

Continuous variables are variables whose values are not limited to a fixed set. Examples include income, farm size, number of trees, rice consumption, coffee production, and distance to the road. Unlike discrete variables, continuous variables are usually expressed in units such as Vietnamese dong, kilometers, hectares, or kilograms, and may take fractional values (for example, 4.5639).

Variable labels are longer names associated with each variable to explain them in tables and graphs. For example, the variable label for HHSIZE might be "Household size" and the label for DISTWAT could be "Distance to water (km)". Whenever possible, variable labels should include the unit (e.g. km).

Value labels are longer names attached to each value of a variable. For example, if the variable REG has eight values, each value is associated with a name. REG=1 could be "Northeast Region", REG=2 could be "Northwest Region", and so on.

2. Structure of 1998 VLSS data files

The Vietnam Living Standards Survey was carried out in 1992-93 and 1997-98. In this section, we describe the 1997-98 VLSS, although most of the description fits the earlier survey as well, since the questionnaires and data files are quite similar. The 1998 VLSS had three types of questionnaires: a household questionnaire, a community questionnaire, and a price questionnaire. Here we focus on the household questionnaire.

Household questionnaire

The household questionnaire consists of 116 files and about 60 Mb of data (in Stata format). The files cover the following topics:

Section 1: Household members
Section 2: Education
Section 3: Health
Section 4: Employment
Section 5: Migration
Section 6: Housing
Section 7: Respondents for 2nd round
Section 8: Fertility
Section 9: Agriculture, forestry, and fishery activities
Section 10: Non-farm self-employment
Section 11: Food expenditures
Section 12: Non-food expenditures and durable goods
Section 13: Income from remittances
Section 14: Borrowing, lending, and savings

Each file contains the data for one section or sub-section of the questionnaire, usually covering several pages of the questionnaire. The file names include the section number, the part letter, and sometimes a number indicating the sub-part. For example, in the file

scr09B4.dta

09 refers to Section 9,
B refers to Part B,
4 refers to the 4th sub-part of Part B, and
.dta is the file extension for Stata data files.

Section 9 covers agriculture, Part B covers crop production, and Part B4 covers permanent industrial crops such as tea, coffee, and rubber.

Within each file, the variables are named according to the section and question number. For example, in the variable

s9b4q031

s9b4 refers to Section 9, Part B4,
q03 refers to question 3, and
1 refers to the 1st column within the question.

The variable s9b4q031 gives the area planted with a given crop, expressed in terms of hectares or number of trees. The next variable, s9b4q032, indicates whether the area is expressed in hectares or trees.
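Because the variable names follow a fixed pattern, a wildcard can select all the columns of one question at once. The sketch below is illustrative only: it builds a tiny stand-in data set with the two variable names mentioned above, and the values and the unit codes (1 and 2) are invented rather than taken from the VLSS.

```stata
* Hypothetical miniature of a Part B4 file; the values are made up.
clear
input s9b4q031 s9b4q032
0.5 1
120 2
end
* s9b4q03* matches every column of question 3 in Section 9, Part B4.
describe s9b4q03*
list s9b4q03*
```

The same pattern works with most commands that take a variable list, so summarize s9b4q03* would also be accepted.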


SECTION 2: INTRODUCTION TO STATA

When you open Stata, you will see a menu bar across the top, a tool bar with buttons, and 3-5 windows (the number of windows depends on which windows were open the last time Stata was used). Each is described briefly below.

1. Menu bar

The menu bar has lists of commands that can be opened by clicking on a word. Below we provide a quick description of the different options. If you use Stata a lot, you probably will not use the menu bar often because the most common tasks can be done with the buttons on the tool bar and keystrokes.

File
  Open                 Open data file
  View                 View data file (only in Stata 7)
  Save                 Save data file
  Save as              Save data file under new name
  File name            Select data file name to put in command
  Log                  Open, close, review, or convert log file
  Save graph           Save file with graph
  Print graph          Print graph
  Print results        Print contents of current window (only in Stata 7)
  Exit                 Leave Stata

Edit
  Copy text            Copy marked text (Control-C can also be used to copy)
  Copy tables          Copy tables to insert in spreadsheet or word processor
  Paste                Insert something previously copied (Control-V will also paste)
  Table copy options   Options for how tables are copied
  Graph copy options   Options for how graphs are copied (not in Stata 7)

Prefs                  Various options for setting preferences. For example, you can save a particular layout of the different Stata windows or change the colors used in Stata windows.

Window
  Results              Bring output window to front
  Graph                Bring graph window to front
  Log                  Bring log window to front
  Viewer               Open help window (only in Stata 7)
  Command              Bring command window to front
  Review               Bring list of recent commands to front
  Variables            Bring list of variables to front
  Help/search          Open help window (not in Stata 7)
  Data editor          Open window to look at data
  Do-file editor       Open window to write a new program (Do file) or edit an existing one

Help
  Contents             Information on Stata organized by topic
  Search               Search for information on a certain topic
  Stata command        Search for information on a certain Stata command
  What's new           Differences between different versions of Stata
  Other options allow you to access web sites with Stata news and information.


2. Tool bar

The buttons on the tool bar are designed to make it easier to carry out the most common tasks. The left column describes the button on the tool bar, while the right column tells what the button does.

Open folder                 Use data file
Diskette                    Save data file in memory to disk
Printer                     Print contents of current window
Scroll with traffic light   Open, close, or view log file
Scroll without light        Bring log window to front (not in Stata 7)
Eye                         Open window with help on using Stata (only in Stata 7)
Box 1                       Bring Dialog Window to front
Box 2                       Bring Results Window to front
Box 3                       Bring Graph Window to front
Envelope                    Open window to write a new program (Do file) or edit an existing one
Table                       Open window to view and edit data
Table and circle            Open window to view data
Go                          Turn off More
X                           Stop processing

3. Stata windows

The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Window pull-down menu or the buttons on the tool bar. These are the Stata windows:

Stata Results          To see recent commands and output
Stata Command          To enter a command
Stata Browser          To view the data file (needs to be opened)
Stata Editor           To edit the data file (needs to be opened)
Stata Viewer           To get help on how to use Stata
Variables              To see a list of variables
Review                 To see recent commands
Stata Do-file Editor   To write or edit a program (needs to be opened)

Each is described in more detail below.

Stata Results

This window (with the black background) shows all recent commands, output, error messages, and help information. In Stata 7, the text is color-coded as follows:

white    Stata commands
green    General information and the frame and headings of output tables
blue     Commands or error messages that can be clicked on for more information (in Stata 7 only)
yellow   Numbers in output tables
red      Error messages

The slide bar on the right side can be used to look at earlier results that are not on the screen. However, unlike SPSS, the Stata results window does not keep all output generated. It will keep about 300-600 lines of the most recent output, deleting earlier output. If you want to store output in a file, you must use the log command.
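As a brief illustration of the log command mentioned above, the sketch below opens a text log, produces one line of output, and closes the log. The file name session1.log is only an example; the file is written to the current folder, and the command will fail if another log is already open.

```stata
* Sketch: keep a permanent record of commands and output in a text file.
* "session1.log" is an illustrative name, not a file used in the course.
log using session1.log, text replace
display "This line appears in the Results window and in the log file"
log close
```

The replace option overwrites an earlier log with the same name; use append instead to add to the end of an existing log.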


Stata Command

This window (small with a white background) allows you to enter commands, which will be executed as soon as you press the Return key. You can also reuse recent commands with the PageUp key (to go to the previous command) and the PageDown key (to go to the next command).

Stata Browser

This window shows all the data in memory. The Stata Browser does not appear automatically when you start Stata. To open the Browser, click on the button with a table and magnifying glass or type browse in the Stata Command window. Unlike SPSS, when the Stata Browser is open, you cannot execute any commands, either from the Stata Command window or from the Do-file Editor. You also cannot change any of the data. You can, however, sort the data or hide certain variables using buttons at the top of the Stata Browser window.

Stata Editor

This window is exactly like the Stata Browser window except that you can change the data. We do not recommend using this window because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved.

Stata Viewer

This window provides help on Stata commands and rules. To open the Stata Viewer window, you can click on Window/Viewer or click on the eye button on the tool bar. To use the Stata Viewer window, type a command in the space at the top and the Viewer will give you the purpose and rules for using that command, along with some examples. Any blue text in the Viewer can be clicked on for more information about that command.

Variables

This window (tall with a white background) lists all the variables that exist in memory. When you open a Stata data file, it lists the variables in the file. If you create new variables, they will be added to the list of variables. If you delete variables, they will be removed from the list. You can insert a variable into the Stata Command window by clicking on it in the Variables window.

Review

This window (with a white background) lists all the recent commands. If you click on one of the commands, it appears in the Stata Command window and can be executed by pressing the Return key. The slide bar can be used to view earlier commands.

Do-file Editor

This window allows you to write, edit, save, and execute a Stata program. A Stata program (or Do-file) is simply a set of Stata commands written by the user. The advantage of using the Do-file Editor rather than the Stata Command window is that the Do-file allows you to save, revise, and rerun a set of commands. Exploratory analysis of the data can be done with the Stata Command window, but any serious data analysis should be carried out using the Do-file Editor. The Do-file Editor can be opened by clicking on Window/Do-file Editor or by clicking on the envelope button.

With so many windows, it is sometimes difficult to fit them all on the screen. You can adjust the size and position of each window the way you like it and then save the layout by clicking on Prefs/Save Windowing Preferences. Each time you open Stata, the windows will be arranged according to your preferred layout.

Table 2 (below) provides a list of Stata commands that will be introduced in Module 1:


Table 2. Stata commands and topics covered in Module 1

3. Exploring data: clear, use, describe, list, summarize, tabulate, tab1, tab2, save, help, by prefix, if suffix, in suffix, set more, set mem, set scrollbufsize

4. Storing commands and output: Stata Do-file editor, log, exporting tables

5. Creating new variables: gen, replace, operators, functions, recode, tab , generate, xtile

6. Making tables: labeling data, #delimit, tabulate, summarize, tabstat, table, using weights

7. Graphs: graph, histogram, scatterplot, bar, xlabel, ylabel, connect( ), symbol( )

8. Modifying files: drop, drop if, keep, keep if, sort, compress, collapse, merge, append, fillin, reshape

9. Programming: creating and using macros, creating and using loops, matrix algebra

10. Regression analysis: regress, test, testparm, predict, probit, ovtest, hettest


SECTION 3: EXPLORING DATA FILES

This section covers commands that are used for preliminary exploration of data in a file. The following commands and topics are described: clear, use, describe, list, summarize, tabulate, by prefix, if suffix, in suffix, save, help, set mem, set more, set scrollbufsize.

clear

The clear command deletes all data, variables, and labels from memory to get ready to use a new data file. You can clear memory using the clear command or by using the clear option of the use command (see the use command). This command does not delete any data saved to the hard drive.

use

This command opens an existing Stata data file. It is equivalent to get in SPSS. The syntax is:

use filename [, clear ]                                       opens new file
use [varlist] [if exp] [in range] using filename [, clear ]   opens selected parts of file

If there is no extension, Stata assumes it is .dta. If there is no path, Stata assumes the file is in the current folder. You can use a path name such as:

use d:\data\scr02a

If the path name has spaces, you must use double quotes:

use "d:\my data\scr02a"

You can open selected variables of a file using a variable list. You can open selected records of a file using if or in.

Here are some examples of the use command:

use hhexp98n                          opens the file hhexp98n.dta for analysis
use hhexp98n if reg7 == 1             opens data from one region
use hhexp98n in 5/25                  opens records 5 through 25 of the file
use househol age sex using hhexp98n   opens 3 variables from the hhexp98n file
use d:\data\VLSS\scr01a2              opens the file scr01a2.dta in the specified folder
use "d:\data files\VLSS 98\scr01a2"   use quotation marks if there are spaces
use scr01a2, clear                    clears memory before opening the new file
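The forms of use shown above can be tried without the VLSS files. The sketch below is an illustration under stated assumptions: it types in a small household file (the values of data set A from Section 1), saves it under a temporary name, and reopens only some variables and records. Because it relies on tempfile and local macros, it should be run as a Do-file rather than line by line.

```stata
* Sketch: save a small file, then reopen only part of it.
clear
input hhid reg hhsize distwat
3456 1 5 1.5
3457 1 5 0.4
3458 1 4 0.6
3459 2 2 5.1
3460 3 8 1.2
end
* tempfile reserves a temporary file name that Stata erases at the end of the run.
tempfile demo
save "`demo'"

* Open three of the four variables, and only the records from region 1.
use hhid reg hhsize if reg == 1 using "`demo'", clear
describe
list
* Three households in the small file are in region 1.
assert _N == 3
```

Loading only the variables and records you need is most useful with the larger VLSS files, where it saves memory.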


describe

This command provides a brief description of the data file. You can abbreviate it to des. The output includes:

- the number of variables
- the number of observations (records)
- the size of the file
- the list of variables and their characteristics

Example 1: Using describe to show information about a data file


. use hhexp98n
. describe

Contains data from hhexp98n.dta
  obs:         5,999
 vars:            67                          6 Jan 2000 08:43
 size:     1,553,741 (98.5% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
househol        long   %12.0g                 household code
year            float  %9.0g                  Year of interview
month           float  %9.0g                  Month of interview
vlssmphs        byte   %8.0g                  1 if vlss, 2 if mphs source
sex             byte   %8.0g                  Gender of HH.head (1:M;2:F)
age             int    %8.0g                  Age of household head
agegroup        byte   %8.0g       agegroup   age group of HH.head
comped98        float  %9.0g       diploma    completed diploma HH.head
educyr98        float  %9.0g                  schooling year of HH.head
farm            float  %9.0g       loaiho     Type of HH (1:farm; 0:nonfarm)
urban98         byte   %8.0g       urban      1:urban 98; 0:rural 98
urban92         float  %9.0g       urban      1:urban92; 0:rural92
province        float  %9.0g                  Province code
reg7            int    %8.0g                  Code by 7 regions
reg8            int    %8.0g                  Code by 8 regions
reg10           int    %8.0g                  Code by 10 regions
hhsize          long   %12.0g                 Household size
hhcat           float  %9.0g                  hhsize categories
wt              int    %8.0g                  sample weight
hhsizewt        float  %9.0g                  =hhsize*wt
vill            float  %9.0g                  village code
[output truncated here]

It also provides the following information on each variable in the data file:

- the variable name
- the storage type: byte is used for binary variables, int is used for integers, and float is used for continuous variables that may have decimals. To see the limits on each storage type, type help datatypes
- the display format, which indicates how the variable will appear in the output
- the value label, which is the name of a set of labels for different values
- the variable label, which is a name for the variable that is used in output

Example 1 gives the description of the summary file from the VLSS called hhexp98n.


list

This command lists the values of variables in the data set. It is similar to list in SPSS. The syntax is:

list [varlist] [if exp] [in range]

With varlist, you can specify which variables will be presented. If no list is specified, all variables will be listed. With if and in, you can specify which records will be listed. Here are some examples:

. list                          lists entire dataset
. list in 1/10                  lists observations 1 through 10
. list househol reg7            lists selected variables
. list househol age in 1/20     lists observations 1-20 for selected variables
. list if reg7 < 6              lists cases in regions 1 through 5

Example 2: Using list to look at data

. use hhexp98n
. list househol urban98 reg8 in 1/10

       househol   urban98   reg8
  1.        101     Urban      1
  2.        103     Urban      1
  3.        105     Urban      1
  4.        107     Urban      1
  5.        108     Urban      1
  6.        109     Urban      1
  7.        110     Urban      1
  8.        111     Urban      1
  9.        112     Urban      1
 10.        113     Urban      1

. list househol reg8 vill if vill==32

       househol   reg8   vill
482.       3201      7     32
483.       3203      7     32
484.       3205      7     32
485.       3206      7     32
486.       3207      7     32
487.       3208      7     32
488.       3215      7     32
489.       3216      7     32
490.       3218      7     32
491.       3221      7     32
492.       3222      7     32
493.       3223      7     32
494.       3224      7     32
495.       3225      7     32
496.       3226      7     32
497.       3227      7     32

If you are not careful with list, you will get a lot more output than you want. If Stata starts giving you more output than you want, use the stop button (the red button with an X).


summarize

The summarize command produces statistics on continuous variables such as age, income, and farm size. This is like the descriptives command in SPSS. The syntax looks like this:

summarize [varlist] [if exp] [in range] [, [detail]]

By default, it produces the following statistics:

- Number of observations
- Average (or mean)
- Standard deviation
- Minimum
- Maximum

If you specify detail, Stata gives you additional statistics, such as skewness, kurtosis, the four smallest values, the four largest values, and various percentiles.

Here are some examples:

. summarize                           gives statistics on all variables
. summarize age income                gives statistics on selected variables
. summarize age income if reg8==3     gives statistics on two variables for one region

Example 3. Using summarize to study continuous variables

. sum age educyr98 food ricexpd

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         age |    5999    48.01284    13.7702         16         95
    educyr98 |    5999    7.094419   4.416092          0         22
        food |    5999    7272.777   4634.887   542.1666   85499.25
     ricexpd |    5999    2267.346   1140.367          0       9792

. sum age educyr98 food ricexpd if reg8==3

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         age |     128    42.07031   11.38184         24         79
    educyr98 |     128    8.609375   3.224499          0         17
        food |     128    5290.059   1756.087       1795   13022.08
     ricexpd |     128     2735.59   1081.493        747       7344

The first example gives the statistics for the whole sample, while the second gives the statistics only for households in Region 3, the Red River Delta. Notice that household heads in the Red River Delta are somewhat younger (42.1 years compared with 48.0) but have more years of schooling (8.6 compared with 7.1) than the national averages.
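After summarize runs, Stata keeps the statistics as stored results that can be used in later commands. The sketch below uses a five-record illustration (the household sizes from data set A in Section 1) rather than the VLSS file; r(N), r(mean), and r(p50) are standard stored results of summarize, and the local macro names are arbitrary.

```stata
* Sketch: summarize a small variable and reuse its stored results.
clear
input hhsize
5
5
4
2
8
end
summarize hhsize
* Copy the stored results right away, before another command replaces them.
local n = r(N)
local avg = r(mean)
display "observations: `n'   mean: `avg'"
assert `n' == 5
assert abs(`avg' - 4.8) < 1e-6

* detail adds percentiles, skewness, and kurtosis; r(p50) is the median.
summarize hhsize, detail
display "median household size: " r(p50)
```

Typing return list after summarize shows every stored result that is available.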


tabulate, tab1, tab2

These are three related commands that produce frequency tables for discrete variables. They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable). These commands are similar to the frequencies and crosstabs commands in SPSS. How do they differ?

tabulate or tab   produces a frequency table for one or two variables
tab1              produces a one-way frequency table for each variable in the variable list
tab2              produces all possible two-variable tables from the list of variables

You can use several options with these commands:

all      gives all the tests of association for two-way tables
cell     gives the overall percentage for two-way tables
column   gives column percentages for two-way tables
row      gives row percentages for two-way tables
nofreq   suppresses printing the frequencies
chi2     provides the chi-squared test for two-way tables

There are many other options, including other statistical tests. For more information, type help tabulate. Some examples of the tabulate commands are:

. tabulate reg7                      produces a table of frequencies by region
. tabulate reg8 sex                  produces a cross-tab of frequencies by region and sex
. tabulate reg8 sex, row             produces a cross-tab by region and sex with row percentages
. tabulate reg8 sex, cell nofreq     produces a cross-tab of overall percentages by region and sex
. tab1 reg8 sex ethnic               produces three tables, a frequency table for each variable
. tab2 reg8 sex urban98              produces three tables, a cross-tab of each pair of variables


Example 4. Using tabulate on categorical variables


. tab farm

 Type of HH |
   (1:farm; |
 0:nonfarm) |      Freq.     Percent        Cum.
------------+-----------------------------------
   non farm |       2561       42.69       42.69
       farm |       3438       57.31      100.00
------------+-----------------------------------
      Total |       5999      100.00

. tab sex farm

 Gender of | Type of HH (1:farm;
   HH.head |      0:nonfarm)
 (1:M;2:F) |  non farm       farm |     Total
-----------+----------------------+----------
         1 |      1673       2702 |      4375
         2 |       888        736 |      1624
-----------+----------------------+----------
     Total |      2561       3438 |      5999

. tab sex farm, row col chi2

 Gender of | Type of HH (1:farm;
   HH.head |      0:nonfarm)
 (1:M;2:F) |  non farm       farm |     Total
-----------+----------------------+----------
         1 |      1673       2702 |      4375
           |     38.24      61.76 |    100.00
           |     65.33      78.59 |     72.93
-----------+----------------------+----------
         2 |       888        736 |      1624
           |     54.68      45.32 |    100.00
           |     34.67      21.41 |     27.07
-----------+----------------------+----------
     Total |      2561       3438 |      5999
           |     42.69      57.31 |    100.00
           |    100.00     100.00 |    100.00

          Pearson chi2(1) = 130.8340   Pr = 0.000
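If the data file is not at hand, the last table of Example 4 can be rebuilt from its four cell counts with tabi, the immediate form of tabulate. This is only a sketch to show how the row percentages, column percentages, and chi-squared test relate to the counts; it should reproduce the figures printed above, but it does not replace running tabulate on the actual data.

```stata
* Sketch: type in the four cell counts of Example 4 instead of opening the file.
* Rows are the sex of the household head (1, 2); columns are non-farm and farm.
tabi 1673 2702 \ 888 736, row column chi2

* The first row percentage can also be checked by hand: 1673 of 4375 heads.
display 100 * 1673 / 4375
```

In the tabi syntax, the backslash separates the rows of the table.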


Notes on the tabulate output:

- In one-way tables, Stata gives the count, the percentage, and the cumulative percentage (see first example in box).
- In two-way tables, Stata gives the count only, unless you ask for other statistics (see second example in box).
- col, row, and cell request Stata to include percentages in two-way tables.

by

This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The data must first be sorted by that variable. There is no equivalent command in SPSS. The general syntax is:

by varlist: command

Some examples of the by prefix are:

by sex: sum hhsize        for each sex of head of household, give stats on household size
by reg8: tab urban98      for each region, give the frequency table of urban/rural

Example 5. Using the by prefix


. sort sex
. by sex: sum hhsize

_______________________________________________________________________________
-> sex = 1

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize |    4375    5.058286   1.852724          1         16

_______________________________________________________________________________
-> sex = 2

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize |    1624    3.927956   1.982762          1         19
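Example 5 sorts the data before using by because the prefix only works on sorted data. The sketch below, on a small invented data set, shows the two-step form and the bysort shortcut, which sorts and repeats the command in one line.

```stata
* Sketch with made-up values: household size by sex of the household head.
clear
input sex hhsize
1 5
2 3
1 6
2 4
1 4
end
* Two steps: sort first, then repeat the command for each value of sex.
sort sex
by sex: summarize hhsize

* One step: bysort sorts the data and then applies the by prefix.
bysort sex: summarize hhsize
```

Both forms give the same output; bysort is simply shorter when the data are not yet sorted.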

save

This command saves the data in memory. It is equivalent to save outfile in SPSS. The syntax is:

save [filename] [, replace]

If you do not give a file name, it will use the current name. You cannot write over an old file unless you specify replace (unlike in SPSS).
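A minimal sketch of the replace rule follows. The data and the file name mydata_example are invented for the illustration; the file is written to the current working folder.

```stata
* Made-up data for illustration
clear
input hhid income
1 12000
2 8500
end

* With replace, the command works whether or not the file already exists
save mydata_example, replace

* Change the data, then try to save again without replace
generate income2 = income*2

* capture noisily shows the error message but lets the commands continue
capture noisily save mydata_example
display "return code without replace: " _rc

* Adding replace overwrites the old file
save mydata_example, replace
```

The return code shown by display is non-zero because Stata refused to write over the existing file.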


if

We have already seen several examples of using if to select certain records in carrying out a command. This is similar to the process if command in SPSS, except that in Stata it is not considered a separate command. The syntax is:

command if exp

Examples include:

. list hhid region income if income>12000          lists data if income is above 12000
. tab region if income>10000 & income<20000        frequency table of region if income is in range
. summarize income if region==1 | region==2        statistics on income for regions 1 and 2
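One point to watch when using if with "greater than" conditions is that Stata treats a missing value as larger than any number. The sketch below uses five invented records (not VLSS data) to show the effect and one way to exclude the missing value.

```stata
* Made-up data: household 3 has missing income
clear
input hhid region income
1 1 15000
2 1 9000
3 2 .
4 2 18000
5 3 11000
end

* Missing income counts as larger than any number, so household 3 is listed here
list hhid region income if income>12000

* Adding a second condition leaves out the missing value
list hhid region income if income>12000 & income!=.

* count reports how many records satisfy a condition
count if region==1 | region==2
```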

Note that if statements always use ==, not a single =, to test for equality. Also note that | indicates or while & indicates and.

in

We have also used in to select records based on the case number. The syntax is:

command in range

For example:

. list in 10              list observation number 10
. summarize in 10/20      summarize observations 10-20

help

The help command gives you information about any Stata command or topic. The syntax is:

help [command]

For example,

. help tabulate           gives a description of the tabulate command
. help summarize          gives a description of the summarize command

set

The set command is used to control the Stata operating environment. There are 22 set commands, but many of them are rarely used. Some of the more common ones are:

set mem XXm  sets memory for Stata at XX megabytes. If you get the error message No room to add more observations, this means the datafile is too big for the memory allocated to Stata. This command increases the memory allocated to Stata. You cannot set XX greater than the RAM memory in the computer.

set more off/on  controls whether output scrolls continuously. With set more off, output scrolls without stopping; with set more on, Stata pauses each time the Results window fills. Use set more off if you are not interested in the intermediate output, only the final result. Use set more on if you need to be able to read the early output. Remember that the Results Window only stores the most recent 300-600 lines of output. Unlike SPSS, Stata does not automatically store all of your output.


set scrollbufsize XX  is used to change the amount of output that Stata will store. XX is expressed in bytes. The default is 32,000 (32k) and the maximum is 500,000 (500k).

Type help set for a list of other settings in Stata.

Exercises for exploring the VLSS

This section includes some questions that you can answer using the VLSS files provided on your computer and the commands described in this section. (Note: to get the correct answers, we should use the sample weights. The weights compensate for the fact that some types of households are over-represented in the VLSS sample and others are under-represented. For example, urban households make up 29 percent of the sample, but only 24 percent of the population. Sampling weights are described in Section 6.)

Remember two tricks to make it easier to fix your mistakes:

You can use PageUp to retrieve the most recent command.
You can click on a variable in the Variable window to paste it into the Command window.

Summary file

The file hhexp98n contains summary variables calculated from various other data files. It is at the household level. Open the file by entering use hhexp98n in the Command window and pressing Return.

1. How many variables and how many records are in hhexp98n? (Answer: describe)
2. What percentage of households have female heads? (Answer: tab sex)
3. Is there a statistically significant difference between the percentage of female-headed households in urban and rural areas? (use the chi2 option)
4. What percentage of urban households are considered farm households? (use if urban98==1)
5. What percentage of farm households are in urban areas?
6. How does the percentage of female-headed households vary by region?
7. What is the average size of a household?
8. What is the average size of an urban household in the Red River Delta? (reg8=1 refers to RRD)
9. How does household size vary across expenditure quintiles? (use quint98b for quintiles; you will need to sort and then use by)

Household members

The file scr01a2 contains information about each member of the household. It is at the individual level (each record is a person). You can answer the following questions using this file:

1. What percentage of the population is female? (Answer: tab s1aq02)
2. What percentage of the population over 80 years old is female? (use tab if ..)
3. What percentage of the population under 5 is female?


4. What percentage of women are married?
5. What percentage of the women over the age of 20 are married?
6. Does this percentage vary between urban and rural areas?
7. What percentage of the spouses of family members live in the household?
8. Is the percentage of spouses away greater for men or for women?

Housing characteristics

The file scr06b1 contains information about the characteristics of houses. Open the file and use des to obtain a list of variables. The following questions can be answered from this file.

1) What is the average value of the house, according to the respondent? (Answer: sum s6bq12)
2) What are the most important sources of water? (use tab)
3) Of those households that think their water is safe before boiling, what percentage boil their water before drinking?
4) What is the average value of the house among those who get their water from an inside private tap?
5) What is the average value of the house among those who get their water from a hand-dug well?
6) What is the average value of the house for each type of source of drinking water? (you will need to sort by drinking water type and then use the by prefix)

Food crops

The file scr09b2 contains information on production of food crops other than rice. The data are at the crop level, meaning that each record represents one crop for one household. Only crops that are grown by each household are included in the file. The crop codes are in the questionnaire on pages before and after the questions. You can answer the following questions with this file.

1. How many households in the sample grow maize? (Answer: tab s9b2cc)
2. Among maize growers, what was the average area with maize? (Answer: sum s9b2q03 if s9b2cc==8)
3. Among maize growers, what was the average amount of maize harvested, sold, and given to livestock?
4. Among farmers with more than 1 hectare of maize, what was the average amount of maize harvested, sold, and given to livestock? (you will need an if statement that selects both for maize and for area greater than 10,000 m2)
5. What is the average amount harvested and sold for each food crop other than rice? (you will need to sort and use by s9b2cc)
6. Farmers were asked what percentage of the normal harvest they got this year, so 100% means normal. What was the average response?
7. How large are the post-harvest losses in maize relative to the size of the harvest? In tomatoes?


SECTION 4: STORING COMMANDS AND OUTPUT

In this section, we discuss how to store commands and output for later use. First, we describe how to store commands in a program (Stata calls it a Do-file), how to edit the program, and how to run it. Second, we present different ways of saving and using the output generated by Stata. The following topics are covered:

using the Do-file Editor
log using
log off
log on
log close
set logtype
moving tables from Stata to Word and Excel

Using the Do-file Editor

As mentioned in Section 2, the Do-file Editor allows you to store a program (a set of commands) so that you can edit it and execute it later. Why use the Do-file Editor?

It makes it easier to check and fix errors,
It allows you to run the commands later,
It lets you show others how you got your result, and
It allows you to collaborate with others on the analysis.

In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a Do-file to store the commands. To open the Do-file Editor, you can click on Windows/Do-file Editor or click on the envelope on the Tool Bar. Within the Do-file Editor, there is a menu bar and tool bar buttons to carry out a variety of editing functions. The menu bar is similar to the one in Microsoft Word:

File/New            to open a new, blank Do-file
File/Open           to open an existing Do-file
File/Save           to save the current Do-file
File/Save as        to save the current Do-file under a new name
File/Insert file    to insert another file into the current one
File/Print          to print the Do-file
File/Close          to close the Do-file
Edit/Undo           to undo the last command
Edit/Cut            to delete or move the marked text in the Do-file
Edit/Copy           to copy the marked text in the Do-file
Edit/Paste          to insert the copied or cut text into the Do-file
Search/Find         to find a word or phrase in the Do-file
Search/Replace      to find and replace a word or phrase in the Do-file
Tools/Do            to execute all the commands or the marked commands in the Do-file
Tools/Run           to execute all the commands or the marked commands in the Do-file without showing any output in the Stata Results window

The tool bar buttons can be used to carry out some of these tasks more quickly. For example, there are buttons for File/New, File/Open, File/Print, Search/Find, Edit/Cut, Edit/Copy, Edit/Paste, Edit/Undo, Do, and Run. Probably the button you will use most is the second-to-last one that shows a page with text on it. This is the Do button for executing the program or the marked part of the program.

Finally, the keyboard commands may be even quicker to use than the buttons. The most useful keyboard commands are:

Control-O    Open file
Control-S    Save file
Control-C    Copy
Control-X    Cut
Control-V    Paste
Control-Z    Undo
Control-F    Find
Control-H    Find and Replace

To run the commands in a Do-file, you can click on the Do button (the second-to-last one) or click on Tools/Do. If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button. You do not have to mark the whole command, but at least one character in the command must be marked in order for the command to be executed (unlike SPSS, it is not enough to have the cursor on a command).

Although layout is a matter of personal preference, it may be useful to have the Stata Results window and the other windows on one side of the screen and the Do-file Editor window on the other. This makes it easy to switch back and forth. When you have arranged the windows the way you like, you can save the layout by clicking Prefs/Save Windowing Preferences. Each time you open Stata, it will use your chosen layout.

Saving the output

As mentioned in Section 2, the Stata Results window does not keep all the output you generate. It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results. You can increase the amount of memory allocated to the Stata Results window (see set scrollbufsize in Section 3), but even this will probably not be enough for a long session with Stata. Thus, we need to use log to save the output. There are four ways to control the log operations.

1. You can use the log button on the tool bar. It looks like a scroll.
2. You can click on File/Log to get four options: Begin (log using), Close, Suspend (log off), and Resume (log on).
3. You can use log commands in the Stata Command window.
4. You can use log commands in the Stata Do-file Editor.

In this section, we describe the commands, which can be used in the Stata Command window or in a do-file (program).

log using

This command creates a file with a copy of all the commands and output from Stata. The first time you open a log, you must give a name to the new file to be created. The syntax is:

log using filename [, append replace [ text | smcl ] ]

where filename is the name you give the new file. The options are:


append     adds the output to an existing file
replace    replaces an existing file with the output
text       tells Stata to create the log file in text (ASCII) format
smcl       tells Stata to create the log file in SMCL format

Here are some examples:

log using temp22                          saves output to a file called temp22
log using temp20, replace                 saves output to an existing file, temp20, replacing its contents
log using regoutput, append               saves output to an existing file, regoutput, adding to its contents
log using "d:\my data\myfile.txt"         saves output in the specified file in the specified folder (the quotation marks are needed because the folder name contains a space)

Several points should be remembered in using this command:

If you use an existing file name but do not say replace or append, Stata will give an error message that the file already exists.
Log files in text format can be opened with Wordpad, Notepad, the DOS editor, or any word processor, but the file does not have any formatting.
SMCL files have formatting (bold, colors, etc) but can only be opened with Stata.
SMCL format is the default.

log off

This command temporarily turns off the logging of output, so that any subsequent output is not copied to the log file. This is useful if you want to save some of the output but not all. Log off only works after a log using command.

log on

This command is used to restart the logging, copying any new output to the log file that was already defined. Log on only works after a log using and a log off command.

log close

This command is used to turn off the logging and save the file. How are log off and log close different? Log off allows you to turn logging back on easily with log on, continuing to use the same log file. After a log close, however, the only way to start logging again is with log using.

set logtype text

This command tells Stata to always save the log files in text (ASCII) format. It is the same as adding the text option to every log using command, but it is easier. If you prefer text format log files (as I do), this is the best way to make sure all the log files are in this format.

set logtype smcl

This command tells Stata to always save log files in SMCL format. It is the same as adding the smcl option to every log using command.

Example 6 shows how the log command can be used. First, the log is opened using the filename temp1. Since I did not specify a folder, it saved the file to the default folder, which (in this case) was my desktop. The results from tab urban98 are saved in the log file. Then the log is turned off, so the results of sum hhsize are not logged. Third, the log is turned on, so the results from sum age are logged. Finally, the log is closed.


Example 6: Using log to save output


. log using temp1, text
----------------------------------------------------------------------------
       log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
  log type:  text
 opened on:  2 Aug 2002, 12:58:52

. tab urban98

 1:urban 98; |
  0:rural 98 |      Freq.     Percent        Cum.
-------------+-----------------------------------
       Rural |       4269       71.16       71.16
       Urban |       1730       28.84      100.00
-------------+-----------------------------------
       Total |       5999      100.00

. log off
       log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
  log type:  text
 paused on:  2 Aug 2002, 12:59:26
----------------------------------------------------------------------------

. sum hhsize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize |    5999    4.752292   1.954292          1         19

. log on
----------------------------------------------------------------------------
       log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
  log type:  text
resumed on:  2 Aug 2002, 12:59:48

. sum age

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         age |    5999    48.01284    13.7702         16         95

. log close
       log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
  log type:  text
 closed on:  2 Aug 2002, 13:00:00
----------------------------------------------------------------------------
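The same log commands can be placed in a Do-file so that the output is saved every time the program runs. The following is a minimal sketch, not a required layout: the file name session_example and the three records are invented, and the capture log close line at the top is one common way to avoid an error if a log was left open by an earlier run.

```stata
* Close any log left open by an earlier run; capture hides the error if none is open
capture log close

* replace lets the Do-file be run repeatedly; text keeps the log readable in any editor
log using session_example, replace text

* Made-up data for illustration
clear
input urban hhsize
1 4
0 6
0 5
end

tabulate urban

* Output between log off and log on is not written to the file
log off
summarize hhsize
log on

summarize hhsize if urban==0
log close
```

After running this, the file session_example.log in the working folder should contain the tabulate table and the second summarize table, but not the first.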

Using the output

The easiest way to look at a log file is with File/Log/View, but there are several other ways to do it. You can:

type view [filename] in the Stata Command window
click on the Viewer button (it looks like an eye) and type view [filename]
if it is in text format, open the Stata Do-file Editor (Windows/Do-file Editor) and open the log file with the Editor (File/Open)
if it is in text format, open WordPad (Start/Programs/Accessories/WordPad) and then open the log file in WordPad (File/Open)

To print output from the Stata Results window, you can click File/Print Results.

To print output from a log file,

1) Open the log file with Stata Viewer (File/Log/View)
2) Click on File/Print Viewer

Unfortunately, it is not easy to copy Stata output to other software such as word processors and spreadsheets. It is best to copy tables from the Stata Viewer or from the Stata Results window using Edit/Copy Table.

To move tables from a log file to an Excel table,

1) Open the log file with Stata Viewer (File/Log/View)
2) Copy the table with Edit/Copy Table or Control-Shift C
3) Paste the table into Excel

To move tables from a log file to a Word table,

1) Open the log file with Stata Viewer (File/Log/View)
2) Copy the table with Edit/Copy Table or Control-Shift C
3) Paste the table into Word with Control-V
4) Mark the table and then click Table/Insert/Table

To move tables from the Stata Results window to Word or Excel, follow the above procedures starting with step #2. However, one problem with these procedures is that there has to be a clear division between columns. If there is a heading that overlaps two columns, the two columns will be merged. To avoid this, you can exclude the heading when you copy the table.

Exercises for logging

1) Use the file hhexp98n and open a log file called results to save output. Then do a frequency table of region by urban. Close the log file.
2) Copy the table into Excel.
3) Copy the table into a Word table.


SECTION 5: CREATING NEW VARIABLES

In the previous sections, we described how to explore the data using existing variables. In this section, we discuss how to create new variables. When new variables are created, they are in memory and they will appear in the Data Browser, but they will not be saved on the hard-disk unless you use the save command. In this section, we will cover the following commands and options.

generate
replace
tab , generate
egen
operators
functions
recode
xtile

generate

This command is used to create a new variable. It is similar to compute in SPSS. The syntax is:

generate newvar = exp [if exp]

where exp is an expression like price*quant or 1000*kg. Several points about this command:

Unlike compute in SPSS, generate cannot be used to change the definition of an existing variable. If you want to change an existing variable, you need to use replace.
You can use gen as an abbreviation for generate.
If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true.
If you use if, the new variable will have missing values when the if statement is false.

For example,

generate age2 = age*age                 create age squared variable
gen yield = quant/area if area>0        create new yield variable if area is positive
gen price = value/quant if quant>0      create new price variable if quant is positive
gen highprice = (price>1000)            creates a dummy variable equal to 1 for high prices
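The effect of the if qualifier on generate is easiest to see on a few records. The sketch below uses three invented records and a cutoff of 250 chosen only for the illustration. It also adds a caution that is not in the list above: because Stata treats missing values as larger than any number, a dummy such as (price>250) would equal 1 when price is missing unless those records are excluded.

```stata
* Made-up data: the second record has zero area, the third has zero quantity
clear
input quant area value
500 2 100000
300 0 90000
0 1 0
end

* yield is missing for the second record because area is not positive
generate yield = quant/area if area>0

* price is missing for the third record because quant is zero
gen price = value/quant if quant>0

* The if qualifier keeps highprice missing where price is missing
gen highprice = (price>250) if price!=.

list
```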

replace

This command is used to change the definition of an existing variable. The syntax is the same:

replace oldvar = exp [if exp] [in exp]

Some points to remember:

Replace cannot be used to create a new variable. Stata will give an error message if the variable does not exist.
There is no abbreviation for replace. Stata wants to make sure you really want to change the variable.
If you use the if option, then the old values will be retained when the if statement is false.
You can use the period (.) to represent missing values.

For example,


replace price = avgprice if price > 100000      replaces high values with an average price
replace income = . if income<=0                 replaces zero or negative income with missing value
replace age = 25 in 1007                        replaces age with 25 in observation #1007

tabulate generate

This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable. The syntax is:

tabulate oldvariable, generate(newvariable)

The old variable is a categorical (or discrete) variable. The new variables will take the form newvariable1, newvariable2, newvariable3, etc. Newvariablex will be equal to 1 if oldvariable=x and 0 otherwise. It is easier to explain with an example. Reg8 is a variable that takes values of 1-8 for the different regions of Vietnam. We can create eight dummy variables as follows:

tab reg8, gen(region)

This creates 8 new variables:

region1 = 1 if reg8=1 and 0 otherwise
region2 = 1 if reg8=2 and 0 otherwise
...
region8 = 1 if reg8=8 and 0 otherwise

In Example 7, notice that there are 1175 households in region 1 (Red River Delta) and the same number of households for which region1=1.
Example 7. Using tab, gen to create dummy variables
. tab reg8, gen(region)

    Code by 8 |
      regions |      Freq.     Percent        Cum.
--------------+-----------------------------------
            1 |       1175       19.59       19.59
            2 |        731       12.19       31.77
            3 |        128        2.13       33.91
            4 |        708       11.80       45.71
            5 |        628       10.47       56.18
            6 |        276        4.60       60.78
            7 |       1241       20.69       81.46
            8 |       1112       18.54      100.00
--------------+-----------------------------------
        Total |       5999      100.00

. tab region1

      reg8== |
      1.0000 |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |       4824       80.41       80.41
           1 |       1175       19.59      100.00
-------------+-----------------------------------
       Total |       5999      100.00
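A quick way to check a set of dummy variables created this way is to confirm that exactly one of them equals 1 for every record. The sketch below does this on four invented records with three region codes; the variable name check is introduced only for the illustration.

```stata
* Made-up data with three region codes
clear
input reg8
1
1
2
3
end

* Creates region1, region2, and region3, one dummy per value of reg8
tab reg8, gen(region)

* Every record should belong to exactly one region
gen check = region1 + region2 + region3
tab check

list
```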


egen

This is an extended version of generate to create a new variable by aggregating the existing data. It is a powerful and useful command that does not exist in SPSS. To do the same thing in SPSS, you would need to create a new file with aggregate and merge it with the original file using match files. The syntax is:

egen newvar = fcn(arguments) [if exp] [in range] , by(var)

where

newvar      is the new variable to be created
fcn         is one of numerous functions such as: count( ) max( ) min( ) mean( ) median( ) rank( ) sd( ) sum( )
argument    is normally just a variable
var         in the by() option must be a categorical variable

Suppose you want to estimate the demand for rice using household data. You calculate a price variable using household expenditure data, but some households do not buy rice. You can replace the missing values with provincial average prices. The first step is to calculate the provincial averages:

egen avgprice = mean(price), by(province)

Here are some other examples:

egen avg = mean(yield)                    creates variable of average yield over entire sample
egen avg2 = median(income), by(sex)       creates variable of median income for each sex
egen regprod = sum(prod), by(region)      creates variable of total production for each region

Example 8: Using egen to calculate averages

. egen avgexp = mean(rlpcex2), by(vill)

. gen aboveavg = (rlpcex2>avgexp)

. list househol vill rlpcex2 avgexp aboveavg in 40/50

       househol   vill    rlpcex2     avgexp   aboveavg
 40.        305      3   7858.862   6643.441          1
 41.        315      3   13006.72   6643.441          1
 42.        306      3   3787.546   6643.441          0
 43.        301      3    12084.1   6643.441          1
 44.        310      3   4785.421   6643.441          0
 45.        311      3   6666.962   6643.441          1
 46.        405      4   6583.107   8231.103          0
 47.        409      4   14452.78   8231.103          1
 48.        407      4    3549.75   8231.103          0
 49.        403      4   4145.278   8231.103          0
 50.        401      4   6454.877   8231.103          0
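The rice price case mentioned in the egen description needs a second step after the averages are calculated: a replace command that fills in the missing prices. The sketch below shows both steps on five invented records; the prices and province codes are made up for the illustration.

```stata
* Made-up data: two households have no rice price
clear
input province price
1 100
1 .
1 120
2 200
2 .
end

* Provincial mean of the non-missing prices, repeated on every record in the province
egen avgprice = mean(price), by(province)

* Fill in only the records where price is missing
replace price = avgprice if price==.

list
```

The mean() function ignores missing values, so each provincial average is based only on the households that reported a price.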


In Example 8, we want to know which households have per capita expenditure (rlpcex2) above the village average. First, we calculate the average expenditure for each village with the egen command. Then we create a dummy variable based on the expression (rlpcex2 > avgexp). The list output shows how the village average is repeated for every household in the village and confirms that the dummy variable is correctly calculated.

operators

This is not a Stata command, but a topic related to creating new variables. Most of the operators are obvious, but some are not. Unlike SPSS, you cannot use words like or, and, eq, or gt.

Arithmetic
+     addition
-     subtraction
*     multiplication
/     division
^     power

Relational
>     greater than
<     less than
>=    greater than or equal
<=    less than or equal
==    equal
~=    not equal
!=    not equal

Logical
~     not
|     or
&     and

The most difficult rule to remember is when to use = and when to use ==. Use a single equal symbol (=) when defining a variable. Use a double equal symbol (==) when you are testing an equality, such as in an if statement and when creating a dummy variable.

Here are some examples to illustrate the use of these operators. Suppose you want to create a dummy variable indicating households in the Red River Delta. One way is to write:

generate RRD = 0
replace RRD = 1 if reg8==1

Or you can get exactly the same result with just one command:

generate RRD = (reg8==1)

If the expression in parentheses is true, the value is set to 1. If it is false, the value is 0.

Logical operators are useful if you want to impose more than one condition. For example, suppose you want to create a dummy variable for farmers in the Red River Delta. In other words, a household must be both in the Red River Delta and be a farmer to be selected.


gen RRDfarm = 0
replace RRDfarm = 1 if reg8==1 & farm==1

or an easier way to do this would be:

gen RRDfarm = (reg8==1 & farm==1)

Or suppose you wanted to create a dummy variable for households in the two deltas. This means a household can be in the Red River Delta or it can be in the Mekong River Delta to be selected. This variable can be created with:

gen delta = 0
replace delta = 1 if reg8==1 | reg8==8

or by one command:

gen delta = (reg8==1 | reg8==8)

You can also combine conditions using parentheses. Suppose you wanted a dummy variable that indicates if a household is a poor farmer in one of the deltas. We will define poor as in the bottom 20 percent and use the variable quint98.

gen PDF = ((reg8==1 | reg8==8) & farm==1 & quint98==1)

functions

Again, this is not a command, but a topic that is related to creating new variables. Here is a list of some of the more commonly-used functions. Other functions can be found by typing help functions in the Stata Command window.

abs(x)          computes the absolute value of x.
exp(x)          calculates e to the x power.
ln(x)           computes the natural logarithm of x.
log(x)          is a synonym for ln(x), the natural logarithm.
log10(x)        computes the log base 10 of x.
sqrt(x)         computes the square root of x.
invnorm(p)      provides the inverse cumulative normal; invnorm(norm(z)) = z.
normden(z)      provides the standard normal density.
normden(z,s)    provides the normal density. normden(z,s) = normden(z)/s if s>0 and s not missing, otherwise, the result is missing.
norm(z)         provides the cumulative standard normal.
group(x)        creates a categorical variable that divides the data into x as nearly equal-sized subsamples as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data.
int(x)          gives the integer obtained by truncating x.
round(x,y)      gives x rounded into units of y.
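The combined condition and a few of the functions above can be tried together on a handful of records. This is only a sketch: the four households, the income values, and the new variable names (pdf, lninc, inc1000, inck) are invented for the illustration.

```stata
* Made-up data for illustration
clear
input reg8 farm quint98 income
1 1 1 1200
8 0 1 5400
3 1 2 800
8 1 1 2500
end

* Parentheses group the "or" condition before the "and" conditions are applied
gen pdf = ((reg8==1 | reg8==8) & farm==1 & quint98==1)

* A few of the functions listed above
gen lninc   = ln(income)
gen inc1000 = round(income, 1000)
gen inck    = int(income/1000)

list
```

Without the inner parentheses, Stata would evaluate the & conditions before the | condition, so every household with reg8==1 would be selected regardless of farm and quint98.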


recode

This command changes the values of a categorical variable according to the rules specified. It is like the recode command in SPSS except that in Stata you do not use parentheses. The syntax is:

recode varname old=new old=new [if exp] [in range]

Here are some examples:

recode x 1=2              changes all values of x=1 to x=2
recode x 1=2 3=4          changes 1 to 2 and 3 to 4
recode x 1=2 2=1          exchanges the values 1 and 2 in x
recode x 1=2 *=3          changes 1 in x to 2 and all other values to 3
recode x 1/5=2            changes 1 through 5 in x to 2
recode x 1 3 4 5 = 6      changes 1, 3, 4 and 5 to 6
recode x .=9              changes missing to 9
recode x 9=.              changes 9 to missing

Notice that you can use some special symbols in the rules:

*       means all other values
.       means missing values
x/y     means all values from x to y
x y     means x and y

In Example 9, we create a new variable that indicates whether a household lives in the north, center, or south of Vietnam, using the reg8 variable.
Example 9. Using recode to define a new variable
. tab reg8

    Code by 8 |
      regions |      Freq.     Percent        Cum.
--------------+-----------------------------------
            1 |       1175       19.59       19.59
            2 |        731       12.19       31.77
            3 |        128        2.13       33.91
            4 |        708       11.80       45.71
            5 |        628       10.47       56.18
            6 |        276        4.60       60.78
            7 |       1241       20.69       81.46
            8 |       1112       18.54      100.00
--------------+-----------------------------------
        Total |       5999      100.00

. gen reg3 = reg8

. recode reg3 1/3=1 4/6=2 7/8=3
(4824 changes made)

. tab reg3

        reg3 |      Freq.     Percent        Cum.
-------------+-----------------------------------
           1 |       2034       33.91       33.91
           2 |       1612       26.87       60.78
           3 |       2353       39.22      100.00
-------------+-----------------------------------
       Total |       5999      100.00
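Because recode changes a variable in place, Example 9 first copies reg8 into a new variable. The sketch below repeats that pattern on eight invented records and adds a cross-tabulation of the old variable against the new one, which is a simple way to check the rules.

```stata
* Made-up data: one record per region code
clear
input reg8
1
2
3
4
5
6
7
8
end

* Work on a copy so the original eight-region variable is kept
gen reg3 = reg8
recode reg3 1/3=1 4/6=2 7/8=3

* Cross-tabulate old against new to check every code went where intended
tab reg8 reg3
```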


xtile

This command creates a new variable that indicates which category a record falls into, when the sample is sorted by an existing variable and divided into n groups of equal size. It is probably easier to explain with examples. xtile can be used to create a variable that indicates which income quintile a household belongs to, which decile in terms of farm size, or which tercile in terms of coffee production. The syntax is:

xtile newvar = variable [if exp] [in range] , nq(#)

where

newvar      is the new categorical variable created
variable    is the existing variable used to create the quantile (e.g. income, farm size)
#           is the number of different categories (e.g. 5 for quintiles, 3 for terciles)

For example,

xtile incquint = income, nq(5)
xtile farmdec = farmsize, nq(10)
xtile coffeeter = coffarea, nq(3)

Suppose we want to create a variable indicating the tercile of rice expenditure per capita.

Example 10. Using xtile to create categories

. gen ricepc = ricexpd/hhsize

. xtile riceterc = ricepc, nq(3)

. tab riceterc

 3 quantiles |
   of ricepc |      Freq.     Percent        Cum.
-------------+-----------------------------------
           1 |       2000       33.34       33.34
           2 |       2003       33.39       66.73
           3 |       1996       33.27      100.00
-------------+-----------------------------------
       Total |       5999      100.00

. tab riceterc farm, col nof

         3 |  Type of HH (1:farm;
 quantiles |      0:nonfarm)
 of ricepc |  non farm       farm |     Total
-----------+----------------------+----------
         1 |     46.70      23.39 |     33.34
         2 |     31.47      34.82 |     33.39
         3 |     21.83      41.80 |     33.27
-----------+----------------------+----------
     Total |    100.00     100.00 |    100.00
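To see how xtile assigns categories, it helps to try it on a small set of values where the groups are obvious. The sketch below uses ten invented incomes; with nq(5), each quintile should contain two records.

```stata
* Made-up data: ten incomes in steps of 100
clear
input income
100
200
300
400
500
600
700
800
900
1000
end

* Divide the sample into five groups of equal size by income
xtile incquint = income, nq(5)
tab incquint

* Summarize income within each quintile to see the cutoffs
sort incquint
by incquint: summarize income
```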


Exercises for generating new variables

1) Use the file hhexp98n. Create a variable called reg2 which indicates whether a household is in the north or the south of Vietnam based on reg8. Then do a frequency table of the new variable.
2) Using the same file, create a variable called hhquint that indicates the quintile of household size. Then do a frequency table on the new variable.
3) Using the same file, create a dummy variable called rurfarm that is equal to 1 if the household is a rural farm household and 0 otherwise. Create another variable called upland that is 1 if the household is in the Northwest, Northeast, or Central Highlands.
4) Create a new variable avgexp which is equal to the regional average of expenditure (rlpcex2) (hint: use egen). Then calculate a new variable equal to the difference between the household expenditure and the regional average expenditure.
5) Use the file scr01a2. Create a variable hhsize which is equal to the total number of household members. (use egen)
6) Using the same file, create a new variable notmarry which is 1 if the person is single, divorced, or separated and 0 otherwise.
7) Create a set of dummy variables called relatxx based on the relationship of the person to the household head. For example, relat01 is a dummy for being the head, relat02 is a dummy for being the spouse, relat03 for a child, and so on. (hint: use tab with the gen option)


SECTION 6: MAKING TABLES TO DESCRIBE DATA

In Section 3, we described some basic commands for exploring data. In this section, we introduce some more powerful and flexible commands for generating results from survey data. We begin with an explanation of how to label data in Stata. Then we describe three commands for generating tables. Finally, we describe the use of sampling weights in analyzing survey data. These are the topics and commands covered in this section:

    label variable
    label define
    label values
    #delimit
    tabulate summarize
    tabstat
    table
    using weights

label variable

This command is used to attach labels to variables in order to make the output easier to understand. For example, we know that reg8 indicates the number of the region where a household lives and that rlpcex2 means real per capita expenditure. But other people using our tables may not know this. So we may want to label the variables as follows:

    label variable reg8 Region
    label variable rlpcex2 "Per capita expenditure"

- You can use the abbreviation label var.
- If there are spaces in the label, you must use double quotation marks. If there are no spaces, quotation marks are optional.
- This command is like variable label in SPSS, except that you can only label one variable per command and Stata uses double quotation marks, not single.
- The limit is 80 characters for a label, but any label over 30 characters will probably not look good in a table.

label define

This command gives a name to a set of value labels. For example, instead of numbering the regions, we can assign a label to each region. Instead of numbering the different sources of water, we can give them labels. The syntax is:

    label define lblname # "label" # "label" # label [, add modify]

where
    lblname    is the name given to the set of value labels
    #          are the value numbers
    label      are the value labels
    add        means that you want to add these value labels to the existing set
    modify     means that you want to change these values in the existing set

Note that:
- You can use the abbreviation label def.
- The double quotation marks are only necessary if there are spaces in the labels.
- Stata will not let you define an existing label unless you say modify or add.
- This command is similar to value label in SPSS, except that in Stata you give the labels a name and later attach it to the variable, while in SPSS you attach it to the variable in the same command.

label values

This command attaches a named set of value labels to a categorical variable. The syntax is:

    label values varname lblname

where
    varname    is the categorical variable which will get the labels
    lblname    is a set of labels that have already been defined by label define

Here are some examples of labeling values in Stata.

    . label variable yield "Yield (tons/hectare)"    gives label to variable yield
    . label define yesno 0 no 1 yes                  defines set of labels called yesno
    . label values electricity yesno                 attaches those labels to variable called electricity
    . label define yesno 3 "perhaps", add            adds new value label to existing set
    . label define yesno 3 "maybe", modify           modifies existing value label

    . label define reglbl 1 RRD 2 NW 3 NE 4 NCC 5 SCC 6 CH 7 NES 8 MRD
    . label values reg8 reglbl
    . label define reglbl 7 Southeast 8 "Mekong Delta", modify

Some additional commands that may be useful in labeling:

    label dir           to request a list of existing label names
    label list          to request a list of all the existing value labels
    label drop          to delete one or more labels
    label save using    to save label definitions as a Do-file
    label data          to give a label to a data file

More information is available by typing help label in the Stata Command window. Example 11 shows a frequency table with and without labels. The first table has no labels. Then a label var command is used to define the label Region, a label define command creates a set of labels, and label values attaches those labels to the reg8 variable. The second table has both the variable label (in the upper left corner of the table) and the labels for the regions. Finally, we show how a label list can be used to give the labels assigned to a label name.


Example 11. Using label to make tables more readable


. tab reg8

       reg8 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1175       19.59       19.59
          2 |        731       12.19       31.77
          3 |        128        2.13       33.91
          4 |        708       11.80       45.71
          5 |        628       10.47       56.18
          6 |        276        4.60       60.78
          7 |       1241       20.69       81.46
          8 |       1112       18.54      100.00
------------+-----------------------------------
      Total |       5999      100.00

. label var reg8 Region
. label define reglbl 1 "Red River Delta" 2 "Northwest" 3 "Northeast" 4 "N.C
> Coast" 5 "S.C. Coast" 6 "Central Highlands" 7 "Southeast" 8 "Mekong Delta"
. label values reg8 reglbl
. tab reg8

           Region |      Freq.     Percent        Cum.
------------------+-----------------------------------
  Red River Delta |       1175       19.59       19.59
        Northwest |        731       12.19       31.77
        Northeast |        128        2.13       33.91
        N.C Coast |        708       11.80       45.71
       S.C. Coast |        628       10.47       56.18
Central Highlands |        276        4.60       60.78
        Southeast |       1241       20.69       81.46
     Mekong Delta |       1112       18.54      100.00
------------------+-----------------------------------
            Total |       5999      100.00

. lab list reglbl
reglbl:
           1 Red River Delta
           2 Northwest
           3 Northeast
           4 N.C Coast
           5 S.C. Coast
           6 Central Highlands
           7 Southeast
           8 Mekong Delta

#delimit

In Example 11, you may have noticed that the region labels were too long to fit on one line. This is inconvenient when you are writing the command because, whether you are in the Do-file Editor or the Stata Command window, you have to scroll over to read the end of the command. The #delimit command solves this problem by allowing you to change the symbol used to indicate the end of the command. The default is a hard-return, called cr by Stata. The alternative is the semi-colon.

    #delimit ;     makes the semi-colon the indicator of the end of the command
    #delimit cr    makes the hard-return the indicator of the end of the command

Some facts about #delimit:
- It can only be used in a Do-file. It does not work in the Stata Command window.
- The semi-colon is useful if you have long commands.
- The hard-return is more convenient if you have short commands.

For example, the regional labels could be entered like this:

    label var reg7 Region
    #delimit ;
    label def reglb 1 "North Uplands"
                    2 "Red River Delta"
                    3 "NC Coast"
                    4 "SC Coast"
                    5 "Central Highlands"
                    6 "Southeast"
                    7 "Mekong Delta" ;
    #delimit cr
    lab val reg7 reglb

An alternative way of dealing with long lines is to comment out each hard-return:

    label def reglb 1 "North Uplands"     /*
                 */ 2 "Red River Delta"   /*
                 */ 3 "NC Coast"          /*
                 */ 4 "SC Coast"          /*
                 */ 5 "Central Highlands" /*
                 */ 6 "Southeast"         /*
                 */ 7 "Mekong Delta"
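A third way to split a long command may be available depending on the Stata version. The sketch below assumes Stata 8 or later, where three slashes (///) at the end of a line continue the command on the next line; like #delimit, this works in a Do-file but not in the Command window. The label name reglb_demo is hypothetical and is used only to avoid clashing with labels defined elsewhere in this module.

```stata
* Sketch only: assumes Stata 8 or later, run from a Do-file.
* reglb_demo is a hypothetical label name used for illustration.
label define reglb_demo 1 "North Uplands"     ///
                        2 "Red River Delta"   ///
                        3 "NC Coast"          ///
                        4 "SC Coast"          ///
                        5 "Central Highlands" ///
                        6 "Southeast"         ///
                        7 "Mekong Delta"

* Check that all seven value labels were stored
label list reglb_demo
```

The result is the same set of value labels as with #delimit ; or the /* */ method; the choice is a matter of taste.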

The #delimit command and the /* */ symbols can be used with any command, but they are often used with value labels.

tabulate summarize

This command creates one- and two-way tables that summarize continuous variables. The command tabulate by itself gives frequencies and percentages in each cell (cross-tabulations). With the summarize option, we can put means and other statistics of a continuous variable in each cell. The syntax is:

    tabulate varname1 varname2 [if exp] [in range], summarize(varname3) options

where
    varname1    is a categorical row variable
    varname2    is a categorical column variable (optional)
    varname3    is the continuous variable summarized in each cell
    options     can be used to tell Stata which statistics you want

Some notes regarding this command:
- The default statistics are the mean, the standard deviation, and the frequency.
- You can specify which statistics you want with the options means, standard, and freq.
- You can use the abbreviation tab ..., sum( ), which we refer to as tabsum.
- This command is similar to the Stata command by var1: sum var3, except that the tabsum output is more attractive and tabsum allows two categorical variables.
- This command is also similar to the SPSS command means var3 by var1.

Some examples:

    tab reg8, sum(rlpcex2)           gives the mean, std deviation, and frequency of per capita expenditure for each region
    tab urban98, sum(hhsize) mean    gives the mean household size for urban and rural households
    tab farm urban98, sum(food)      gives the mean, std deviation, and frequency in each cell of a 2x2 table of farmer/nonfarmer and urban/rural

In Example 12, we give the output for three tabsum commands. The first table is a one-way table (just one categorical variable) showing the mean, standard deviation, and frequency of per capita expenditure for each expenditure quintile. In the second table, we use the mean option so only mean per capita expenditure is shown. In the third table, we add a second categorical variable (urban98), making it a two-way table. Although we could have requested all the default statistics in the two-way table, this makes the table difficult to read, so we do not advise it.

Example 12: Using tabsum


. tab quint98, sum(rlpcex2)

            | Summary of Expenditure per capita
  exp quint |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   1180.4918   256.92477         917
          2 |   1738.2957   163.67476        1012
          3 |   2248.0415   209.72735        1158
          4 |   3080.1508   357.28157        1316
          5 |   6571.6628   3597.4445        1596
------------+------------------------------------
      Total |   3331.6804   2768.0741        5999

. tab quint98, sum(rlpcex2) mean

            |  Summary of
            | Expenditure
            |  per capita
      quint |        Mean
------------+------------
          1 |   1180.4918
          2 |   1738.2957
          3 |   2248.0415
          4 |   3080.1508
          5 |   6571.6628
------------+------------
      Total |   3331.6804

. tab quint98 urban98, sum(rlpcex2) mean

       Means of B.M&Reg price adj. pc exp

           | 1:urban 98; 0:rural
           |          98
     quint |     Rural      Urban |     Total
-----------+----------------------+----------
         1 | 1175.8359  1279.9701 | 1180.4918
         2 | 1738.5241  1735.8682 | 1738.2957
         3 | 2242.8277  2278.9809 | 2248.0415
         4 | 3056.2124   3138.038 | 3080.1508
         5 | 5260.4178  7253.5102 | 6571.6628
-----------+----------------------+----------
     Total | 2477.9412  5438.3927 | 3331.6804


tabstat

This command gives summary statistics for a set of continuous variables for each value of a categorical variable. The syntax is:

    tabstat varlist [if exp] [in range], stat(statname [...]) by(varname)

where
    varlist     is a list of continuous variables
    statname    is a type of statistic
    varname     is a categorical variable

Some facts about this command:
- The default statistic is the mean.
- Optional statistics subcommands include mean, sum, max, min, range, sd (standard deviation), var (variance), skewness, kurtosis, median, and pn (nth percentile).
- Without the by() option, tabstat is like summarize except that it allows you to specify the list of statistics to be displayed.
- With the by() option, tabstat is like tabulate summarize except that tabstat is more flexible in the statistics and format.
- It is very similar to the SPSS command means.

Examples:

    tabstat farmsize hhsize, stats(mean max min)    gives the mean, max, and min of farmsize and hhsize
    tabstat farmsize hhsize, by(reg8)               gives the mean of the two variables for each region
    tabstat farmsize, stats(median) by(reg8)        gives the median farmsize for each region

Example 13. Using tabstat to create tables

. tabstat rlpcex2, stats(p25 p50 p75 mean) by(reg8)

Summary for variables: rlpcex2
     by categories of: reg8 (Region)

            reg8 |       p25       p50       p75      mean
-----------------+----------------------------------------
 Red River Delta |  1874.237   2583.79  3919.563  3392.663
       Northwest |  1460.617  1973.746  2963.513  2361.977
       Northeast |   1234.18  1633.964   2143.89  1831.876
       N.C Coast |  1568.988   2102.81  2906.817  2604.028
      S.C. Coast |  1813.223  2485.993   3610.65  3098.375
Central Highland |  1158.759   1818.21  2784.002  2114.033
       Southeast |  2553.454  3851.385  6213.039  5034.884
    Mekong Delta |  1795.612  2474.845  3586.993  3073.823
-----------------+----------------------------------------
           Total |  1772.714  2523.891   3866.33   3331.68
----------------------------------------------------------
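For trainees who want to compare tabulate summarize and tabstat without the VLSS files, the following sketch uses the auto example dataset that ships with Stata. It assumes a version of Stata that has the sysuse command (Stata 8 or later); the variables price, mpg, and foreign belong to that example dataset and are not part of the VLSS.

```stata
* Sketch only: uses Stata's shipped example data, not the VLSS.
sysuse auto, clear

* tabsum: mean, standard deviation, and frequency of price by car origin
tabulate foreign, summarize(price)

* tabsum with the means option: only the mean is shown
tabulate foreign, summarize(price) means

* tabstat: several statistics for several variables at once
tabstat price mpg, statistics(mean median sd) by(foreign)
```

The first two commands can summarize only one continuous variable at a time, while tabstat accepts a list of variables and a list of statistics.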


table

This command creates a wide variety of tables. It is probably the most flexible and useful of all the table commands in Stata. The syntax is:

    table rowvar colvar [if exp] [in range], c(clist) [row col]

where
    rowvar    is the categorical row variable
    colvar    is the categorical column variable
    clist     is a list of statistics and variables
    row       is an option to include a summary row
    col       is an option to include a summary column

Some useful facts about this command:
- The default statistic is the frequency.
- Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min, median, and pn (nth percentile).
- The c( ) is short for contents of each cell.
- Like tab, it can be used to create one- and two-way frequency tables, but table cannot do percentages.
- Like tabsum, it can be used to calculate basic statistics for each value of a categorical variable. Its advantage over tabsum is that it can do more statistics and it can take more than one continuous variable.
- Like tabstat, it can be used to calculate advanced statistics for each value of a categorical variable. Its advantage over tabstat is that it can do two-way (and higher) tables, but its disadvantage is that it has fewer statistics.
- It is similar to table in SPSS, but easier to learn and less flexible in formatting.

Here are some examples:

    . table reg8, row                                    table of frequencies by region with total row
    . table reg8, c(mean income)                         table of average income by region
    . table reg8, c(mean yield sd yield median yield)    table of yield statistics by region
    . table reg8, c(mean yield) format(%9.2f)            table of average yields by region with format
    . table reg8 sex, c(mean yield)                      table of average yield by region and sex
    . table reg8 sex, c(mean income mean yield)          table of avg yield & income by region & sex

Some output from table commands is shown in Example 14. The first table is a two-way table of average household size by region and urban/rural (the urban cell for the Central Highlands is empty because the sample did not include any urban areas there). The second table is the same except that the format option has been added to reduce the size of the numbers. The option format(%4.1f) means fixed format with a total width of 4 characters and one digit to the right of the decimal point. The third table gives the average per capita expenditure for farm and non-farm households in each region. It uses format(%6.0f), which expresses expenditure as an integer. Also note that it has a summary column, but no summary row. Usually, in a two-way table, it is useful to have both row and column summaries.


Example 14. Using table

. table reg8 urban98, c(mean hhsize) row col

---------------------------------------------------
                  |    1:urban 98; 0:rural 98
           Region |     Rural      Urban      Total
------------------+--------------------------------
  Red River Delta |  4.045977  4.0459184  4.0459574
        Northwest | 5.0815972  3.6580645  4.7797538
        Northeast | 5.3541667    4.09375  5.0390625
        N.C Coast | 4.6666667  4.6759259  4.6680791
       S.C. Coast | 4.7352941  5.0136364  4.8328025
Central Highlands | 5.8478261             5.8478261
        Southeast | 5.1028571   4.780037  4.9621273
     Mekong Delta | 5.1373494  4.3971631  4.9496403
                  |
            Total | 4.8702272  4.4612717   4.752292
---------------------------------------------------

. table reg8 urban98, c(mean hhsize) row col format(%4.1f)

---------------------------------------
                  | 1:urban 98; 0:rural
                  |          98
           Region | Rural  Urban  Total
------------------+--------------------
  Red River Delta |   4.0    4.0    4.0
        Northwest |   5.1    3.7    4.8
        Northeast |   5.4    4.1    5.0
        N.C Coast |   4.7    4.7    4.7
       S.C. Coast |   4.7    5.0    4.8
Central Highlands |   5.8           5.8
        Southeast |   5.1    4.8    5.0
     Mekong Delta |   5.1    4.4    4.9
                  |
            Total |   4.9    4.5    4.8

. table reg8 farm, c(mean rlpcex2) col format(%6.0f)

------------------------------------------------
                  |    Type of HH (1:farm;
                  |         0:nonfarm)
           Region | non farm      farm     Total
------------------+-----------------------------
  Red River Delta |     4738      2410      3393
        Northwest |     3534      1962      2362
        Northeast |     2528      1779      1832
        N.C Coast |     3466      2238      2604
       S.C. Coast |     4142      2180      3098
Central Highlands |     2737      2044      2114
        Southeast |     5914      3316      5035
     Mekong Delta |     3657      2569      3074
------------------------------------------------


weights

What are sampling weights?

Sampling weights are used to compensate for under- or over-representing certain households in a sample and to allow extrapolation of the sample results to the population. Let's take a simple example:

Suppose you wanted to estimate the total population of Hanoi by randomly interviewing 25% of the households. In your sample, there are h households and H people. Your estimate of the total population of Hanoi would be 4*H. Similarly, if you interview 10% of the households in Da Nang and find d households and D people, your estimate of the population of Da Nang would be 10*D. If you want an estimate of the population of the two cities together, you would calculate 4*H+10*D. If you wanted to estimate the average household size of the two cities, you would have to divide the estimated total population (4*H+10*D) by the estimated total number of households (4*h+10*d).

The basic principle is that the sampling weight is the inverse of the probability of selection. Because of clustering and sampling, virtually all random-sample surveys must use weights to make estimates that are valid for the whole population. Furthermore, the calculation of sums, averages, and percentages must take into account the sampling weights.

Sampling weights in the VLSS

The calculation of the sampling weights in the VLSS is much more complicated than the example given above, but the principle is the same. The GSO estimated the probability that each household would be selected and then calculated the sampling weight as the inverse of that probability. In the VLSS, the sampling weight is in hhexp98n.dta and the variable name is wt (see footnote 2). We can use the table command to generate some statistics about the VLSS weights.

. table reg8 urban98, c(mean wt) row col format(%7.0f)

---------------------------------------
                  | 1:urban 98; 0:rural
                  |          98
           Region | Rural  Urban  Total
------------------+--------------------
  Red River Delta |  3482   2439   3134
        Northwest |  3441   2330   3206
        Northeast |  3692   2087   3291
        N.C Coast |  3434   1803   3185
       S.C. Coast |  2281   1841   2127
Central Highlands |  1320          1320
        Southeast |  1806   2208   1981
     Mekong Delta |  3093   2482   2938
                  |
            Total |  2869   2242   2689
---------------------------------------

The average weight is about 2689, meaning that each household in the VLSS sample represents (on average) about 2689 households in Vietnam. The new 2001 Vietnam Household Living Standards Survey will have a sample of about 75,000, so the weights will be much smaller, probably around 230.

Footnote 2: This weight is used for calculating averages in which every household has equal weight. Sometimes, we want to give each person an equal weight, such as when we want to calculate the percentage of people that are in households below the poverty line. For these calculations, it is better to use the variable wthhsize as a weight. This variable is simply wt*hhsize.


Using sampling weights in Stata

The calculation of weighted sums and weighted averages would be very tedious, but fortunately survey software such as SPSS and Stata do this for us. In SPSS, you turn on the weights and they are used in all calculations until you turn them off. Stata is different in that you tell Stata which commands should use weights. Stata allows four kinds of weights:

1) fweights, or frequency weights, are weights that indicate the number of duplicated observations.
2) pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included due to the sampling design.
3) aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation.
4) iweights, or importance weights, are weights that indicate the "importance" of the observation in some vague sense.

Here we will focus on pweights and fweights (see footnote 3). The syntax for using weights is:

    command ... [weighttype=varname] ...

In the case of the VLSS, we will generally be using the following syntax:

    command [pw=wt]

Here are some examples:

    tab reg8 [fw=wt]                       gives the weighted frequencies in each region
    sum hhsize [fw=wt]                     gives the weighted mean household size
    tab sex [fw=wt], sum(rlpcex2)          gives table of weighted mean expenditure by sex of head of household
    tabstat hhsize [fw=wt], by(urban98)    gives the weighted average household size for urban and rural households
    table reg8 [pw=wt], c(mean age)        gives the weighted mean age of heads by region

Example 15 shows the effect of weights. The first table gives the unweighted percentage of urban and rural households in each region. In the second table, the weights are turned on. Notice that the urban households represent almost 29 percent of the sample but just 24 percent of the population. This means that urban households were slightly over-represented in the original VLSS sample (you can verify in the table above that urban weights are slightly smaller). This also means that using the raw, unweighted results would give too much weight to urban households relative to their share of the population. The box also shows that weighted and unweighted means are different. The average household size is 4.75 without the weights and 4.70 with the weights. Notice that the number of observations in the second sum command is about 16 million (shown as 1.6e+07). This represents the extrapolated number of households. Type help weights in the Stata Command window for more information.

Footnote 3: For a number of commands, like tab, sum, and tabstat, Stata does not allow pweight, but fweight gives the correct percentages and means.


Example 15. Using weights in generating tables


. tab reg8 urban98, nof row

                  | 1:urban 98; 0:rural
                  |          98
           Region |     Rural      Urban |     Total
------------------+----------------------+----------
  Red River Delta |     66.64      33.36 |    100.00
        Northwest |     78.80      21.20 |    100.00
        Northeast |     75.00      25.00 |    100.00
        N.C Coast |     84.75      15.25 |    100.00
       S.C. Coast |     64.97      35.03 |    100.00
Central Highlands |    100.00       0.00 |    100.00
        Southeast |     56.41      43.59 |    100.00
     Mekong Delta |     74.64      25.36 |    100.00
------------------+----------------------+----------
            Total |     71.16      28.84 |    100.00

. tab reg8 urban98 [fw=wt], nof row

                  | 1:urban 98; 0:rural
                  |          98
           Region |     Rural      Urban |     Total
------------------+----------------------+----------
  Red River Delta |     74.03      25.97 |    100.00
        Northwest |     84.59      15.41 |    100.00
        Northeast |     84.15      15.85 |    100.00
        N.C Coast |     91.37       8.63 |    100.00
       S.C. Coast |     69.68      30.32 |    100.00
Central Highlands |    100.00       0.00 |    100.00
        Southeast |     51.41      48.59 |    100.00
     Mekong Delta |     78.58      21.42 |    100.00
------------------+----------------------+----------
            Total |     75.95      24.05 |    100.00

. sum hhsize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize |    5999    4.752292   1.954292          1         19

. sum hhsize [fw=wt]

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize | 1.6e+07    4.700221   1.908688          1         19
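The two-city example used to introduce sampling weights can also be reproduced on a very small scale. The sketch below is an illustration only: the four households and their sizes are made up, and the weights 4 and 10 are simply the inverses of the 25% and 10% sampling rates in that example. Integer weights are assumed so that fweight can be used with summarize.

```stata
* Sketch only: invented toy data, not from the VLSS.
clear
input str8 city hhsize wt
"Hanoi"    4  4
"Hanoi"    6  4
"Da Nang"  3 10
"Da Nang"  5 10
end

* Unweighted mean household size: (4+6+3+5)/4
summarize hhsize

* Weighted mean: estimated population / estimated number of households
summarize hhsize [fw=wt]

* The same weighted mean by hand: (4*H + 10*D) / (4*h + 10*d)
* with H = 4+6 people, D = 3+5 people, h = 2 and d = 2 households
display (4*(4+6) + 10*(3+5)) / (4*2 + 10*2)
```

The weighted mean reported by summarize should match the hand calculation in the last line, because Stata divides the weighted sum of hhsize by the sum of the weights.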


SECTION 7: PRESENTING DATA WITH GRAPHS

This section provides a brief introduction to creating graphs. In Stata, all graphs are made with the graph command, but there are 8 types of charts and numerous subcommands for controlling the type and format of graph. In this section, we focus on four types of graph and a few options. These are the subcommands covered in this section:

    graph
    histogram
    twoway
    bar
    pie
    matrix
    xlabel
    ylabel
    connect( )
    symbol( )

graph

This command generates numerous types of graphs and diagrams. The syntax is:

    graph [varlist] [if exp] [in range], graphtype options

where
    varlist      is the list of variables to graph
    graphtype    is the type of graph
    options      are commands to control the look of the graph

The eight graph types are:

    histogram    Bar chart based on frequency
    oneway       Scatterplot with one variable
    twoway       Scatterplot with two variables
    matrix       Matrix of two-way scatterplot graphs
    box          Box-and-whisker plot
    star         Star chart
    bar          Bar chart of means or sums
    pie          Pie chart

There are too many options to describe here, but we describe how to make some of the more common graphs. The default graph type depends on the number of variables specified:
- The default graph type is histogram if only one variable is specified.
- The default graph type is two-way scatterplot if two or more variables are specified.

Some options are common to many graph types:

    title(text)    specifies the title to use on the graph
    b2(text)       specifies title on X axis (b for bottom)
    l2(text)       specifies title on Y axis (l for left)
    xlabel         uses round values to label x axis
    ylabel         uses round values to label y axis
    by(var1)       repeats graph for each value of var1

Some options for histograms:

    bin(#)         specifies that the histogram will have # bars
    freq           labels Y axis in terms of frequency
    percent        labels Y axis in terms of percent
    normal         draws a normal curve with the mean and SD of the variable

Some options for two-way scatterplots:

    connect( )     to specify how points are connected
    symbol( )      to specify what the markers look like

Some options for bar charts:

    means          graphs means of variables given
    stack          stacks the bars for each variable rather than putting them side by side

Here are some examples of the graph command:

    graph x                          histogram of x
    graph x, bin(5) xlabel ylabel    histogram with 5 bars and rounded axis labels
    graph y1 y2 x                    scatter plot of y1 and y2 against x
    graph y x, by(region)            scatter plots of y against x for each region
    graph a b c, bar                 graph sums of a, b, and c as bars
    graph a b c, bar means           graph means of a, b, and c as bars

Example 16 shows the result of the command graph ricexpd hhsize, xlabel ylabel. It was inserted into Word by clicking Edit/Copy Graph in Stata and then Control-V in Word.
Example 16. Two-way scatterplot graph


In Example 17, a histogram was created with the command: graph rlpcex2 if rlpcex2<20000, xlabel ylabel normal bin(20)
Example 17. Histogram of per capita expenditure in Vietnam

In Example 18, the data were sorted by reg8, then the graph was created with: graph rlpcex2, bar means by(reg8) ylabel
Example 18. Bar chart of per capita expenditure by region
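The graph syntax described in this section is the one used by the version of Stata taught in this course. In Stata 8 and later the graph commands were redesigned, so the commands above may need to be rewritten. The sketch below shows rough equivalents under that assumption; it uses the auto example dataset shipped with Stata (loaded with sysuse), whose variables price, mpg, and foreign are not part of the VLSS.

```stata
* Sketch only: assumes Stata 8 or later and the shipped auto dataset.
sysuse auto, clear

* Roughly equivalent to: graph price, bin(5) freq
histogram price, bin(5) frequency

* Roughly equivalent to: graph price mpg
twoway scatter price mpg

* Roughly equivalent to a bar chart of means for each category
graph bar (mean) price mpg, over(foreign)
```

Each command draws a new graph that replaces the previous one in the Graph window.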


SECTION 8: MODIFYING DATA FILES

This section describes a number of commands that are used to modify and combine data files in Stata. We begin with five simple commands and then move to five more complex ones.

    rename
    drop
    keep
    sort
    compress
    collapse
    merge
    append
    reshape
    fillin

rename

This command renames variables. Some examples:

    rename oldname newname
    rename s1aq06y age

drop

This command deletes records or variables. Examples are:

    drop if age>140     deletes records in which age is greater than 140
    drop if area==.     deletes records in which area is missing
    drop temp1 temp2    deletes variables temp1 and temp2

keep

This command deletes everything but specified observations or variables. Examples include:

    keep if age <= 140           keeps only records in which age is 140 or under
    keep househol age rlpcex2    keeps only variables househol, age, and rlpcex2, deleting others

sort

This command sorts the records in the file according to the value of specified variables. Examples are:

    sort reg8 househol    sorts data file by reg8 and within each region by househol ID
    sort urban98          sorts by the dummy variable urban98

compress

This command reduces the size of the file by changing the data storage types. It will not make any changes that would cause Stata to lose data. This command has no options or arguments.


collapse

This command is used to create a new data file by aggregating the existing one. It allows you to change the level of the data file. Person-level data can be collapsed to the household level to calculate the size of the household. Crop-level data can be collapsed to the household level to calculate the value of agricultural production per household. The syntax is:

    collapse (stat) varlist1 (stat) varlist2, by(varlist3)

where
    stat        refers to one of the statistics
    varlist1    are the variables to be aggregated using the first statistic
    varlist2    are the variables to be aggregated using the second statistic
    varlist3    are the categorical variables which define the aggregation

Some points about the collapse command:
- The default statistic is mean.
- Optional statistics are mean, sum, rawsum, count, max, min, median, and pn (the nth percentile, where n is between 1 and 100).
- The output file will have one record for each value of varlist3 in the by( ) option.
- If no by( ) option is given, then the data will be collapsed to one record.
- This is similar to aggregate in SPSS except Stata does not require you to define a new name for the aggregated variable (by default, it uses the old variable name).

Examples of the collapse command:

    collapse age educ income, by(province)           creates a dataset of provincial means of age, education, and income
    collapse (median) income, by(province)           creates a dataset of provincial medians of income
    collapse (mean) age (median) income, by(reg8)    creates a dataset of regional means of age and regional medians of income
    collapse (mean) age educ (median) income         creates a dataset with overall means & medians

Example 19. Using collapse to calculate household size

. use scr01a2
. sum s1aq05y

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     s1aq05y |   28069    1969.846   19.95489       1899       1998

. collapse (count) idcode if s1aq11==1, by(househol)
. gen hhsize = idcode
. sum hhsize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize |    6002    4.751583    1.95443          1         19
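The same idea can be tried on a tiny person-level file before working with the VLSS. In the sketch below, the five persons and their ages are invented for illustration; the variable names househol and idcode follow Example 19, while hhsize and avgage are new names given to the aggregated variables with the newvar=oldvar form of collapse.

```stata
* Sketch only: invented person-level data, not from the VLSS.
clear
input househol idcode age
101 1 45
101 2 43
101 3 12
102 1 60
102 2 58
end

* One record per household: count of members and mean age
collapse (count) hhsize=idcode (mean) avgage=age, by(househol)

list
```

After the collapse, the file has one record per household instead of one per person, and the person-level variables that were not listed are gone. Because collapse replaces the data in memory, it is wise to save the original file first.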


In Example 19, we use collapse to calculate the average household size from the person-level data. The first sum command shows that there are 28,069 records in the original person-level file. After the collapse, the second sum command indicates that there are just 6002 records. It also shows that the average household size (unweighted) is 4.75, the same figure we found in the hhexp98n file in Section 3.

merge

This command combines two files with different variables into one file. Until now, all the commands we have worked with used just one file. The VLSS has over one hundred files, however, and often we would like to combine data from different files. For example:
- to calculate expenditure, we need to combine the files for food expenditure and non-food expenditure
- to calculate school attendance rates, we need to combine the file with age and the file with school attendance
- to examine the relationship between the value of the house and housing characteristics, we need to combine several files
- to calculate the value of agricultural production, we need to combine the files for rice, other food crops, annual industrial crops, and permanent industrial crops.

Files can be combined vertically (top to bottom). In this case, the two files have different records and are linked by having the same variables. The files below have different records but the same variables. The first file has crops 1-10, while the second file has crops 11-20. They can be combined with append as described later.

Two files before append

    First file                          Second file
    hhid  crop  area  quant  value      hhid  crop  area  quant  value
     101     1                           101    16
     101     4                           102    12
     102     1                           102    13
     102     7                           103    11
     103     2                           103    16
                                         103    19

One file after append

    hhid  crop  area  quant  value
     101     1
     101     4
     102     1
     102     7
     103     2
     101    16
     102    12
     102    13
     103    11
     103    16
     103    19

Files can be combined horizontally (side to side). In this case, the two files have different variables and are linked by having the same observations (person, household, crop, etc.). The files below have different variables but the same records (households). The command merge will combine records with the same household identification number (hhid). This would allow an analysis of how housing value (in the second file) varies according to expenditure quintile (in the first).


Two files before merge

    hhid   region   urban   exppc   quint   farm
    101
    102
    103
    201
    202
    203
    204

    hhid   housetype   water   elect   value
    101
    102
    103
    201
    202
    203
    204

One file after merge

    hhid   region   urban   exppc   quint   farm   housetype   water   elect   value
    101
    102
    103
    201
    202
    203
    204

The syntax for the merge command is:

    merge [varlist] using filename

where
    varlist    is the list of variables in common
    filename   is the data file that the current data set will be merged with

Some notes about the merge command:
- Both the original file and the new file must be sorted by the common variable(s) before merging.
- A variable called _merge is created which indicates the source of each record:
    _merge=1 means it is from the original data set only
    _merge=2 means it is from the new data set only
    _merge=3 means it is from both data sets.
- It is a good idea to run a tab _merge command after every merge to check the result.
- The merge command in Stata is similar to the match files command in SPSS.

Some examples:

    use members                    opens file members
    merge hhid perid using educ    merges file members with educ, with hhid and perid as the common variables
    use hhchar                     opens file hhchar
    merge hhid using housing       merges hhchar and housing using hhid as the common variable
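The merge steps can be sketched with two small made-up files. This is only an illustration: the variables exppc and value and all the numbers are hypothetical, not VLSS data. It also assumes Stata 11 or later, where the syntax is merge 1:1 varlist using filename and sorting is no longer required; in older versions, sort both files and use merge varlist using filename as described above.

```stata
* Hypothetical housing file (not VLSS data), saved as a temporary file
clear
input hhid value
101 50
102 75
204 20
end
tempfile housing
save `housing'

* Hypothetical expenditure file, kept in memory as the original data set
clear
input hhid exppc
101 1200
102 950
103 1800
end

* Match households on hhid; _merge records the source of each record
merge 1:1 hhid using `housing'
tab _merge

* Keep only households found in both files
keep if _merge == 3
drop _merge
list
```

Here household 103 appears only in the original file (_merge=1) and household 204 only in the new file (_merge=2), so two households remain after the keep command.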

In Example 20, we merge the list of household members (scr01a2) with the education file (scr02a). We open the household member file, rename some variables, delete others, and then sort. Next, we merge the member file and the education file. After creating the attendance variable and dropping more variables, the des command shows that we have age and sex from the first file and attend from the second.


Example 20. Using merge to calculate school attendance


. use "D:\Vietnam Pov Mapping\Training\SCR01A2.DTA", clear
. rename s1aq02 sex
. rename s1aq06y age
. keep househol idcode sex age
. des

Contains data from D:\Vietnam Pov Mapping\Training\SCR01A2.DTA
  obs:        28,633
 vars:             4                          16 Dec 1999 15:00
 size:       343,596 (99.7% of memory free)
-----------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
househol        long   %12.0g                 HOGIADINH
idcode          byte   %8.0g                  MA HIEU:
sex             byte   %8.0g                  2. Gioi tinh :
age             int    %8.0g                  6. [TEN] bao nhieu..?SO NAM:
-----------------------------------------------------------------------------
Sorted by:  househol
     Note:  dataset has changed since last saved

. sort househol idcode
. merge househol idcode using scr02a
. gen attend=(s2aq03==1 | s2aq03==3)*100
. drop cluster-s2aq22
. des

Contains data from D:\Vietnam Pov Mapping\Training\SCR01A2.DTA
  obs:        28,633
 vars:             6                          16 Dec 1999 15:00
 size:       486,761 (99.5% of memory free)
-----------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
househol        long   %12.0g                 HOGIADINH
idcode          byte   %8.0g                  MA HIEU:
sex             byte   %8.0g                  2. Gioi tinh :
age             int    %8.0g                  6. [TEN] bao nhieu..?SO NAM:
_merge          byte   %8.0g
attend          float  %9.0g
-----------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. sort age
. graph attend if age<25, bar means ylabel by(age)


The graph in Example 21 shows the percentage attending school for each age from 0 to 24.
Example 21. Graph of school attendance by age

append

This command combines two files with different records but the same variables. The syntax is:

    append using filename

where filename is the name of the file to be added to the current data set. This command is similar to join files in SPSS. In the VLSS, the append command is useful in analyzing household expenditure and agricultural production. For example, the agricultural production data are found in six files:

    scr09b1    rice production
    scr09b2    other food production
    scr09b3    annual industrial crops
    scr09b4    permanent industrial crops
    scr09b5    fruit crops
    scr09b6    agro-forestry crops

In order to calculate the value of agricultural production, crop sales, or total income, it is necessary to combine these files. Since they have similar variables but refer to different observations (crops), we combine them with append. We will illustrate the method by combining the rice and other food files.


Because the variables are not quite the same, we need to rename the variables before combining the files. This will require more than 10 commands, so it is probably worth creating a Do-file by clicking on Window/Do-file Editor. In the file, we type the following commands:
    use "D:\Vietnam Pov Mapping\Training\SCR09B1.DTA", clear
    rename s9b1cc crop
    rename s9b1q03 area
    rename s9b1q04 prod
    rename s9b1q061 saleq
    replace saleq = saleq*(1/.67) if s9b1q062==2
    rename s9b1q071 buyer
    keep househol crop area prod saleq buyer
    des, short
    save riceprod

    use "D:\Vietnam Pov Mapping\Training\SCR09B2.DTA", clear
    rename s9b2cc crop
    rename s9b2q03 area
    rename s9b2q04 prod
    rename s9b2q06 saleq
    rename s9b2q071 buyer
    keep househol crop area prod saleq buyer
    des, short
    save foodprod

    use riceprod, clear
    append using foodprod
    save allfood
    des, short
    table crop, c(mean area mean prod mean saleq) format(%6.0f)

Example 22 shows selected results from the Stata Results window. The rice file (after modification) contained 8,720 records and 6 variables. The other food file (after modification) contained 10,541 records and 6 variables. The combined file has 19,261 records (8,720 + 10,541) and 6 variables.
Example 22. Using append to combine files
Contains data from D:\Vietnam Pov Mapping\Training\SCR09B1.DTA
  obs:         8,720
 vars:             6                          11 Jul 1999 16:13
 size:       261,600 (99.7% of memory free)
Sorted by:  househol crop
     Note:  dataset has changed since last saved

Contains data from D:\Vietnam Pov Mapping\Training\SCR09B2.DTA
  obs:        10,541
 vars:             6                          11 Jul 1999 15:45
 size:       316,230 (99.7% of memory free)
Sorted by:  househol
     Note:  dataset has changed since last saved

Contains data from allfood.dta
  obs:        19,261
 vars:             6                          4 Aug 2002 02:53
 size:       654,874 (99.4% of memory free)
Sorted by:
     Note:  dataset has changed since last saved
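The same append logic can be tried without the VLSS files. The sketch below uses two tiny made-up files that already share the same variable names (all values are hypothetical), so no renaming is needed before combining them.

```stata
* Hypothetical "other food" file, saved as a temporary file
clear
input househol crop area
101 11 0.5
102 12 1.2
end
tempfile foodprod
save `foodprod'

* Hypothetical "rice" file, kept in memory as the current data set
clear
input househol crop area
101 1 2.0
102 1 1.5
103 1 0.8
end

* Add the records of the second file below those of the first
append using `foodprod'
count

* Mean area and number of records for each crop in the combined file
tabstat area, by(crop) stat(mean n)
```

The combined file has 3 + 2 = 5 records, just as the record counts add up in Example 22.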


fillin

This command inserts additional records into a file so that all combinations of two or more variables are in the file. Again, it is easier to give an example than to describe it. Suppose we are working with crop production data. Data are collected on 5 crops, but most households only grow 2-4 of them. The records exist only for crops grown by the household, as shown below:

File in original form

    hhid   crop   area   prod
    1      1      3      3
    1      3      1      1
    1      5      1      1
    2      1      1      1
    2      5      1      1
    3      1      2      2
    3      4      1      1
    3      5      4      4

If we calculate the average area for each crop, it will give the average area among those growing the crop. If we want the average area including the non-growers, it is not easy to calculate. Stata allows you to fill in the missing records of crops not grown by each household. The syntax is easy:

    fillin varlist

where varlist is the list of variables, every combination of which we want to exist in the file. Using our example above, the command would be

    fillin hhid crop

Stata will look for all the values of hhid and all the values of crop in the file, then it will make sure every hhid-crop combination has a record. When it has to insert a record, the values of the other variables will be missing. The new file would look like this:
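File after fillin command

    hhid   crop   area   prod
    1      1      3      3
    1      3      1      1
    1      4      .      .
    1      5      1      1
    2      1      1      1
    2      3      .      .
    2      4      .      .
    2      5      1      1
    3      1      2      2
    3      3      .      .
    3      4      1      1
    3      5      4      4

Note that fillin uses only the values that already appear in the file. No household in this example grows crop 2, so no records are created for crop 2.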
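The fillin step can be reproduced with the small crop file from the example above. This is a minimal sketch: it relies on the variable _fillin, which Stata adds to mark the records it created, and it uses replace to set the inserted records to zero so that non-growers count in the averages.

```stata
* Crop file in original form: one record per crop grown
clear
input hhid crop area prod
1 1 3 3
1 3 1 1
1 5 1 1
2 1 1 1
2 5 1 1
3 1 2 2
3 4 1 1
3 5 4 4
end

* Average area among growers only
tabstat area, by(crop) stat(mean n)

* Create a record for every hhid-crop combination found in the file
fillin hhid crop

* _fillin equals 1 for the records that fillin created
replace area = 0 if _fillin == 1
replace prod = 0 if _fillin == 1

* Average area over all households, including non-growers
tabstat area, by(crop) stat(mean n)
```

With 3 households and 4 crop codes present in the file, the filled-in file has 12 records. The mean area for crop 4 falls from 1 (one grower) to 1/3 (three households) once the zeros are included.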


If we calculate the average area and production on this file, we will get the same answer as above. Because missing values are not counted, the result will be the average among growers. But if we replace the missing values with zeros:

    recode area .=0
    recode prod .=0

then the averages will include the zeroes. This is an extremely useful command, particularly for dealing with crop data and expenditure data. SPSS does not have a similar command.

reshape

This command changes a file from tall to wide or from wide to tall. What do we mean by wide and tall? A wide file stores additional information as separate variables, while a tall file stores this information using additional records. An example will be easier to understand. Suppose a household credit survey asks about the amount and source of the three most recent loans. One way to store this data is with a wide file, in which additional loans are stored in additional variables.

File in wide format

    hhid   amount1   source1   amount2   source2   amount3   source3
    1
    2
    3
    4
    5

The other way to store the data is with a tall file, in which additional loans are stored as additional records.

File in tall format

    hhid   loannbr   amount   source
    1      1
    1      2
    1      3
    2      1
    2      2
    2      3
    3      1
    3      2
    3      3
    4      1
    4      2
    4      3
    5      1
    5      2
    5      3

Notice that both files have the same number of data points (30) for loan amount and source of the loan; they are just arranged differently. The reshape command allows you to convert one type of file into the other. For information on how to implement reshape, type help reshape in the Stata Command window.
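As a rough sketch of the conversion, the commands below build a small wide loan file and convert it to tall and back. The two households and their loan values are made up for illustration; the variable names follow the loan example above.

```stata
* Hypothetical wide file: one record per household, up to three loans
clear
input hhid amount1 source1 amount2 source2 amount3 source3
1 100 1 50 2 . .
2 200 3  . . . .
end

* Wide to tall: one record per household and loan number
reshape long amount source, i(hhid) j(loannbr)
list, sepby(hhid)

* Tall back to wide: one record per household again
reshape wide amount source, i(hhid) j(loannbr)
list
```

In the tall file each household has three records (loannbr 1 to 3), including records with missing values for loans that do not exist.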


SECTION 9: REGRESSION ANALYSIS

This section describes the use of Stata to do regression analysis. Regression analysis involves estimating an equation that best describes the data. One variable is considered the dependent variable, while the others are considered independent (or explanatory) variables. Stata is capable of many types of regression analysis and associated statistical tests. In this section, we touch on only a few of the more common commands and procedures. The commands described in this section are:

    regress
    test, testparm
    predict
    probit
    ovtest
    hettest

regress

This command carries out a regression analysis on the variables specified. The syntax is:

    regress depvar varlist [if exp] [in range] [, options]

where
    depvar     is the dependent variable
    varlist    is the list of independent variables

The regress command has many options for specifying the type and format of the output. Type help regress for more information. Some examples of the command:

    . regress y x1 x2 x3 x4 x5                   regress y with xs as independent variables
    . regress y x1 x2 x3 x4 x5 if region==1      same regression but only in one region
    . by region: regress y x1 x2 region*         region* means all variables starting with region

predict

This command can be used to obtain predictions, residuals, etc., after regression analysis. The syntax is:

    predict newvarname [if exp] [in range] [, options]

Two of the most common options are:

    xb       predicted values of y are put in newvarname
    resid    residuals of the regression are put in newvarname

For example:

    . regress y x1 x2 x3
    . predict yhat, xb                   creates variable yhat with predicted values
    . predict e, resid                   creates variable e with residuals
    . probit poverty age sex housing
    . predict index, xb                  creates variable index with the value of the sum of XB
    . predict phat                       creates variable phat with the predicted probability
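The regress and predict commands can be tried on any data set. The sketch below assumes the auto example data set that is installed with Stata (opened with sysuse, available in Stata 8 and later) rather than the VLSS files, so the variables price, weight, mpg, and foreign belong to that example file and not to this course.

```stata
* Open the example data set shipped with Stata
sysuse auto, clear

* Regress car price on weight, mileage, and a foreign-made dummy
regress price weight mpg foreign

* Predicted values and residuals from the regression just estimated
predict yhat, xb
predict e, resid

* The residuals should average approximately zero
summarize price yhat e
```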

Example 23 presents the results of a regression analysis of the determinants of rice expenditure. The results indicate that rice expenditure is greater in larger households headed by older males.


Example 23. Using regress to examine determinants of rice expenditure


. use hhexp98n, clear
. tab reg8, gen(region)

  Code by 8 |
    regions |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1175       19.59       19.59
          2 |        731       12.19       31.77
          3 |        128        2.13       33.91
          4 |        708       11.80       45.71
          5 |        628       10.47       56.18
          6 |        276        4.60       60.78
          7 |       1241       20.69       81.46
          8 |       1112       18.54      100.00
------------+-----------------------------------
      Total |       5999      100.00

. gen age2 = age^2
. regress ricexpd hhsize age age2 sex rlpcex2 educyr98 urban98 region*

      Source |       SS       df       MS              Number of obs =    5999
-------------+------------------------------           F( 14,  5984) =  652.17
       Model |  4.7119e+09    14   336562752           Prob > F      =  0.0000
    Residual |  3.0881e+09  5984  516065.748           R-squared     =  0.6041
-------------+------------------------------           Adj R-squared =  0.6032
       Total |  7.8000e+09  5998  1300436.14           Root MSE      =  718.38

------------------------------------------------------------------------------
     ricexpd |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |   414.4978   5.358447    77.35   0.000     403.9933    425.0023
         age |   51.19595    4.82956    10.60   0.000     41.72827    60.66363
        age2 |  -.4741431   .0472076   -10.04   0.000    -.5666869   -.3815993
         sex |  -109.7768   22.93277    -4.79   0.000    -154.7333   -64.82034
     rlpcex2 |   .0187889   .0043095     4.36   0.000     .0103408     .027237
    educyr98 |  -6.659207   2.670341    -2.49   0.013    -11.89404   -1.424376
     urban98 |  -453.6721    24.4138   -18.58   0.000     -501.532   -405.8123
     region1 |   36.34716   49.75018     0.73   0.465    -61.18113    133.8754
     region2 |   312.9143   51.49933     6.08   0.000      211.957    413.8715
     region3 |   443.1319   77.56194     5.71   0.000     291.0825    595.1813
     region4 |  -21.75859   51.96463    -0.42   0.675     -123.628    80.11081
     region5 |  -147.9203   52.73708    -2.80   0.005     -251.304   -44.53664
     region6 |  (dropped)
     region7 |   37.15175   49.34054     0.75   0.451    -59.57349     133.877
     region8 |   55.16549   48.84751     1.13   0.259    -40.59324    150.9242
       _cons |   -777.223     123.29    -6.30   0.000    -1018.916     -535.53
------------------------------------------------------------------------------

Rice expenditure is positively related to per capita expenditure (though interestingly, the coefficient is negative if you exclude the urban dummy variable). Urban households spend significantly less on rice than rural households, even after controlling for other factors. Compared to the Central Highlands (region6), households in the Northeast and Red River Delta spend more on rice. Note that Stata automatically dropped one of the regional dummy variables (region6) to avoid perfect multicollinearity, so the other regional coefficients are measured relative to that region.


test

This command tests linear hypotheses about the estimated parameters from the most recently estimated model. For example:

    . regress y age female educ region1 region2 region3 region4
    . test region1=region2                      test hypothesis that region1 coef = region2 coef
    . test region4 = (region1+region2)/2        test hypothesis given by equation
    . test educ=.1                              test hypothesis that educ coef = 0.1
    . test region1 region2 region3 region4      test hypothesis that four region dummies are zero

If you want to test the hypothesis that the coefficients on a set of related variables are all equal to zero, you can use the related testparm command:

    . testparm region*                          test hypothesis that all region* dummies are zero

probit

This command carries out a probit regression analysis of the specified variables. The syntax is:

    probit depvar indepvars [if exp] [in range] [, options]

Probit analysis is used when the dependent variable is a categorical variable with only two values. An alternative is the dprobit command, which reports the derivative of the probability with respect to each independent variable instead of the coefficient. Examples include:

    . probit y x1 x2 x3                         run a probit with y as dependent and xs as independent
    . probit y x1 x2 x3, robust                 run a robust probit (weaker assumptions about error)
    . dprobit y x1 x2 x3 if reg8==1             run the probit in one region only

ovtest

Regression analysis generates the best linear unbiased estimates of the true coefficients provided that some assumptions are satisfied. One assumption is that there are no omitted variables that are correlated with the independent variables. This command performs a Ramsey RESET to test for omitted variables (misspecification). The syntax is:

    ovtest [, rhs]

This test amounts to estimating y = xb + zt + u and then testing t=0. If the rhs option is not specified, powers of the fitted values are used for z. Otherwise, powers of the independent variables are used. Examples of the test are:

    . regress y x1 x2 x3
    . ovtest                                    tests significance of powers of predicted y
    . ovtest, rhs                               tests significance of powers of x1, x2, and x3

hettest

Another assumption behind regression analysis is that the variance of the error term is constant across the sample. When this assumption is violated, the problem is called heteroskedasticity. This command tests for heteroskedasticity. The syntax is:

    hettest [varlist]


This command tests t=0 in Var(e) = s^2 exp(zt). If varlist is not specified, the fitted values are used for z. If varlist is specified, the variables specified are used for z. This test is also known as the Breusch-Pagan test for heteroskedasticity. Examples are:

    . regress y x1 x2 x3
    . hettest                test whether variance is related to predicted y
    . hettest x3             test whether variance is related to x3

Example 24 gives the results of some tests related to the regression analysis shown earlier. The test command tests the hypothesis that the coefficients on both age variables are zero, finding that the probability is very low (reported as 0.0000, that is, less than 0.00005), so we can reject this hypothesis. This is not surprising since each is statistically significant on its own. The testparm command tests the hypothesis that all the region coefficients are equal to zero (that region does not influence rice expenditure). The hypothesis is rejected, meaning that the regional coefficients are jointly significant. The ovtest rejects (at the 5 percent level) the hypothesis that there are no omitted variables, indicating that we need to improve the specification (prices would be a good start). And finally, hettest indicates that there is heteroskedasticity which needs to be dealt with.

Example 24. Regression tests
. test age age2

 ( 1)  age = 0.0
 ( 2)  age2 = 0.0

       F(  2,  5984) =   59.97
            Prob > F =    0.0000

. testparm region*

 ( 1)  region1 = 0.0
 ( 2)  region2 = 0.0
 ( 3)  region3 = 0.0
 ( 4)  region4 = 0.0
 ( 5)  region5 = 0.0
 ( 6)  region6 = 0.0
 ( 7)  region7 = 0.0
 ( 8)  region8 = 0.0
       Constraint 6 dropped

       F(  7,  5984) =   26.88
            Prob > F =    0.0000

. ovtest

Ramsey RESET test using powers of the fitted values of ricexpd
       Ho:  model has no omitted variables
                F(3, 5981) =      3.18
                  Prob > F =      0.0230

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of ricexpd
       Ho:  Constant variance
            chi2(1)      =   1473.93
            Prob > chi2  =    0.0000
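The sequence of tests in Example 24 can be practiced on the auto example data set installed with Stata (an assumption of this sketch; the variables are from that file, not the VLSS). In Stata 9 and later, the tests described here as ovtest and hettest are run as estat ovtest and estat hettest, which is the form used below. The last command shows one common response to heteroskedasticity, robust standard errors; it is offered as a suggestion, not as the method used in this course.

```stata
* Example data set shipped with Stata
sysuse auto, clear
generate weight2 = weight^2
regress price weight weight2 mpg foreign

* Joint test that both weight coefficients are zero
test weight weight2

* The same joint test written with a wildcard
testparm weight*

* Ramsey RESET test for omitted variables
estat ovtest

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest

* Re-estimate with heteroskedasticity-robust standard errors
regress price weight weight2 mpg foreign, vce(robust)
```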


SECTION 10: INTRODUCTION TO PROGRAMMING WITH STATA

This section provides a very quick introduction to the topic of programming with Stata. We touch on three topics:
- creating and using macros
- creating and using loops
- matrix algebra

The purpose here is not to provide a comprehensive description of how to program with Stata, but rather to give you an idea of the kinds of things that can be done with Stata. To fully describe Stata programming would require more space than is available here. Furthermore, I do not (yet) know enough about it to teach it.

macros

Macros assign a set of words or a number to a name. There are two types of macros:
- Global macros stay in memory until you leave Stata.
- Local macros exist only within a program or a loop.

The syntax is relatively simple:

    global gmname = expression
    local lmname = expression

To use these macros later, you must use special symbols to tell Stata they are macros:

    $gmname
    `lmname'

One use of the global macro is to store the name of the folder with the data.

    global path = "d:\data\vlss\1998\household"
    use $path\scr09b2.dta

In addition to saving you some time, this macro is useful if you share the program with others who have different names for the folders on their computer. By using the macro, your colleague can change the global command once rather than trying to change the path in every command that opens a file or saves a file. Local macros are used (among other places) in loops with the while command, so we will discuss them in the next section.

while

This command starts a loop, allowing groups of Stata commands to be repeated until some condition is met. The syntax is:

    while exp {
        commands
    }

where
    exp         is an expression. Stata repeats the commands as long as the expression is true.
    commands    are any Stata commands that you want to repeat
    brackets    ({ and }) define the beginning and the ending of the commands to be repeated
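The macro and while syntax can be checked with a short do-file that needs no data. This is a minimal sketch: the macro names depvar, i, and total are made up for illustration. It should be run as a do-file (or all at once), since local macros defined in a do-file disappear when the do-file ends.

```stata
* Global macro: stays defined until you leave Stata
global depvar = "price"

* Local macros: a counter and a running total
local i = 1
local total = 0

* Repeat the commands in brackets as long as i is 5 or less
while `i' <= 5 {
    local total = `total' + `i'
    local i = `i' + 1
}

display "Sum of 1 to 5 is `total'"
display "Global macro depvar contains $depvar"
```

The loop stops when i reaches 6, at which point total holds 1+2+3+4+5.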

This is an example of a loop that uses local macros to carry out a regression analysis of the determinants of housing value for each region:

    tab reg8, gen(region)
    local r = 1
    while `r' <= 8 {
        regress housval roof floor wall room area water if region`r' == 1
        local r = `r' + 1
    }

The tab command creates a dummy variable for each region (region1, region2, etc.). The first local command creates a macro called r that is equal to 1. The while statement says that the commands in brackets will be repeated until the condition r<=8 is no longer true. On each loop, the regress command is carried out in one region (when r=3, the if statement is if region3==1). The second local command increases the value of r each time that the loop is completed. When r reaches 9, the loop stops because the while condition is no longer true. Then Stata goes on to the next command after the bracket.

matrix

Stata has a special set of commands for matrix algebra. These can be used to implement custom econometric procedures or for doing calculations on the output of regression analysis. This is a very short summary of a very long list of complex commands (type help matrix for more information).

1. Creating matrices by hand

Examples:

    matrix mymat = (1,2\3,4)           commas separate elements, backslash indicates new row
    matrix myvec = (1,5,3,1,3)         creates a row vector
    matrix mycol = (1\5\3\1\3)         creates a column vector

2. Setting the maximum matrix size

For regular Stata, the default maximum matrix size is 40x40, but this can be increased up to 800x800 with the matsize command. For Stata SE, the default maximum is 400x400, but this can be increased up to 11,000x11,000. The maximum matrix size can be changed using

    set matsize 500                    sets the maximum size for a matrix at 500x500

3. Manipulating matrices

Examples:

    matrix D = B                            makes matrix D equal to matrix B
    matrix beta = syminv(X'*X)*X'*y         calculates beta using regression equation
    matrix C = (C+C')/2                     redefines C matrix in terms of old values
    matrix sub = A[1..., 2..5]/2            defines matrix using sub-set of A matrix
    matrix A[2,2] = B                       redefines subset of A matrix as equal to B


4. Converting variables into matrices and vice versa

Variables can be converted into matrices and likewise matrices can be converted into variables. Type help mkmat for more information.

5. Using matrices created by Stata

Some Stata commands create matrices which can be retrieved and used. For example, all the regression commands create the following:

    e(b)    coefficient vector
    e(V)    variance-covariance matrix of the estimates

And these matrices can be used as follows:

    matrix beta = e(b)      creates a vector called beta with the estimated coefficients
    matrix cov = e(V)       creates a matrix called cov with the estimated covariances

6. Accumulating cross-product matrices

Most statistical computations involve matrix operations such as X'X or X'WX. In many cases, X may have a very large number of rows and a small number of columns. Stata has a special command for calculating cross-products in these cases. Type help matacum for more information.

7. Matrix utilities

    matrix dir        lists the currently defined matrices
    matrix list       displays the contents of a matrix
    matrix rename     renames a matrix
    matrix drop       deletes a matrix
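Several of the matrix steps above can be combined in one short sketch that recomputes regression coefficients by hand and compares them with e(b). It assumes the auto example data set installed with Stata, and it uses the matrix accum and matrix vecaccum commands to build the cross-products X'X and y'X (both add a constant term automatically). The matrix names XX, yX, b, V, and beta are chosen for illustration only.

```stata
* Example data set shipped with Stata
sysuse auto, clear

* Coefficients and covariance matrix saved by regress
regress price weight mpg
matrix b = e(b)
matrix V = e(V)
matrix list b

* Cross-products: XX = X'X and yX = y'X, each including the constant
matrix accum XX = weight mpg
matrix vecaccum yX = price weight mpg

* Regression equation: beta = (X'X)^-1 X'y
matrix beta = syminv(XX)*yX'
matrix list beta

* Matrix utilities: list the matrices now in memory
matrix dir
```

The column vector beta should match the row vector b from regress, apart from small rounding differences.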
