[...] and Longitudinal Data
A. Introduction
B. Processing Date Variables
C. Working with Two-digit Year Values (the Y2K Problem)
D. Longitudinal Data
E. Selecting the First or Last Visit per Patient
F. Computing Differences between Observations in a Longitudinal Data Set
G. Computing the Difference between the First and Last Observation for Each Subject
H. Computing Frequencies on Longitudinal Data Sets
I. Creating Summary Data Sets with PROC MEANS or PROC SUMMARY
J. Outputting Statistics Other Than Means
A. INTRODUCTION
Working with dates is a task that data analysts frequently [...]
many powerful resources for working with dates. The [...]
in almost any form or to compute the number of days [...]
two dates.
Data collected for the same set of subjects at different [...]
longitudinal data. These data require specialized techniques [...]
seeing how date values are handled with SAS software [...]
B. PROCESSING DATE VARIABLES
Suppose you want to read the following information in [...]
LENGTH_
ID DOB ADMIT DISCHRG DX FEE STAY AGE
This rather strange listing clearly demonstrates how SAS stores dates. For example,
look at the date of birth for subject 003. Notice it is zero. Why? Because this person was
born on January 1, 1960, and this is day zero in SAS-land. Well, you wouldn't want to
show this listing to your boss or colleagues. How do we make the date values look like
the dates we know and love? Just as we used formats in the last chapter to change the
way our values printed, we can use formats here to change the appearance of the date
values. We don't even have to use PROC FORMAT to create these formats; SAS has
already done it for us. Two very popular date formats are MMDDYY10. and DATE9.
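A minimal sketch of the same idea: the step below writes one date to the log first as the raw number SAS stores and then through the two formats. The DATA _NULL_ step and the date literal are illustrative choices and are not part of the HOSPITAL example.

```sas
DATA _NULL_;
   DOB = '01JAN1960'D;                   /* date literal: stored as day 0   */
   PUT 'Stored value:  ' DOB;            /* unformatted: the day count      */
   PUT 'MMDDYY10.:     ' DOB MMDDYY10.;  /* month/day/year with slashes     */
   PUT 'DATE9.:        ' DOB DATE9.;     /* day, month abbreviation, year   */
RUN;
```

The stored value never changes; a format only controls how it is displayed.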
Listing of Data Set HOSPITAL
ID   DOB         ADMIT      DISCHRG
001  10/21/1946  12DEC2004  14DEC2004
002  05/01/1980  08JUL2004  08AUG2004
003  01/01/1960  01JAN2004  04JAN2004
004  06/23/1998  11NOV2004  25DEC2004
to remove the fractional part of the age value. You can "nest" functions (place one
inside of another), as shown here, to make this part of your program more compact (and
yes, elegant).
AGE = INT(YRDIF(DOB,ADMIT,'ACTUAL'));
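As a sketch of how the nested call behaves, the step below computes the same age in separate pieces and also shows rounding as an alternative to truncation. The two dates are those of subject 001 in the HOSPITAL listing; the variable names AGE_EXACT, AGE_INT, AGE_ROUND, and AGE_TENTH are hypothetical.

```sas
DATA _NULL_;
   DOB   = '21OCT1946'D;
   ADMIT = '12DEC2004'D;
   AGE_EXACT = YRDIF(DOB, ADMIT, 'ACTUAL');  /* age including the fraction  */
   AGE_INT   = INT(AGE_EXACT);               /* drop the fractional part    */
   AGE_ROUND = ROUND(AGE_EXACT, 1);          /* nearest whole year          */
   AGE_TENTH = ROUND(AGE_EXACT, .1);         /* nearest tenth of a year     */
   PUT AGE_EXACT= AGE_INT= AGE_ROUND= AGE_TENTH=;
RUN;
```

INT always truncates, so a person one day short of a birthday keeps the lower age; ROUND with a second argument of 1 would report the higher one.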
Format      Result
MMDDYY6.    102150
MMDDYY8.    10/21/50
MMDDYY10.   10/21/1950
DATE7.      21OCT50
DATE9.      21OCT1950
WORDDATE.   October 21, 1950
[...] wanted to round the age to [...] function. This function has two [...] To round to the nearest [...]
C. WORKING WITH TWO-DIGIT YEAR VALUES (THE Y2K PROBLEM)
[...] As October 21, 1904, or October 21, 2004? SAS has a [...]
called YEARCUTOFF=value. The value you supply [...]
beginning of a 100-year interval. Any two-digit date [...]
year window. Starting with SAS version 7, the default [...]
is 1920. Thus, any two-digit year would fall between [...]
1/1/40 would be read as January 1, 1940; the year [...]
2015. If you want to change the value of the YEARCUTOFF [...]
statement. To set the value back to 1900, you would use [...]
OPTIONS YEARCUTOFF = 1900;
Chapter 17, Section D, contains a list of SAS functions [...]
with dates. For example, month, day, and year values [...]
SAS date, or you can extract a month or year from a [...]
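The effect of the 100-year window can be sketched with a short step. It assumes a cutoff of 1920, so that two-digit years fall between 1920 and 2019; the variable names D1 and D2 are hypothetical.

```sas
OPTIONS YEARCUTOFF = 1920;   /* two-digit years map to 1920 through 2019 */

DATA _NULL_;
   D1 = INPUT('01/01/40', MMDDYY8.);   /* 40 falls in the window as 1940 */
   D2 = INPUT('01/01/15', MMDDYY8.);   /* 15 falls in the window as 2015 */
   PUT D1= MMDDYY10. D2= MMDDYY10.;
RUN;
```

Rerunning the step after OPTIONS YEARCUTOFF = 1900; would place both years in the 1900s instead. Reading four-digit years avoids the question altogether.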
D. LONGITUDINAL DATA
PATIENT ID
DATE OF VISIT (Month Day Year)
I{IIAP'F PAfrIF
      #2 @4  DATE2    MMDDYY8.
         @12 HR2      3.
         @15 SBP2     3.
         @18 DBP2     3.
         @21 DX2      3.
         @24 DOCFEE2  4.
         @28 LABFEE2  4.
      #3 @4  DATE3    MMDDYY8.
         @12 HR3      3.
         @15 SBP3     3.
         @18 DBP3     3.
         @21 DX3      3.
         @24 DOCFEE3  4.
         @28 LABFEE3  4.
      #4 @4  DATE4    MMDDYY8.
         @12 HR4      3.
         @15 SBP4     3.
         @18 DBP4     3.
         @21 DX4      3.
         @24 DOCFEE4  4.
         @28 LABFEE4  4.;
   FORMAT DATE1-DATE4 MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
007
007
0 0 9 0 9 0 3L 9 8 30 5 6 r - r , 0 0 7 0 1 "030730 0 0 0 0
009
009
009
The number signs (#) in the INPUT statement [...]
Since our date is in the month-day-year form, we use the [...]
included an output format for our dates with a FORMAT [...]
statement uses the same syntax as the earlier examples [...]
our own formats. The output format MMDDYY10. [...]
printed in month-day-year form with slashes between [...]
With this method of one line per patient visit, we [...]
lines of data for any patient who had less than four visits [...]
subject. This is not only clumsy but it also occupies a lot [...]
to compute within-subject means, we continue (before [...]
AVEHR = MEAN(OF HR1-HR4);
AVESBP = MEAN(OF SBP1-SBP4);
AVEDBP = MEAN(OF DBP1-DBP4);
etc.
DATA PATIENTS;
   INPUT @1  ID      $3.
         @4  DATE    MMDDYY8.
         @12 HR      3.
         @15 SBP     3.
         @18 DBP     3.
         @21 DX      3.
         @24 DOCFEE  4.
         @28 LABFEE  4.;
   FORMAT DATE MMDDYY10.;
DATALINES;
ID DATE HR SBP DBP DX DOCFEE LABFEE
DATA PATIENTS;
   INPUT @1  ID      $3.
         @4  DATE    MMDDYY8.
         @12 HR      3.
         @15 SBP     3.
         @18 DBP     3.
         @21 DX      3.
         @24 DOCFEE  4.
         @28 LABFEE  4.;
   FORMAT DATE MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
0 0 9 0 9 0 3 1 9 8036 6 r - l - 0 0 ? 0 10"033?00 0 0 0
0050705198307414008201300900000
0050115198208018009601402001500
0 0 5 06 1 8 1 9 8 2 0071 ? 0 0 8 4 0 1 - 4I 0000 4 0 0
0050703198306414008401400800200
LABFEE
:: l::-q 1
l-.t, .Ltt>r'. .r-rt rD
RUN;
our data set has been previously sorted by the same variable (it has). The effect of
adding the BY statement is to have SAS create what are called FIRST. and LAST. variables.
In this case, since our BY variable is ID, two variables, FIRST.ID and LAST.ID,
are automatically created. These variables are available in the DATA step but are not
added to the data set (FIRST. and LAST. variables are automatically dropped). The
FIRST. and LAST. variables are logical variables; that is, they have values of true (1) or
false (0). FIRST.ID will be true (1) whenever we are reading the first observation for a
given ID and will be false (0) otherwise; LAST.ID will be true whenever we are reading
the last observation for a given ID and will be false (0) otherwise. To clarify this, the
following shows our observations and the value of FIRST.ID and LAST.ID. Keep in
mind that the two variables FIRST.ID and LAST.ID are not in the SAS data set (but
they are in the PDV and thus are available to be referenced in the DATA step) and
that the data set PATIENTS is now in ID and DATE order.
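One way to watch these temporary variables, sketched under the assumption that PATIENTS has been created as shown in this section, is to write them to the log with a PUT statement. Nothing is written to a data set here, which is why DATA _NULL_ is used.

```sas
PROC SORT DATA=PATIENTS;
   BY ID DATE;                /* BY-group processing requires sorted data */
RUN;

DATA _NULL_;
   SET PATIENTS;
   BY ID;
   /* FIRST.ID and LAST.ID live only in the PDV, so print them to see them */
   PUT ID= DATE= FIRST.ID= LAST.ID=;
RUN;
```

A patient with a single visit shows FIRST.ID=1 and LAST.ID=1 on the same line.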
IF condition;
Here is how it works: If the condition is true, the program continues to process
the statements following the IF statement; if the condition is false, the program
returns to the top of the DATA step. Specifically in this case, if LAST.ID is true, the
program continues and, since this is the bottom of the DATA step, an observation is
automatically written out to the data set RECENT. If LAST.ID is not true, the
program returns to the top of the DATA step (and an observation is not written to data
set RECENT).
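Put together, the step described above can be sketched as follows. It assumes PATIENTS is already sorted by ID and DATE, so that the last observation in each BY group is the most recent visit.

```sas
DATA RECENT;
   SET PATIENTS;     /* assumed sorted by ID and DATE                       */
   BY ID;            /* creates FIRST.ID and LAST.ID                        */
   IF LAST.ID;       /* subsetting IF: keep only each patient's last visit  */
RUN;
```

Replacing IF LAST.ID; with IF FIRST.ID; would select the first visit for each patient instead.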
ID   FIRST.ID  LAST.ID
005     1         0
005     0         0
005     0         0
005     0         1
007     1         0
007     0         1
009     1         1
Listing of Data Set RECENT
ID   DATE        HR  SBP  DBP   DX
005  07/05/1983  74  140   82   13
007  12/01/1983  72  130   90   20
009  09/03/1983  66  110   70  137
F. COMPUTING DIFFERENCES BETWEEN OBSERVATIONS IN A LONGITUDINAL DATA SET
Suppose you want to compute the change (difference [...]
rate and blood pressure, from visit to visit. With the [...]
observation per patient visit, this gets a bit tricky. Two very [...]
between observations are the LAG function and [...]
how the LAG function works.
You may have come across the term "lagged" values [...]
asthma-related doctor visits to ozone levels, you may [...]
the current day's ozone level and the ozone levels from [...]
referred to as the ozone level, lagged 24 hours. Now for [...]
The LAG function returns the value of its argument, the [...]
executed. What does this mean? An example will help. Look [...]
In this example, the value of OZONE_LAG24 is the value of OZONE from the previous
day. It is missing in the first observation since there is no previous day. As you
probably figured out by looking at the program and the listing, the LAG2 function
returns the value from two days earlier. There is a whole family of LAG functions. Now,
you may wonder why the definition seemed so strange. Why didn't we just say that the
LAG function returns a value from the previous observation? Well, because it doesn't
always. Look carefully at the following program:
DATA LAGGARD;
   INPUT X;
   IF X GT 5 THEN LAG_X = LAG(X);
DATALINES;
7
9
1
;
PROC PRINT DATA=LAGGARD;
   TITLE "Demonstrating a Feature of the LAG Function";
RUN;
Obs    X      LAG_X
1      7      .
2      9      7
3      1      .
4      [...]  9
[...] raw data and created in assignment statements in a DATA [...]
time the DATA step iterates.) In observation 4, X [...]
What is the value of X the last time the LAG function [...]
observation 2 and the value of X was 9. That is what the [...]
is the bottom line here? You usually do not want to [...]
conditionally. As long as you execute the LAG (or LAG2 [...]
iteration of the DATA step, you can think of the function [...]
previous observation.
We are now ready to compute differences in [...]
from visit to visit. First the program, then the explanation [...]
*Version using the LAG function;
DATA DIFFERENCE;
   SET PATIENTS;
   BY ID;
   DIFF_HR  = HR  - LAG(HR);
   DIFF_SBP = SBP - LAG(SBP);
   DIFF_DBP = DBP - LAG(DBP);
   IF NOT FIRST.ID THEN OUTPUT;
RUN;
*Version using the DIF function;
DATA DIFFERENCE;
   SET PATIENTS;
   BY ID;
   DIFF_HR  = DIF(HR);
   DIFF_SBP = DIF(SBP);
   DIFF_DBP = DIF(DBP);
   IF NOT FIRST.ID THEN OUTPUT;
RUN;
Well, don't do it! Remember you have to "prime the pump" and execute the LAG
function for every observation. As long as we do not output an observation for the first
visit, all is well.
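The difference between the two habits can be sketched side by side. This is an illustration only: the data set names WRONG and RIGHT and the variable PREV_HR are hypothetical, and the first step is an assumed example of the kind of conditional call being warned against.

```sas
/* Risky: LAG runs only on non-first visits, so it returns the HR from   */
/* the last time it ran, which is not always the previous observation.   */
DATA WRONG;
   SET PATIENTS;
   BY ID;
   IF NOT FIRST.ID THEN DIFF_HR = HR - LAG(HR);
RUN;

/* Safer: run LAG on every iteration, then decide whether to use it. */
DATA RIGHT;
   SET PATIENTS;
   BY ID;
   PREV_HR = LAG(HR);                           /* always executed        */
   IF NOT FIRST.ID THEN DIFF_HR = HR - PREV_HR;
   DROP PREV_HR;
RUN;
```

In WRONG, the second visit of the first patient gets a missing difference, because that is the first time LAG executes.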
Here is the listing of the data set DIFFERENCE:
ID  DATE  HR  SBP  DBP  DX  DOCFEE  LABFEE  DIFF_HR  DIFF_SBP  DIFF_DBP
G. COMPUTING THE DIFFERENCE BETWEEN THE FIRST AND LAST OBSERVATION FOR EACH SUBJECT
What if you want to see the differences of heart rate and blood pressure from the first
visit to the last? You need a way of "remembering" a value from a previous observation.
The SAS tool that does this for us is a retained variable. Using a RETAIN statement,
we can tell SAS not to set the value in the PDV (program data vector) to missing
when the DATA step iterates. So, if you set the value of a retained variable, it stays at
that value until you change it. Let's see how we can use this to compute our difference
scores. Here is the program:
DATA FIRST_LAST;
   SET PATIENTS;
   BY ID;
[...]
005  07/05/1983  74  140  82  13
007  12/01/1983  72  130  90  20
          FIRST_  FIRST_
FIRST_HR  SBP     DBP     D_HR  D_SBP
80        180     96      -6    -40
70        120     80       2     10
As you can see, the LAG function only executes when we are reading the first or last
visit for each patient. When we read the last visit (LAST.ID is true), the difference is
the current value minus the value the last time the LAG function executed, which was
the first visit. So, when LAST.ID is true, we output the observation.
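The retained-variable approach described at the start of this section might be sketched as follows. The variable names follow the listing above; deleting patients with a single visit is an assumption made here so that only patients with two or more visits appear in the result.

```sas
DATA FIRST_LAST;
   SET PATIENTS;                           /* assumed sorted by ID and DATE */
   BY ID;
   RETAIN FIRST_HR FIRST_SBP FIRST_DBP;    /* not reset to missing          */
   IF FIRST.ID AND LAST.ID THEN DELETE;    /* only one visit: no difference */
   IF FIRST.ID THEN DO;                    /* remember the first visit      */
      FIRST_HR  = HR;
      FIRST_SBP = SBP;
      FIRST_DBP = DBP;
   END;
   IF LAST.ID THEN DO;                     /* last visit minus first visit  */
      D_HR  = HR  - FIRST_HR;
      D_SBP = SBP - FIRST_SBP;
      D_DBP = DBP - FIRST_DBP;
      OUTPUT;
   END;
RUN;
```

Because the FIRST_ variables are retained, the values stored at the first visit are still in the PDV when the last visit is read.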
H. COMPUTING FREQUENCIES ON LONGITUDINAL DATA SETS
To compute frequencies for our diagnoses, we use PROC FREQ on our original data
set (PATIENTS). We would write:
PROC FREQ DATA=PATIENTS ORDER=FREQ;
   TITLE "Diagnoses in Decreasing Frequency Order";
   TABLES DX;
RUN;
Notice we use the DATA= option on the PROC FREQ statement to make sure
we were counting frequencies from our original data set. The ORDER= option allows
us to control the order of the categories in a PROC FREQ output. Normally, the diagnosis
categories are listed in sort-sequence order. The ORDER=FREQ option lists the
diagnoses in frequency order from the most common diagnosis to the least. While we are
on the subject, another useful ORDER= option is ORDER=FORMATTED. This [...]
ID   DX   FIRST.ID
5    13   1
5    14   0
5    14   0
5    14   0
7    14   1
7    20   0
9   137   1
*Assumes PATIENTS is sorted by ID and DX;
DATA DIAG;
   SET PATIENTS;
   BY ID DX;
   IF FIRST.DX;
RUN;
PROC FREQ DATA=DIAG ORDER=FREQ;
   TABLES DX;
RUN;
2. In a "real" study, we would probably enter the teacher's name and age only once in a separate data set and combine that
data set with the student data later on, saving some typing. However, for this example, it is simpler to include the teacher's [...]
As a first step, let's see how we can compute the mean pretest and posttest, and
gain scores for each teacher. Look at the following program:
DATA SCHOOL;
   LENGTH GENDER $ 1 TEACHER $ [...]
   INPUT SUBJECT
         GENDER $
         TEACHER $
         T_AGE
         PRETEST
         POSTTEST;
The MEANS Procedure
TEACHER   N Obs   Variable    M[...]
BLACK     2       PRETEST     43
                  POSTTEST    75
                  GAIN        31
HAYES             PRETEST     62
                  POSTTEST    84
                  GAIN        22
JONES             PRETEST     72
                  POSTTEST    86
                  GAIN        14
Instead of just printing out the results, we want to create a new data set that has
TEACHER as the unit of observation instead of SUBJECT. In our example, we have
only five teachers, but we might have 100, and they might be using different teaching
methods and be in different schools, etc. To create the new data set, we do the following:
PROC MEANS DATA=SCHOOL NOPRINT NWAY;
   CLASS TEACHER;
   VAR PRETEST POSTTEST GAIN;
   OUTPUT OUT=TEACHSUM
          MEAN=M_PRE M_POST M_GAIN;
RUN;
*To get a list of what was produced and therefore what
 is contained in the data set TEACHSUM, add the following;
PROC PRINT DATA=TEACHSUM;
   TITLE "Listing of Data Set TEACHSUM";
RUN;
*Hey! This is a good example of why comments
 are useful;
The NOPRINT option on the first line tells the program not to print the results
of this procedure (since we either already have them from the last run, or the listing
would be too large to want to look at). As an alternative, you can use PROC
SUMMARY without the NOPRINT option. It is equivalent to PROC MEANS with
the NOPRINT option. Take your pick. We want the computed statistics (means in this
case) in the new data set. To do this, we include an OUTPUT statement in PROC
MEANS. The OUTPUT statement creates a new data set. We have to give it a name of
our choosing (by saying OUT=TEACHSUM), tell it what statistics to put in it, and
what names to give those statistics.
We can output any statistics available with PROC MEANS by using the PROC
MEANS options (N, MEAN, STD, etc.) as keywords in the OUTPUT statement.
These statistics will be computed for all the variables in the VAR list and will be broken
down by the CLASS variable. Since we want only the score means in this new data set,
we said, "MEAN=M_PRE M_POST M_GAIN." These new variables represent the
means of each of the variables listed in the VAR statement, in the same order the [...]
give us only results for each TEACHER (the CLASS [...]
grand mean in the new data set. Don't forget this. We [...]
you leave this out. Your new data set (the listing from [...]
Listing of Data Set TEACHSUM
1  BLACK  1  2  43.500
2  HAYES  1     62.333
3  JONES  1     72.333
4  SMITH  1     [...]
5  WONG   1     49.500
Let's leave the explanations of the _TYPE_ [...]
The variable _FREQ_ gives us the number of observations [...]
for each value of the CLASS variable. If you go back [...]
you will see that teacher BLACK had two students [...]
and so forth.
What if you wanted the teacher's age in this new [...]
age to gain score, for example)? This is easily accomplished [...]
statement as part of PROC MEANS. So, to include the [...]
would use the following code:
PROC MEANS DATA=SCHOOL NOPRINT NWAY;
   CLASS TEACHER;
   ID T_AGE;
   VAR PRETEST POSTTEST GAIN;
   OUTPUT OUT=TEACHSUM
          MEAN=M_PRE M_POST M_GAIN;
RUN;
The resulting data set (TEACHSUM) will now [...]
an alternative, you could have included both variables [...]
CLASS variables with the same result.
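That alternative might be sketched as follows, assuming the SCHOOL data set from this section and assuming T_AGE has a single, nonmissing value for each teacher (observations with a missing CLASS value are excluded unless the MISSING option is used).

```sas
PROC MEANS DATA=SCHOOL NOPRINT NWAY;
   CLASS TEACHER T_AGE;          /* T_AGE rides along as a second CLASS variable */
   VAR PRETEST POSTTEST GAIN;
   OUTPUT OUT=TEACHSUM
          MEAN=M_PRE M_POST M_GAIN;
RUN;
```

With NWAY, only the TEACHER by T_AGE combinations are written, which gives one observation per teacher when each teacher has one age.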
01 M North 70 200
02 M North 72 220
03 M South 68 155
04 M South 74 210
05 F North 68 130
06 F North 63 110
07 F South 65 140
08 F South 64 108
09 F South  . 220
10 F South 67 130
DATA DEMOG;
   LENGTH GENDER $ 1 REGION $ 5;
   INPUT SUBJ GENDER $ REGION $ HEIGHT WEIGHT;
DATALINES;
01 M North 70 200
02 M North 72 220
03 M South 68 155
04 M South 74 210
05 F North 68 130
06 F North 63 110
07 F South 65 140
08 F South 64 108
09 F South . 220
10 F South 67 130
;
To compute the number of subjects, the mean, and the standard deviation for each
combination of GENDER and REGION, include a CLASS statement with PROC
MEANS like this:
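A minimal sketch of such a step, assuming the DEMOG data set created above. The statistic keywords and the MAXDEC= setting are illustrative choices.

```sas
PROC MEANS DATA=DEMOG N MEAN STD MAXDEC=2;
   CLASS GENDER REGION;       /* one block of statistics per combination */
   VAR HEIGHT WEIGHT;
RUN;
```

Unlike a BY statement, a CLASS statement does not require the data to be sorted first.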
Remember that you do not have to sort your [...]
statement with PROC MEANS. In this example, we have [...]
of one. The output from this procedure is shown next.
Output from PROC MEANS
The MEANS Procedure
GENDER  REGION  N Obs  Variable  N  [...]
F       North   2      HEIGHT    2
                       WEIGHT    2
        South   4      HEIGHT    3
                       WEIGHT    4
M       North   2      HEIGHT    2
                       WEIGHT    2
        South   2      HEIGHT    2
                       WEIGHT    2
PROC MEANS DATA=DEMOG NOPRINT;
   CLASS GENDER REGION;
   VAR HEIGHT WEIGHT;
   OUTPUT OUT=SUMMARY
          MEAN=M_HEIGHT M_WEIGHT;
RUN;
***Add a PROC PRINT to list the observations;
PROC PRINT DATA=SUMMARY;
   TITLE "Listing of Data Set SUMMARY";
RUN;
[...] to create a new data set, to select which statistics to place in this data set, and what
names to give to each of the requested statistics. The name of the output data set is
placed after the OUT= keyword. The request to output means is indicated by the
keyword MEAN=. The two variable names following the keyword MEAN= are
names you choose to represent the mean HEIGHT and WEIGHT, respectively. The
order of the names following MEAN= corresponds to the order of the variable
names in the VAR statement. In this example, the variable M_HEIGHT will represent
the mean height, and the variable M_WEIGHT will represent the mean weight.
Other keywords (chosen from the list of statistics available with PROC MEANS
found in Chapter 2, Section B) can be used to output statistics such as standard
deviation (STD=) or sums (SUM=).
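Using a PROC PRINT with DATA=SUMMARY to see the contents of this
new data set, we obtain the following listing:
Listing of Data Set SUMMARY
Obs  GENDER  REGION  _TYPE_  _FREQ_  M_HEIGHT  M_WEIGHT
1                    0       10      67.8889   162.300
2            North   1        4      68.2500   165.000
3            South   1        6      67.6000   160.500
4    F               2        6      65.4000   139.667
5    M               2        4      71.0000   196.250
6    F       North   3        2      65.5000   120.000
7    F       South   3        4      65.3333   149.500
8    M       North   3        2      71.0000   210.000
9    M       South   3        2      71.0000   182.500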
Besides the mean for each combination of GENDER and REGION, we see there
are five additional observations and two additional variables, _TYPE_ and _FREQ_.
Here's what they're all about. The first observation with a value of 0 for _TYPE_ is the
mean of all nonmissing values (9 for HEIGHT and 10 for WEIGHT) and is called the
grand mean. The two observations with _TYPE_ equal to 1 are the mean HEIGHT and
WEIGHT for each REGION; the next two observations with _TYPE_ equal to 2 are
the mean HEIGHT and WEIGHT for each GENDER. Finally, the last four observations
with _TYPE_ equal to 3 are the means by GENDER and REGION (sometimes
called cell means). This is getting complicated! Relax, there is actually a way to tell
which _TYPE_ value corresponds to which breakdown of the data.
GENDER  REGION  _TYPE_  Interpretation
0       0       0       Grand mean (all observations)
0       1       1       Mean for each REGION
1       0       2       Mean for each GENDER
1       1       3       Mean for each GENDER and REGION combination
Next, we can come up with a simple rule. When [...]
binary, gives you a "1" beneath a CLASS variable, [...]
that variable. If we look at _TYPE_ = 1, we write [...]
and realize that the _TYPE_ = 1 statistics represent [...]
confused? It's OK, this is not easy.
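The binary rule can be checked mechanically. The sketch below assumes SUMMARY was created without CHARTYPE, so that _TYPE_ is numeric; the data set name TYPES and the variables TYPE_BIN, GENDER_USED, and REGION_USED are hypothetical.

```sas
DATA TYPES;
   SET SUMMARY;
   LENGTH TYPE_BIN $ 2;
   TYPE_BIN = PUT(_TYPE_, BINARY2.);               /* 0, 1, 2, 3 become 00, 01, 10, 11 */
   GENDER_USED = (SUBSTR(TYPE_BIN, 1, 1) = '1');   /* first CLASS variable             */
   REGION_USED = (SUBSTR(TYPE_BIN, 2, 1) = '1');   /* second CLASS variable            */
RUN;
```

Each digit of TYPE_BIN lines up with one CLASS variable, in the order the variables appear on the CLASS statement.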
An alternative to interpreting the _TYPE_ variable [...]
PROC MEANS (or PROC SUMMARY) option CHARTYPE [...]
option, the _TYPE_ variable is a character variable [...]
how this works, let's run the previous program with the [...]
PROC MEANS DATA=DEMOG NOPRINT CHARTYPE;
   CLASS GENDER REGION;
   VAR HEIGHT WEIGHT;
   OUTPUT OUT=SUMMARY
          MEAN=M_HEIGHT M_WEIGHT;
RUN;
The resulting data set SUMMARY now looks like this:
Listing of Data Set SUMMARY
GENDER  REGION  _TYPE_  _FREQ_  M_HEIGHT  M_WEIGHT
                00      10      67.8889   162.300
        North   01       4      68.2500   165.000
        South   01       6      67.6000   160.500
F               10       6      65.4000   139.667
M               10       4      71.0000   196.250
F       North   11       2      65.5000   120.000
F       South   11       4      65.3333   149.500
M       North   11       2      71.0000   210.000
M       South   11       2      71.0000   182.500
Notice that the values of _TYPE_ are now strings of 1s and 0s. You can use this
variable to select which means you are interested in. Suppose you wanted a separate data
set for each of the _TYPE_ values. You can create several data sets at one time, like this:
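A minimal sketch of such a step, assuming SUMMARY was created with the CHARTYPE option and CLASS GENDER REGION. The four output data set names are illustrative.

```sas
DATA GRAND BY_REGION BY_GENDER CELLS;         /* four data sets in one step    */
   SET SUMMARY;
   IF      _TYPE_ = '00' THEN OUTPUT GRAND;      /* grand mean                 */
   ELSE IF _TYPE_ = '01' THEN OUTPUT BY_REGION;  /* means for each REGION      */
   ELSE IF _TYPE_ = '10' THEN OUTPUT BY_GENDER;  /* means for each GENDER      */
   ELSE IF _TYPE_ = '11' THEN OUTPUT CELLS;      /* GENDER by REGION cells     */
RUN;
```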
This program demonstrates several things. First, you can create more than one SAS data
set in one DATA step. To do this, you list all the data sets you want to create on the
DATA statement. Next, you use an OUTPUT statement to force an output at that point
in the DATA step. You also need to name the data set you want to output. Otherwise,
SAS will output an observation to all the data sets listed in the DATA statement. Finally,
you can see how the _TYPE_ variable lets you choose which sets of means you want to
output. Using the CHARTYPE option with PROC MEANS really makes the process
of choosing the correct value of _TYPE_ much easier. You don't even have to know
how to count in binary!
For most applications, you don't even need to look at the _TYPE_ values. Since
most applications call for cell means (the values broken down by each of the class
variables), you will want the highest value of the _TYPE_ variable. If you include the option
NWAY on the PROC MEANS statement, only cell means will be output to the new
data set. So, if you only want the mean HEIGHT and WEIGHT for each combination
of GENDER and REGION, you would write the PROC MEANS statements like this:
PROC MEANS DATA=DEMOG NOPRINT NWAY;
   CLASS GENDER REGION;
   VAR HEIGHT WEIGHT;
   OUTPUT OUT=SUMMARY
          N=N_HEIGHT N_WEIGHT
          MEAN=M_HEIGHT M_WEIGHT;
RUN;
Listing of Data Set SUMMARY
Obs  GENDER  REGION  _TYPE_  _FREQ_  N_HEIGHT  N_WEIGHT  M_HEIGHT  M_WEIGHT
1    F       North   3       2       2         2         65.5000   120.0
2    F       South   3       4       3         4         65.3333   149.5
3    M       North   3       2       2         2         71.0000   210.0
4    M       South   3       2       2         2         71.0000   182.5
Observe that the value for N_HEIGHT is 3 for females from the South, while the
value of _FREQ_ is 4 (there was a missing HEIGHT for a female from the South).
Finally, if you use the NWAY option, there is not much need to keep the _TYPE_
variable in the output data set. You can use a DROP= data set option to omit this
variable. The program, modified to do this, is shown here:
PROC MEANS DATA=DEMOG NOPRINT NWAY;
   CLASS GENDER REGION;
   VAR HEIGHT WEIGHT;
   OUTPUT OUT=SUMMARY(DROP=_TYPE_)
          N=N_HEIGHT N_WEIGHT
          MEAN=M_HEIGHT M_WEIGHT;
RUN;
Using this method, the variable names in the new summary data set will be the
same as those listed on the VAR statement. That is, the variable name representing
the mean height will be HEIGHT, and the variable name representing the mean
weight will be WEIGHT. This is probably a bad idea since you may get confused and
not realize that a variable name represents a summary statistic and not the original
value. (Actually, that other author would not even put in (DROP=_TYPE_) since it
takes up too much time, and he doesn't mind the extra variable in the printout.)
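The shortcut under discussion, a MEAN= keyword with no names after it, might look like the first step below. The second step is an added alternative rather than something from this section: the AUTONAME option of the OUTPUT statement asks SAS to build names from the variable and the statistic. The data set name SUMMARY2 is hypothetical.

```sas
/* MEAN= with no names: the means are stored in HEIGHT and WEIGHT */
PROC MEANS DATA=DEMOG NOPRINT NWAY;
   CLASS GENDER REGION;
   VAR HEIGHT WEIGHT;
   OUTPUT OUT=SUMMARY(DROP=_TYPE_) MEAN=;
RUN;

/* AUTONAME: names such as HEIGHT_Mean and HEIGHT_N are generated */
PROC MEANS DATA=DEMOG NOPRINT NWAY;
   CLASS GENDER REGION;
   VAR HEIGHT WEIGHT;
   OUTPUT OUT=SUMMARY2(DROP=_TYPE_) MEAN= N= / AUTONAME;
RUN;
```

The second form keeps the typing short while still making it clear that a variable holds a summary statistic.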
Suppose you want the number of nonmissing values [...]
minimum, and the maximum value for each combination [...]
in the DEMOG data set. The following program would [...]
PROC MEANS DATA=DEMOG NOPRINT NWAY;
   CLASS GENDER REGION;
   VAR HEIGHT WEIGHT;
   OUTPUT OUT=SUMMARY
          MEAN=MEAN_HEIGHT MEAN_WEIGHT
          N=N_HEIGHT N_WEIGHT
          MEDIAN=MEDIAN_HEIGHT MEDIAN_WEIGHT
          MIN=MIN_HEIGHT MIN_WEIGHT
          MAX=MAX_HEIGHT MAX_WEIGHT;
RUN;
PROBLEMS
Remember, you can download all the data sets and programs for these problems from the web
site: www.prenhall.com/cody
4.1 We have collected data on a questionnaire as follows:
Variable    Starting Column   Length   Description
ID          1                 3        Subject ID
DOB         5                 8        Date of birth in MMDDYY format
ST_DATE     13                8        Start date in MMDDYY format
END_DATE    21                8        Ending date in MMDDYY format
SALES       29                4        Total sales
Here is some sample data:
         1         2
123456789012345678901234567890   Column Indicators
00t I02I1"94611L2L9
80722819887343
002 0913195502021980020419880123
005 0606194003121981031220040000
003 0705194411151980111320009544
[...] sales per year computed in part (c). Use the [...]
(e) Modify the program to compute AGE as of [...]
rounded to the nearest 10 dollars. Try using the [...]
4.2 Run the following DATA step to create a SAS data [...]
new SAS data set called AGES that contains all the [...]
new variables. One is AGE_ACTUAL, which is the [...]
2005. The second is AGE_TODAY, which is the age [...]
rounded to the nearest tenth of a year. The third [...]
dropped, as of the date stored in the variable VISIT_DATE [...]
new data set.
***Program to create data set ABC_CORP;
DATA ABC_CORP;
   DO SUBJ = 1 TO 10;
      DOB = INT(RANUNI(1234)*15000);
      VISIT_DATE = INT(RANUNI(0)*1000) + '01JA[...]
      OUTPUT;
   END;
   FORMAT DOB VISIT_DATE DATE9.;
RUN;
4.3 For each of eight mice, the date of birth, date of disease [...]
addition, the mice are placed into one of two groups [...]
compute the time from birth to disease, the time from [...]
death. All times can be in days. Compute the mean, [...]
of these three times for each of the two groups. Here [...]
RAT_NO  DOB        DISEASE
1       23MAY1990  23JUN1990
2       21MAY1990  27JUN1990
3       23MAY1990  25JUN1990
4       27MAY1990  07JUL1990
5       22MAY1990  29JUN1990
6       26MAY1990  03JUL1990
7       24MAY1990  01JUL1990
8       29MAY1990  15JUL1990
*Use LENGTH statement to control the order of
 variables in the data set;
ffl ffiT.I'E'H:-'rsrr8;
   DO PATIENT = 1 TO 25;
      IF RANUNI(135) LT .5 THEN GENDER = 'Female';
      ELSE GENDER = 'Male';
      X = RANUNI(135);
      IF X LT .33 THEN GROUP = 'A';
      ELSE IF X LT .66 THEN GROUP = 'B';
      ELSE GROUP = 'C';
      DO VISIT = 1 TO INT(RANUNI(135)*5);
         IF VISIT = 1 THEN DO;
            DATE_VISIT = INT(RANUNI(135)*100) + 15800;
            WEIGHT = INT(RANNOR(135)*10 + 150);
         END;
         ELSE DO;
            DATE_VISIT = DATE_VISIT + VISIT*(10 + INT(RANUNI(135)*50));
#;:,:..'_ ":_:"u'-"'i
         IF RANUNI(135) LT .2 THEN LEAVE;
      END;
   END;
   DROP X;
   FORMAT DATE_VISIT DATE9.;
RUN;
*4.5 Using the data set (PATIENTS) described in Section D of this chapter (use the program
with sample data), write the necessary SAS statements to create a data set
(PROB4_5) in which the first visit for each patient is omitted. Then, using that data set,
compute the mean HR, SBP, and DBP for each patient. (Patient 9 with only one visit
will be eliminated.)
4.6 Using data set CLINICAL from Problem 4.4, create a new SAS data set (CHANGE) with
one observation per subject with the difference in WEIGHT between the first and last [...]
*4.9 We have a data set called BLOOD that contains from [...]
Each observation contains the variables ID, GROUP, [...]
RBC (red blood cells). Run the following program to [...]
We want to create a data set that contains the mean [...]
new data set should contain the variables ID, GROUP [...]
M_WBC and M_RBC are the mean values for the [...]
subjects from this data set who have two or fewer observations [...]
assume there are no missing values).
HINT: We will want to use PROC MEANS with [...]
both ID and GROUP in the new data set, you can [...]
include an ID statement (ID GROUP;) to cause the [...]
output data set. Also, remember the _FREQ_ variable [...]
be useful for creating a data set that meets the last [...]
two or fewer observations.
*4.10 Using data set CLINICAL from Problem 4.4, create [...]
mean, median, and standard deviation broken down [...]
statement). Using this summary data set, create four [...]
to control which observations go into each of the data sets. Note that you can accomplish
this in one DATA step. Use the CHARTYPE option to make this problem easier.