
String functions

Search for a word in a string: 


 gen x = regexm(var1,"RSV")     codes 1 if var1 contains the text "RSV", 0 if not
 gen x1 = regexs(1) if regexm(name, "([a-zA-Z]+)[ ]*([a-zA-Z]+)")     extracts 1st word from string "name"
 gen x2 = regexs(2) if regexm(name, "([a-zA-Z]+)[ ]*([a-zA-Z]+)")     extracts 2nd word from string "name" (regexs(2) is the 2nd bracketed group)
 gen x = strpos(var1,"Dr") > 0     codes 1 if "Dr" appears in the variable, 0 if not
 display strpos("stata","a")     returns 3 (position of the first letter "a")
 For tips on regexm code: http://www.ats.ucla.edu/stat/stata/faq/regex.htm : how to extract complex numbers or letters fro
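
A small worked example of the search functions above, meant as a sketch rather than a recipe. The variable name diagnosis and its three values are made up for illustration; run it from a do-file so the // comments are accepted.

```stata
* Toy data (hypothetical values, only to show the functions at work)
clear
input str30 diagnosis
"RSV bronchiolitis"
"Asthma"
"Seen by Dr Smith"
end

gen byte has_rsv = regexm(diagnosis, "RSV")        // 1 if "RSV" occurs anywhere in the string
gen byte has_dr  = strpos(diagnosis, "Dr") > 0     // same idea with strpos(): position > 0 means found
gen str30 word1  = regexs(1) if regexm(diagnosis, "^([a-zA-Z]+)")   // first run of letters
list
```

regexs() only returns something after a successful regexm() on the same observation, which is why the two are always paired in one command.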

 
Extract dates from strings
 gen yr = regexs(0) if regexm(date,"[0-9]*$")     extracts year from a string date such as 23jan2009
 gen mon = regexs(0) if regexm(date,"[a-zA-Z]+")     extracts month from 23jan2009
 gen dy = regexs(0) if regexm(date,"^[0-9]+")     extracts day from 23jan2009
 (date must be a string variable here; for a numeric %td date use year(), month() and day() instead)
 
Split a string into several variables at a defined split point, eg: "Diagnosis:cardiac" to "Diagnosis" and "cardiac"
 split var1, parse(:) limit(2)       splits string "var1" at each colon ":" and creates at most 2 new variables, var11 and var12 (set by "limit")
 
Extract letters from string
 substr(var1,1,1)  chooses first letter of string (substr always needs three arguments)
 substr(var1,1,2)  chooses first 2 letters of string
 substr(var1,-1,1)  chooses last letter of string
 substr(name,1,strpos(name,",")-1)  extracts from start of name up to (not including) the first comma
 substr("abcdef",2,3) gives "bcd"
 substr("abcdef",-3,2)  gives "de"
 substr("abcdef",2,.)  gives "bcdef" substr("abcdef",-3,.) = "def"
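
The substr() rules can be checked quickly with display. This is a minimal sketch; the local macro name and the string "Smith, John" are made up for illustration.

```stata
display substr("abcdef", 1, 1)      // a  (first letter)
display substr("abcdef", -1, 1)     // f  (last letter)

* Everything before the first comma
local name "Smith, John"
local comma = strpos("`name'", ",")          // position of the comma (6 here)
display substr("`name'", 1, `comma' - 1)     // Smith
```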
 
Isolate parts of string
 gen x = strpos(diagnosis,"RSV") >0  codes 1 for any time the string "RSV" appears in the field (if it appear
  regexm(var1,"RSV")  codes 1 if word "RSV" appears in var1 string 

 
Join strings together (concatenate). concat() is part of official egen, so egenmore.ado is not needed for it
 egen initials = concat(x1 x2)
 egen newvar = concat(x y z), punct(-)  ..creates x-y-z; put any punctuation character inside punct()

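Convert string to number

 encode ethnic_str, gen(ethnic)     gives each distinct string a numeric code, with the string kept as value label
 destring var1, replace     use this instead when the string holds digits (eg "12.5")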

Extract date (12/03/2009)  from a long date format (12/03/2009 00:00:00)


 gen x=substr(date,1,10)

Replace
 replace myvar = myvar[_n-1] if missing(myvar)  ....replace missing value with previous value
 replace x = round(x,0.1)......round a value to 1 decimal place (round(x,1) rounds to the nearest whole number)
 replace x = 1 if _n<= 100 ...replace first 100 values 
 replace x = 1 if (_N - _n) < 100....replace last 100 values 
 replace x = 1 in 4 ....replaces 4th value in x with 1
 replace x=1 in 5/25  ....replaces from value 5 to 25
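
A short sketch of the carry-forward replace on made-up data (the variable names id and myvar and their values are hypothetical). Because replace works through the observations in order, a run of missing values is filled from the last nonmissing value above it.

```stata
clear
input id myvar
1 10
2 .
3 .
4 7
5 .
end

replace myvar = myvar[_n-1] if missing(myvar)   // obs 2 and 3 become 10, obs 5 becomes 7
replace myvar = 0 if (_N - _n) < 1              // the same _N/_n logic, here hitting only the last observation
list
```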

Stats: Descriptive
 summarize var1.........this gives N, mean, SD, min and max
 summarize var1, detail (this gives the full breakdown, adding percentiles, variance, skewness and kurtosis)
 values are stored and can be retrieved with return command eg r(mean), r(p75), r(p25), r(sd) SEM = r(
 to see return lists type return list
 statsby mean=r(mean) sd=r(sd) size=r(N), by(group): summarize var1     gives grouped stats (replaces the data in memory; summarize stores no r(sem), so compute SEM as sd/sqrt(size))
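
A minimal sketch of using the stored results, assuming the auto dataset that ships with Stata (sysuse auto); price and foreign are variables in that dataset, not in the notes above.

```stata
sysuse auto, clear

quietly summarize price
display "mean = " r(mean) "   SEM = " r(sd)/sqrt(r(N))   // SEM built from the stored results

* Grouped version: statsby replaces the data with one row per group
statsby mean=r(mean) sd=r(sd) size=r(N), by(foreign) clear: summarize price
gen sem = sd/sqrt(size)
list
```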

Stats for all groups of categorical variable


 table x, contents(freq mean var1 sd var1)
 table x, contents(freq median var1 p25 var1 p75 var1)    ....stats on categorical  variable x including qu

 table x y, contents(freq mean var1)  ................................stats on two categorical variables (x and y)

 tabi a b \ c d, column chi2 exact     quick way to get categorical stats; a b c d are the four cell counts, rows separated by \
 table slow class ethnic, contents(mean days sd days) by(gender) format(%4.1f)

Stats on multiple  variables grouped using by command 


 tabstat age weight height, statistics(n mean sd) by(sex) format(%4.1f)
 tabstat x, by(group) stat(n)

 tabstat x if y==1, by(year) stats(median min max)

How to make stats per quartile of a variable


 xtile quart = age, nq(4)     creates quart = 1 to 4 for the quartiles of age
 tabstat age, statistics(n mean sd) by(quart)
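
The quartile recipe, sketched on the auto dataset shipped with Stata (an assumption for illustration; price_q is a made-up variable name).

```stata
sysuse auto, clear
xtile price_q = price, nq(4)                                // 1 = lowest quarter of prices, 4 = highest
tabstat price, statistics(n mean sd min max) by(price_q)    // min and max show the range of each quartile
```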

Populate a new variable with stats by a grouped categorical variable


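 bysort group: egen var_new = median(y)
 bysort id: egen median_weight = median(weight)

 bysort yr: egen x = rank(lov)

 bysort group: egen var_new = count(y)
 bysort group: egen var_new = sum(y)
 bysort group: egen var_new = pctile(y), p(75)     ie 3rd quartile
 bysort yr: egen x = mean(y)
 bysort yr: gen z = x-y if w==3 & y==4     use gen (not egen) for plain arithmetic; the if condition cannot refer to z itself, so w stands for another existing variable
 bysort group: tabstat los, by(sex) statistics(count mean)
 bysort yr: gen ranked_set = _n  (sequential case number by group eg year; _n needs gen, not egen)
 (by group: only works on data already sorted by group; bysort sorts first)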
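
A sketch of the grouped egen pattern on the auto dataset shipped with Stata (assumed here for illustration; the new variable names are made up).

```stata
sysuse auto, clear
bysort foreign: egen med_price = median(price)          // group median repeated on every row of the group
bysort foreign: egen p75_price = pctile(price), p(75)   // 3rd quartile per group
bysort foreign (price): gen rank_in_group = _n          // 1,2,3... within group, ordered by price
gen delta = price - med_price                           // deviation from the group median
list foreign price med_price p75_price rank_in_group delta in 1/5
```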

Make a new dataset of descriptive stats using collapse function


 collapse (p25) x, by(group)
 collapse (mean) x y

 statistics that can be used include mean,median,p50,p75,p25,sd,semean,sum,count,rawsum,min,m
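
collapse replaces the data in memory, so one cautious pattern is to wrap it in preserve and restore. A sketch on the auto dataset shipped with Stata (an assumption; the new variable names are made up):

```stata
sysuse auto, clear
preserve                                   // keep a copy of the full data
collapse (mean) mean_price=price (median) med_price=price (count) n=price, by(foreign)
list                                       // one row per group
restore                                    // back to the original 74 cars
```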

Cumulative sum by group


 bysort group: gen tot = sum(x)     running (cumulative) total of x within each group

Highest value record


 egen high = record(wage), by(group) order(yr)     record() is an egenmore function (ssc install egenmore)

Count distinct values in a variable

bysort x y: generate count = (_n==1)
by x: replace count = sum(count)
by x: replace count = count[_N]     count now holds the number of distinct values of y within each x

Other method:

egen tag = tag (x y)


egen count=total(tag), by(x)

or egen count = nvals(y), by(x)     (nvals() is an egenmore function; counts distinct y within each x, like the tag method)
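
The tag method, sketched on the auto dataset shipped with Stata (assumed for illustration): how many distinct repair records (rep78) occur among domestic and among foreign cars.

```stata
sysuse auto, clear
egen byte tagged = tag(foreign rep78)       // 1 on the first row of each foreign/rep78 pair (missing rep78 not tagged)
egen n_rep = total(tagged), by(foreign)     // number of distinct rep78 values within each foreign group
tabulate foreign, summarize(n_rep)
```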


Stats: Basic stuff
Basic stats:                                      good reference: http://www.ats.ucla.edu/stat/stata/whatstat/whatstat.htm

 summarize x
 summarize x, detail  (this gives details such as percentiles)

Stats by group: 

 bysort yr: tabstat bedays, by(mo) statistics(median p25 p75)

 table mo yr, contents(median bedays)

 table x, contents(mean variableX sd variableX count variableX)

GENERATE median value for each SUBGROUP

 by subgrp, sort: egen medstay = median(los)

GENERATE the deviation from the median length of stay

 generate deltalos = los - medstay

SUBGROUP AGGREGATES (eg by month)

 by mnth, sort: egen monthmedn= median(daymax)


 by mnth, sort: egen monthmax= max(daymax)

 by mnth, sort: egen monthmin= min(daymax)

 by mnth, sort: egen month25= pctile(daymax),p(25)

 by mnth, sort: egen month75= pctile(daymax),p(75)

 Frequency distributions

 tab x y
 tab x y, row         gives % for row  (can also use column)

 tab x y, chi2        for chi squared (use exact for fishers)

Fishers or CHI squared


use cci command: 
cci 12 23 24 56, exact    .....will give this output:
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        12          23  |         35       0.3429
        Controls |        24          56  |         80       0.3000
-----------------+------------------------+------------------------
           Total |        36          79  |        115       0.3130
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.217391       |    .4710021     3.04953 (exact)
 Attr. frac. ex. |         .1785714       |   -1.123133    .6720806 (exact)
 Attr. frac. pop |         .0612245       |
                 +-------------------------------------------------
                                  1-sided Fisher's exact P = 0.4023
                                  2-sided Fisher's exact P = 0.6669


 
 
 tab x y, chi2 nofreq column row  (nofreq hides the frequencies; use exact instead of chi2 for Fisher's exact test)
 tabi 54 43 \ 56 78, column chi2  is chi2 from cell counts typed in directly
 tab1 x y z   produces one-way frequency tables for multiple variables
 tab1 varx-varz
 by group, sort: tab x y, nofreq col chi2
 tab3way is a good ado file for multi-column tables
Odds ratio

 cci 21 16 1 4 , exact
 exactcci 21 16 1 4, exact










  
 Diagnostic tests (sensitivity, specificity, NPV, PPV) use module "diagt"

 diagti 80 17 11 44

Showing means and descriptive stats in tables


 tab x, summ(y)  shows basic stats (mean sd freq) for groups in x
 tab x1 x2, summ(y)  means   two way table of x1 and x2 with means of y
 table x1 x2, contents(mean y1 median y2)
 stats(q) in tabstat gives the quartiles (p25 p50 p75); stats(iqr) gives the interquartile range
 
for tab command:
 ,cell gives % for each cell
 ,expected  gives expected distributions
 ,generate(new) creates dummy variables  eg. tab x1,gen(dummy)  produces dummy1 dummy2 dummy3 for each
 ,lrchi  likelihood ratio
 ,missing
 ,nofreq
 ,nolabel
 
Multi-table frequencies
 table y x2 x3, by(x5 x6) contents (freq)  
 by x3, sort: tab x1 x2, exact  (This is a 3 way table)
 
other stats for contents:  
 freq,mean,sd,sum,rawsum,count,n,max,min,median,iqr,p1,p99,p75
 use the format table x y, contents(mean z median z)     (contents() belongs to table, not tab)
 tab died yr, summ(pim) means    two-way table of means
 table x, contents(mean z)
 
Table function with contents
 tab group cs, sum(los) means
 table group cs, contents (count los median los)
 table inhouse yr, contents (count los median los)
 table band2 yr, contents(count los)
 table age yr, contents(count los)
 table x y, by(z)
 
Confidence intervals
 ci x, level(99)     produces the 99% confidence interval for the mean of x (from Stata 14 the syntax is ci means x, level(99))
TABSTAT   tabstat x,stats(….)
 
 tabstat pop, stat(mean) by(size)
 tabstat lov_days, by(yr) stat(mean sd min max) nototal long
 tabstat lov_days, by(yr) stat(n q) nototal long
 tabstat x, stat(mean count median) by(var2)
 tabstat x, stat(count mean q) by(y)    q gives the quartiles (p25 p50 p75); use iqr for the interquartile range
 bysort x: tabstat y if z==2 & q==0, statistics(n mean q)
 record(x) is highest value of x (egenmore function)
 
 
     these are options between brackets:
·        mean 
·         count (count of nonmissing observations)
·        n same as count
·         sum sum
·         max maximum
·         min minimum
·         range range = max - min
·         sd standard deviation
·         semean standard error of mean = sd/sqrt(n)
·         skewness skewness
·         kurtosis kurtosis
·         median median (same as p50)
·         p1 1st percentile
·         p5 5th percentile
·         p10 10th percentile
·         p25 25th percentile
·         p50 50th percentile (same as median)
·         p75 75th percentile
·         p90 90th percentile
·         p95 95th percentile
·         p99 99th percentile
·         iqr interquartile range = p75 - p25
·         q equivalent to specifying "p25 p50 p75"
 
OTHER BASIC STATS TESTS
 
Normality tests
 sktest x (skewness and kurtosis test)
 swilk x (Shapiro-Wilk)

 ladder x (tries each transformation on the ladder of powers, with a normality test for each)

 gladder x (plots histograms of x under each transformation)

Parametric one-sample t-test:

  ttest write == 50  (does mean differ from 50)

Non-parametric one-sample (Wilcoxon signed-rank test)

 signrank write = 50  (eg. does median differ from 50)

Binomial test          bitest female == .5  (eg. does proportion differ from 50%)

Parametric Two independent samples t-test

  ttest x, by(group)
Non Parametric Mann-Whitney test

 ranksum x, by(group)

Parametric Paired t test   


  ttest x = y

Non-parametric paired (Wilcoxon)  


 signrank x = y

Parametric one-way ANOVA: anova x y     (x is the outcome, y the grouping variable; oneway x y, tabulate also lists the group means)

Non-parametric Kruskal-Wallis: kwallis x, by(y)
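
The tests above, run back to back on the auto dataset shipped with Stata. This is only a sketch: the dataset, the variables mpg, foreign and rep78, and the comparison value 20 are assumptions for illustration.

```stata
sysuse auto, clear

ttest mpg == 20                 // one-sample t-test: does mean mpg differ from 20?
signrank mpg = 20               // one-sample Wilcoxon signed-rank
bitest foreign == 0.5           // binomial test on a 0/1 variable

ttest mpg, by(foreign)          // two independent samples, parametric
ranksum mpg, by(foreign)        // Mann-Whitney (Wilcoxon rank-sum)

oneway mpg rep78, tabulate      // one-way ANOVA with a table of group means
kwallis mpg, by(rep78)          // Kruskal-Wallis
```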

Date stuff
Import dates from excel in dd/mm/yyyy hh:mm format (eg 12/03/2009 12:33)
 gen double dt = clock(datevariable,"DMYhm")
 format dt %tc

 label variable dt "Date"

 To convert clock date (dd/mm/yy hh:mm) to dd/mm/yyyy (eg 12/03/2001)

 gen new_date = dofc(dt)


 format new_date  %td

 To subtract dates with result in hours: (dates in %tc clock format, stored as doubles)

 gen double hrs = hours(end_dt - start_dt)

 gen los_days = hrs/24

 format los_days %9.1f
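
A worked length-of-stay example, as a sketch on one made-up admission (the string variables adm and dis and their values are hypothetical). hours() converts a difference in milliseconds to hours, so both dates must be clock (%tc) values.

```stata
clear
input str16 adm str16 dis
"12/03/2009 08:00" "14/03/2009 20:00"
end

gen double start_dt = clock(adm, "DMYhm")      // clock values must be stored as double
gen double end_dt   = clock(dis, "DMYhm")
format start_dt end_dt %tc

gen double hrs = hours(end_dt - start_dt)      // 60 hours for this row
gen los_days = hrs/24                          // 2.5 days
format los_days %9.1f
list
```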

 To generate dd/mm/yyyy date:

 gen birth = date(dob,"DMY")


 format birth %td

 label variable birth "Date of Birth"

To convert STRING date in long  format (12/03/2009 00:00:00) to short format (12/03/2009)

 gen x=substr(date,1,10) ....this keeps first 10 characters of string


 then use gen y = date(x,"DMY") followed by format y %td to get a %td date

Script to convert clock date to multiple formats

gen before = cond(hired_on < td(15jun2004), 1, 0) if hired_on < .

drop if admitted_on < tc(15jun2004 12:00:00)

gen date_tc = clock(x,"DMYhm")       // format structure is 12/03/2006 12:30


gen date_td =dofc(date_tc)               //convert to 12/03/2006   DAY/MONTH/YEAR format

gen date_tm=mofd(date_td)              //convert to 2006/03  YR/MONTH format

format date_tc %tc

format date_td %td

format date_tm %tm

label variable date_tc "Date Clock"

label variable date_td "Date"

label variable date_tm "Yr-Month"

gen yr=year(date_td)         //year

gen mo=month(date_td)    // month

gen day=day(date_td)       //day

gen doy=doy(date_td)      //day of year

Date stuff (stata 11)


Dates are counted from 1 Jan 1960 (01jan1960 = 0)
 

Format Description

%td daily

%tw weekly

%tm monthly

%tq quarterly

%th half-yearly

%ty yearly

 
 generate y = date(doa, "DMY")
 format y %td

Date field in "12/03/2009 23:34" format, use clock function (1 unit = 1 millisecond)


gen double dt_clock = clock(datevariable,"DMY hm")
To convert clock format to date format use dofc
gen newdate = dofc(dt_clock)   ie 12/03/2009            then format newdate %td  (format dt_clock %tc shows hh:mm as well)

Date in "12/03/2009" format, use date function (1 unit = 1 day)


gen y = date(x,"DMY")

 if year is in format 09 instead of 2009, precede the Y by the century, eg "DM20Y"


 if dates span centuries, give the latest possible year as a third argument, eg date(x,"DMY",2020) reads 98 as 1998 and 00 as 2000
 
 
 generate birthday=mdy(month,day,year)
 generate m=month(birthday)

 generate d=day(birthday)

 generate y=year(birthday)

 dow(x)  day of week (0 = Sunday, 6 = Saturday)

 generate weeks = diff/7

 generate months = diff/30.5


 generate years = diff/365.25

 
Other functions:
 weekly(x,"wy")
 monthly(x,"my")
 quarterly(x,"qy")
 halfyearly(x,"hy")
 yearly(x,"y")
 
If three columns for each day,month and year use:
 gen y = mdy(month,day,year)   
 gen x = mdy(x,y,z)

 
 

mdy(month,day,year) daily

yw(year, week) weekly

ym(year,month) monthly

yq(year,quarter) quarterly 

yh(year,half-year) half-yearly 

 
Translating to %td dates (DD/MM/YYYY)

dofw() weekly to daily

dofm() monthly to daily

dofq() quarterly to daily

dofy() yearly to daily

 
Translate from %td dates:
 

wofd() daily to weekly


mofd() daily to monthly

qofd() daily to quarterly

yofd() daily to yearly

 
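 To reference dates, compare the date variable with a date literal (wk, qtr and yr stand for your own weekly, quarterly and year variables):
 reg x y if wk == tw(1999w10)
 sum salary if qtr == tq(1998q4)

 tab sex if yr == 2007

To reference range of dates use the tin() and twithin() functions (the data must be tsset first):


 reg y x if tin(01feb1998,01jun1998)
 sum income if twithin(1990q1,1990q3)

tin() includes the beginning and end dates, twithin() excludes them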
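
A sketch of tin() on the sp500 dataset shipped with Stata (an assumption for illustration; it holds daily prices for 2001 with a %td variable called date).

```stata
sysuse sp500, clear
tsset date                                          // tin() and twithin() need tsset data
summarize close if tin(01feb2001, 01jun2001)        // 1 Feb to 1 Jun 2001, end points included
summarize close if twithin(01feb2001, 01jun2001)    // same range, end points excluded
```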

Stata: Visual date displays

 %tc | mdyhms(M, D, Y, h, m, s)

 %tc | dhms(td, h, m, s)

 %tc | hms(h, m, s)

 %td | mdy(M, D, Y)

 %tw | yw(Y, W)

 %tm | ym(Y, M)

 %tq | yq(Y, Q)

 %th | yh(Y, H)

 %ty | Y

clock values (%tc) for data in format "12/03/2009 23:34"

gen h = hh(x) extracts the hour from a %tc value; mm(x) and ss(x) give minutes and seconds

gen x = mdy(m,d,y) builds a daily date; mdyhms(M,D,Y,h,m,s) builds a clock value


gen bdayday = day(bdaynew)

gen bdaydow = dow(bdaynew)

gen bdaymo = month(bdaynew)

gen bdayyr = year(bdaynew)

Convert date times  (NOTE %tc counts milliseconds, %td counts days, %tm counts months)

 %tc to %td use dofc(x) ie from "12/03/2009 23:34" to "12/03/2009"

 %td to %tm use mofd( ) 

 %td to %tq use qofd( )

 then can apply year( ) month( ) day( ) doy( ) dow( ) to a %td date; doy( ) is day of year from 1-366, dow( ) is day of week

 halfyear( ) quarter( ) week( ) also work on %td dates; dow( ) = 0 means Sunday

Conditional date arguments

 gen before = cond(adm < td(15jun2006),1,0)

 list if !inrange(x,2,10) lists if not in range 2 to 10

 list age if inrange(population,200,5000)

 gen byte flag = inlist(x,"one","two")     1 if string x is "one" or "two"

 egen x = rcount(v1-v4), cond(@>5 & @<15)     counts how many of v1-v4 lie between 5 and 15 (rcount() is an egenmore function)

 by year, sort: egen y = sum(died)

generate bdaynew = date(bday,"MDY",2010)     if data are stored as 02/03/07

Date script
TYPICAL DATE SCRIPT (assume date in string format as dd/mm/yyyy hh:mm:ss) and you w

 gen x =substr(date,1,10)  //converts string date to dd/mm/yyyy string

drop date
gen date= date(x,"DMY")  //converts date to %td format

format date %td

label variable date "Date"

gen yr = year(date)

gen mo = month(date)

gen day=day(date)

gen month_yr =mofd(date) //convert to month yr format

format month_yr %tm

FOR FULL CLOCK FORMAT use (assume date in string format as dd/mm/yyyy hh:mm:s

gen double date2 = clock(date,"DMYhms")     //clock values must be stored as double

format date2 %tc

Graph tips
Histograms
 histogram x, by(group, total) percent bin(10)
 histogram x, frequency title("Graph1") xlabel(15(10)30) ytick(1(2)10) start(10) width(2) normal gap(1

  ........  start is where the first bar begins, width is bin size, normal overlays a normal curve, gap

Scatterplots
 graph twoway scatter x y (can use xlabel, xtick, xtitle and also msymbol)
 msymbol() can be:  O,o, D,d, T,t, S,s, +, smplus,X,x, (add "h" for hollow eg Oh = hollow big circl

 twoway scatter x y [fweight = age], by(group) msymbol(oh) mlabel(id)  allows bubble plot with size f
 to format axis numbers eg 1.33 to 1.3 use: ylabel(,format(%3.1f))
Lineplots
 graph twoway line x y year, legend (label(1 "label a") label(2 "label b") position(2)  ring(0) rows(2)
 this plots line x and y against time (year). the legend will be placed in the graph (ring(0) in top right

 can use xtick eg (1960(2) 1980)  for every 2 yrs


 ylabel(0(10)50, angle(horizontal))  will print the y-axis labels horizontally
 clpattern is type of line and  can be: solid,dash,dot,dash_dot,shortdash,shortdash_dot,longdash,blank
 if two lines then specify each in the plot eg msymbol (T Oh) clpattern(dash solid) 

Barplots
 Can do summaries eg graph bar (median) x,over(group) blabel(bar,size (medium)) bar(1, bcolor(gs1
 note:  blabel puts the value on top of each bar

 bar labels can vary in size: size(small)  or tiny medsmall medlarge large
 Stacked bar graph: graph bar (sum) x y z, over(group) stack
 graph hbar for horizontal bar graph

Other graphs:
 qqplot x y
 quantile x

 qnorm x,grid

Mean /SEM type graph:


 graph twoway rcap xlow xhigh year || connected z_mean year, legend(off)   if z_mean is mean and ea
xhigh respectively (eg after collapse command).
 Can use ANOVA command to create mean/SEM graphs by using "predict" command after ANOVA:

 anova income year


 predict income_mean (generates mean value for income)
 predict SEincome, stdp (generates standard error for income) then use serrbar scale(2) to plot  in
 serrbar income_mean SEincome year, scale(2) addplot (line income_mean year,clpatter

or do following:   **from Statistics with Stata by Hamilton


anova income year gender*year
predict aggmean
predict SEagg, stdp
gen agghigh = aggmean +2 * SEagg
gen agglow = aggmean -2 * SEagg
graph twoway connected aggmean year || rcap agghigh agglow year, by (gender, legend(off) note(" ")
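
The same mean and error-bar picture can also be built from collapse instead of anova and predict. This is a sketch under assumptions, not the method from the book cited above: it uses the auto dataset shipped with Stata, and the names m, se, high and low are made up.

```stata
sysuse auto, clear
drop if missing(rep78)                                   // keep only cars with a repair record
collapse (mean) m=price (semean) se=price, by(rep78)     // one row per group: mean and standard error
gen high = m + 2*se
gen low  = m - 2*se
twoway (rcap high low rep78) (connected m rep78), legend(off) ytitle("Mean price")
```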

Overlapping graphs
 graph twoway lfitci x y || scatter x y , xlabel(10(2)20) ylabel(2(10)20, angle(horizontal)) legend(order
"regression line") rows(3) position(1) ring(0)

Combining graphs

graph  twoway x y ............... saving(fig1)


graph  twoway z x............... saving(fig2)
graph combine fig1.gph  fig2.gph, imargin(vsmall) rows(2)            rows is number of rows on graph

Graph time-series
 tsset date_x,format(%td) to set data as daily where date_x is date variable
 tssmooth ma newvar = x, window (2 1 2) generates a 5 day moving average (2 lagged, current v
 graph twoway line admissions date_x      plots line graph of admissions vs date but x-axis looks
 graph twoway tsline admissions, ylabel(10(10)100) ttitle(" ") tlabel(01jan1983 01mar1983, grid
clwidth(thin) clpattern(solid)
 using tsline, because data is tsset, one doesn't reference time; suppress the title with ttitle("")

Other way to generate moving average is egen:


egen moving_av = ma(x), nomiss t(3)  gives 3 day moving average if data is tsset to daily formats

tssmooth nl command is better if outliers present

 tssmooth nl x_smooth = x, smoother(4253h,twice) ...smooths the  running median by different s


moving average of span 3 according to Velleman
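
A sketch of the two smoothers on the sp500 dataset shipped with Stata (an assumption for illustration). Trading days have gaps at weekends, so the example tssets a simple observation counter t (a made-up variable) rather than the calendar date.

```stata
sysuse sp500, clear
gen t = _n
tsset t                                                  // consecutive index, no gaps

tssmooth ma close_ma5 = close, window(2 1 2)             // 5-point moving average: 2 lags, current value, 2 leads
tssmooth nl close_nl  = close, smoother(4253h, twice)    // running-median smoother, more robust to outliers
tsline close close_ma5 close_nl
```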
4/01/2010 19:22:24  Date: convert date to clock value. Must be in format 12/03/09 12:34 (Dates)
 gen double addate1 = clock(doa,"DMY hm") // for PICU data where date field is 12/03/2008 12:33
 format addate1 %tc
 *generate date format ie DDMMYYYY from clock value above
 gen addate = dofc(addate1)
 format addate %td
02/02/2010 19:46:28  String to time conversion. http://www.sealedenvelope.com/stata/time.php eg:tod (string) 0
 str2time tod, ge
02/02/2010 19:48:03  Convert decimal time to HH:MM format (24hrs =1): converts a numeric variable containing elapsed times to a string variable containing times in 24 hour clock format (HH:MM or HH:MM:SS). http://www.sealedenvelope.com/stata/time.php
 time2str etod, g
04/01/2010 21:17:08  Stats on multiple tables of variables by group. Can be count data or any other statistic
 table x y z, cont
 by(group)
24/01/2010 19:19:51  _n manipulations: sequential count of observations in subgroups (eg count from observation 1,2,3,4 ... for every year, with each year starting at 1 again)
 by yr, sort: gen
 subgroup
24/01/2010 19:25:04  Reshape data between wide and long formats
 i = identifier, j =
 wide: divides d
 eg: i=patient, j=t
 reshape wide N
 format to use as
 eg patient with t
 to move betwee
 reshape long x,
 reshape wide x,
 These steps "un
24/01/2010 19:26:57  Start up do file
 make file "profile
 put it in root of S
 In this do-file pu
02/02/2010 19:43:21  Web page http://www.sealedenvelope.com/stata.php
 hl Hosmer-Leme
 reformat conver
 publication qual
 time utilities to t
 format to elapse
 xcount count clu
 xfill fill in static v
 xtab tabulate lon
02/02/2010 20:01:20  web http://www.cpc.unc.edu/research/tools/data_analysis/statatutorial/index.html University Caro
24/01/2010 20:16:13  Plot ANOVA graph
 anova v1 gender year gender*year
 predict vmean
 label v1mean "m
 predict SEv, std
 predicted mean
 gen vhigh =v1m
 gen vlow =v1me
 graph twoway c
 rcap vhigh vlow
 ")) ytitle("mean v
24/01/2010 20:16:45  Plot ANOVA graphs with error bars
 anova drink yea
 predict drinkmea
 label variable dr
 predict SEdrink,
 means
 serrbar drinkme
 addplot(line drin
 (off)
24/01/2010 20:18:37  Plot ANOVA graphs across time. xtline info: http://www.ats.ucla.edu/stat/stata/faq/xtline.htm http://www.ats.ucla.edu/stat/stata/faq/visualize_longitudinal.htm
 if data in long fo
 (id time y as var
 xtline y, t(time) i
 if data in wide fo
 variable eg id tim
 profileplot time1
 info:
 http://www.ats.u
24/01/2010 20:20:05  Bar graphs by Group
 graph bar (sum)
 Can use mean,
 (or any function
 xline(5) or y line
 graph bar (med
 bar(1,bcolor(gs1
 blabel puts num
 greyscale
 can size up labe
 For second bar
 start :
 Graph bar x y,
24/01/2010 20:21:25  CATPLOT, BEAMPLOT graphs
 beamplot x, by(g
 catplot bar rep7
 catplot hbar rep
24/01/2010 20:22:00  Boxplot
 graph hbox x,ov
 intensity(30) ma
 intensity shadin
 Xline in hbox ar
26/01/2010 16:50:04  How to "collapse" data and then plot summary values:
 collapse (mean) dep (sd) sddep=dep (count) n=dep, by(visit group)
 error bar plot:
 sort group
 gen high = dep
 gen low = dep -
 twoway (rarea lo
 (connected dep
 by(group) legen
 depression"))
26/01/2010 17:01:01  How to refer to scalar in a graph title
 (CUSUM G
26/01/2010 17:05:10  How to show r2 regression result in title of regression graph
 ereturn gives r2
 reg x y
 local r2: display
 twoway (lfitci x y
26/01/2010 17:08:32  Histograms
 histogram x, fre
 ylabel(0(2)10) ti
 options:
 bin(3)
 percent
 gap(5)
 addlabel. Adds
 discrete onebar
 norm. Overlies n
26/01/2010 17:09:09  Histogram by group
 if histogram of t
 by(group,total):H
26/01/2010 17:09:56  Histogram and boxplot on same graph. Install histbox.ado
02/02/2010 20:12:02  Add linear regression line to scatterplot
12/02/2010 16:07:03  Graph descriptive data with one command: sixplot module
21/02/2010 08:45:17  ERROR BAR GRAPH: Draw graph with time on x-axis and mean/SE error bars or median and IQ. Needs module XTGRAPH. USEFUL
21/02/2010 10:28:39  Stacked bar graph
26/01/2010 17:11:07  Fill in missing gaps in variable in bulk
24/01/2010 19:30:11  Statistics by group (eg median, means, IQR)
26/01/2010 16:48:30  Decode all missing values to a fixed number (eg 999)
26/01/2010 16:53:22  Generate random numbers and mix up the numbers randomly after this
26/01/2010 16:55:59  Removing duplicated observations in an entire file
26/01/2010 17:00:18  Counting by groups
02/02/2010 19:52:45  Count cluster data. XCOUNT. net from http://www.sealedenvelope.com/
02/02/2010 19:59:04  tab at cluster level (for panel data)
12/02/2010 16:24:34  Subtract the previous value in a running calculation
12/02/2010 16:27:40  bysort command for grouped stats on a variable. Does not need sorting before the command is run
12/02/2010 16:30:06  Sorting orders within groups (_N _n values)
12/02/2010 16:40:40  Table of means/SD/count across groups: use oneway test
12/02/2010 16:51:36  Calculate difference in 1 value from next or sum of next value
12/02/2010 17:02:43  Counting variables by group
13/02/2010 16:09:00  Tabulate ranges of each quartile for a variable using "xtile" to split data in quartiles, and tabstat to
13/02/2010 16:26:08  Create a list with descriptive stats (mean, n, median etc) for a list of items in a group using EGEN c which is percentile, sd, rank. USEFUL
13/02/2010 16:27:28  Setting highest value (record value) in a variable
12/02/2010 16:03:16  web: new modules from
24/01/2010 19:56:44  Goodness of fit
24/01/2010 19:57:10  T tests
24/01/2010 19:58:06  Mann-Whitney, Kruskal-Wallis
26/01/2010 16:47:33  Adding formatting to decimal places after the tab command for basic stats
26/01/2010 16:51:22  How to tabulate summary values (eg median) by group
26/01/2010 16:52:37  Odds ratio with Fisher's exact test, knowing the numbers per group eg: 21 and 16 vs 1 and 4
26/01/2010 17:11:49  Diagnostic tests (specificity and sensitivity and predictive values)
12/02/2010 16:48:53  Chi squared with known proportions or Fisher's exact
13/02/2010 16:04:55  Grouped table of stats in multiple rows using tabstat (medians, means etc) VERY USEFUL
16/02/2010 21:28:59  Generate parameter data (intercept, slope, standard errors for both) for clustered longitudinal data ( Generates parameters for each cluster (id).
19/02/2010 13:09:05  XCONTRACT: great way to generate count/percentage, cumulative percentage stats on grouped d Need to download module first from SSC
04/01/2010 21:12:56  Convert string to numeric
23/01/2010 23:11:48  Cumulative count by variable (_n)
24/01/2010 10:29:00  Make dummy variables from a categorical variable. eg variable "size" has size =0, size =1, size =2. Useful in regression
24/01/2010 19:18:32  List and define missing variables: use "codebook"
24/01/2010 19:23:30  Conditional argument for string: if variable is string eg "alive" or "dead" then convert to 1,0
24/01/2010 19:54:37  Tabstat (for statistics by groups)
26/01/2010 16:57:06  Split characters off a string in different positions
26/01/2010 16:59:27  Split the first word of a string before a delimiter eg. "John:Smith" to John where delimiter is ":". De
26/01/2010 17:03:55  Summarise missing values for variables
02/02/2010 20:16:06  One to one merging (if each uniqueid has a matched uniqueid in another file
03/02/2010 14:56:41  Conditional arguments for strings
12/02/2010 15:57:11  Saving log and commands panel
12/02/2010 16:10:22  Replace contents of a string. FDTA module: fdta checks the string variables, searching for st1. Wh
12/02/2010 16:36:06  Coding variables into categorical groups
12/02/2010 16:38:51  Quick stats (T tests, Mann-Whitney etc)
12/02/2010 16:43:55  Recode continuous variable into defined categories with cut command
12/02/2010 16:44:32  Cumulative sum
12/02/2010 16:46:45  Replace variable with previous value if a value is missing
12/02/2010 16:56:01  Missing value in if command
12/02/2010 16:58:32  Find variables in list
12/02/2010 17:00:08  Recode variable
12/02/2010 17:01:02  Cumulative sum by group
12/02/2010 17:07:03  Number of distinct observations by group
13/02/2010 16:30:30  Separating lines in list
13/02/2010 16:33:39  List of TOTAL values by a group using EGEN command. USEFUL
13/02/2010 16:35:46  Current dates and times
21/02/2010 09:56:41  Generate a numerical sequence in an empty variable eg. 0,5,10...
21/02/2010 09:58:56  Convert a string to a number
21/02/2010 14:59:00  MISSING VALUES: Generate a variable to define if a list of variables have missing values
25/02/2010 11:17:15  GROUP COMMAND: Group string or continuous data into distinct groups (integers 1,2,3 etc). 1st distinct value coded as 1, next as 2 and so on
25/02/2010 11:19:23  Number lists examples
25/02/2010 15:37:55  REPLACE MISSING value with the preceding value
25/02/2010 15:47:03  LAG value by previous or next value
25/02/2010 15:55:18  RECODE
25/02/2010 17:17:04  Referencing SCALARS in variables or in graphs titles
27/02/2010 23:11:58  DIAGNOSTIC TESTS: specificity, sensitivity, predictive values
28/02/2010 23:45:33  Bulk RENAMING VARIABLES: renvars module
03/03/2010 20:38:31  STATSBY: produce tables of descriptive stats. Re-writes data into new dta tables
04/03/2010 21:49:11  Fill a sequence (use egen)
07/03/2010 22:29:31  Convert TIME in HH:MM (eg 12:33) to minutes
13/04/2010 16:16:25  Count unique distinct values in a string where multiple duplicates may occur. Eg each patient has admitted many times)
15/04/2010 15:12:36  Create variable containing the median length of stay for each diagnostic code.
15/04/2010 17:19:29  Stats for multiple groups using SORT
21/04/2010 21:49:19  Profile.do template file
21/04/2010 22:01:32  Moving average
21/04/2010 22:21:17  Breaking a categorical variable into a set of binary variables (make each categorical value a sepera
21/04/2010 22:23:36  ERROR BAR AREA GRAPH after collapsing data to basic stats
 collapse (mean) dep (sd) sdd
 error bar plot of this:
 sort group
 gen high = dep + 2*sddep/sq
 gen low = dep - 2*sddep/sqrt
 twoway (rarea low high visit,
 */ (connected dep visit, mcolo
 */ clcolor(black)), by(group) /*
29/04/2010 13:24:14  GROUP command to combine a number of variables into a unique group
 egen newid = group( var1 va
29/04/2010 13:25:46  Grouping code
 egen faminc =sum(income),b
 egen faminc =max(income),b
03/05/2010 22:50:13  Pyramid type plot for day vs night shifts. Shows how to change graph axis label size
 twoway (line beds_AM day)(l
 ylabel(0(5)30, angle(horizont
 legend(label(1 Day) label(2 N
19/05/2010 15:06:29  _N count for only maximal value of a grouped variable list (eg making a variable with maximal value per year per patient)
 by patientid: gen x = _n
 egen y = max(x), by(patientid)
 gen var1 = yr if x == y
15/06/2010 19:14:49  Scatterplot with regression line and r2 coefficient (SPARL module
 sparl x y
 regress
08/07/2010 00:18:17  eperiod module TIME DIFFERENC
 calculate difference between

Centering data
Return lists:

following the summarize command, the following list returns are available: r(N), r(sum), r(mean),r(sd),r(min

To standardise values:
summarize age
gen age_c = (age - r(mean))/r(sd)
Foreach loop to centre data: in this example  the variables los and los1 are centered creating los_c and los1_

foreach var of varlist los los1 {


summarize `var', meanonly
generate `var'_c = `var' - r(mean)
label variable `var'_c "`var' (centered)"
}
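
The same loop pattern extends from centering to full standardising. A sketch on the auto dataset shipped with Stata (assumed for illustration; the _z suffix is a made-up naming choice):

```stata
sysuse auto, clear
foreach var of varlist price mpg {
    quietly summarize `var'                          // full summarize, since r(sd) is needed as well as r(mean)
    generate `var'_z = (`var' - r(mean)) / r(sd)     // z-score: mean 0, SD 1
    label variable `var'_z "`var' (standardised)"
}
summarize price_z mpg_z                              // check: means about 0, SDs 1
```

Note that summarize with the meanonly option, as used for centering, does not store r(sd), so it is dropped here.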

Loops
Basic loops that can be modified to perform many tasks

foreach variable in v1 v2 v3 {
    list `variable'
}

foreach number of numlist 1 2 3 {
    display `number'
}

forvalues i=1/3 {
    display `i'
}

The same output can also be produced using while as follows:

local i = 1
while `i'<=3 {
    display `i'
    local i = `i' + 1
}

Looping through a variable list


local xlist na k cl 
foreach  v  of  varlist  `xlist' {
summ  `v'
correlate  sbe `v'
scatter  sbe `v'
}

Looping through a variable for duplicates


foreach var of varlist dupl_tag {
sort dupl day h_hour h_ph sample
by dupl: gen x=h_na[_n-1]- h_na
drop if x<0.05 & dupl_tag ==`var'
drop x
}
