String Functions: Extract 1st Word From String "Name"

String functions
Search for a word in a string:

 gen x = regexm(var1,"RSV") codes 1 if var1 contains the text "RSV" anywhere (regular-expression match), 0 if not
 gen x1 = regexs(1) if regexm(name, "([a-zA-Z]+)[ ]+([a-zA-Z]+)") extracts 1st word from string "name"
 gen x2 = regexs(2) if regexm(name, "([a-zA-Z]+)[ ]+([a-zA-Z]+)") extracts 2nd word from string "name" (regexs(2) is the 2nd bracketed group, so do not wrap the whole pattern in extra brackets)
 gen x = strpos(var1,"Dr") > 0 codes 1 if "Dr" appears in var1, 0 if not
 display strpos("stata","a") returns 3 (position of the first "a")
 For tips on regexm code: http://www.ats.ucla.edu/stat/stata/faq/regex.htm : how to extract complex numbers or letters fro
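A minimal sketch of the word-extraction commands above on toy data. The variable names (has_dr, first, second) and the three example names are made up for illustration.

```stata
* Toy data: names and values are hypothetical
clear
input str20 name
"John Smith"
"Dr Jane Doe"
"Cher"
end

* 1 if "Dr" occurs anywhere in name, 0 otherwise
gen byte has_dr = strpos(name, "Dr") > 0

* First and second words; [ ]+ demands at least one space between them,
* so a single-word name such as "Cher" is left empty
gen first  = regexs(1) if regexm(name, "^([a-zA-Z]+)[ ]+([a-zA-Z]+)")
gen second = regexs(2) if regexm(name, "^([a-zA-Z]+)[ ]+([a-zA-Z]+)")

list
```

regexs() only returns something sensible straight after a successful regexm(), which is why both appear in the same command.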

Extract dates from strings
 gen yr = regexs(0) if regexm(date,"[0-9]+$") extracts year from a string date such as "23jan2009" (regex works on strings, not on numeric %td values)
 gen mon = regexs(0) if regexm(date,"[a-zA-Z]+") extracts month from "23jan2009"
 gen dd = regexs(0) if regexm(date,"^[0-9]+") extracts day from "23jan2009"

Split a string into many variables according to defined split point eg: "Diagnosis:cardiac" to "Diagnosis" "cardia
 split var1, parse(:) limit(2) splits string "var1" at each colon ":" and creates at most 2 new variables (set by "limit")
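A small sketch of split on made-up data, assuming a string variable called var1 with colon-separated parts. The second row has three parts, which shows what limit(2) does.

```stata
* Toy data: values are hypothetical
clear
input str30 var1
"Diagnosis:cardiac"
"Diagnosis:renal:acute"
end

* Creates var11 and var12; limit(2) stops after two new variables
split var1, parse(:) limit(2)
list
```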

Extract letters from string
 substr(var1,1,1) chooses first letter of string (substr always needs 3 arguments: string, start, length)
 substr(var1,1,2) chooses first 2 letters of string
 substr(var1,-1,1) chooses last letter of string
 substr(name,1,strpos(name,",")-1) extracts from start of name up to (not including) the first comma
 substr("abcdef",2,3) gives "bcd"
 substr("abcdef",-3,2) gives "de"
 substr("abcdef",2,.) gives "bcdef"
 substr("abcdef",-3,.) gives "def"
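The substr calls above can be checked quickly with display. The string "Smith, John" is a made-up example.

```stata
* First and last character of a string
display substr("abcdef", 1, 1)
display substr("abcdef", -1, 1)

* Everything before the first comma: strpos() finds the comma at position 6,
* so substr() keeps characters 1 to 5
display substr("Smith, John", 1, strpos("Smith, John", ",") - 1)
```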

Isolate parts of string
 gen x = strpos(diagnosis,"RSV") >0 codes 1 for any time the string "RSV" appears in the field (if it appear
 regexm(var1,"RSV") codes 1 if word "RSV" appears in var1 string

Join strings together (concatenate); concat() is part of official egen
 egen initials = concat(x1 x2)
 egen newvar = concat(x y z), punct(-) ..creates x-y-z, use any punctuation character in brackets of punct()
Convert String to number

 encode ethnic_str, gen(ethnic)
Extract date (12/03/2009) from a long date format (12/03/2009 00:00:00)

 gen x=substr(date,1,10)
Replace
 replace myvar = myvar[_n-1] if missing(myvar) ....replace missing value with previous value
 replace x = round(x,0.1)......round a value to 1 decimal place (round(x,1) rounds to the nearest whole number)
 replace x = 1 if _n<= 100 ...replace first 100 values
 replace x = 1 if (_N - _n) < 100....replace last 100 values
 replace x = 1 in 4 ....replaces 4th value in x with 1
 replace x=1 in 5/25 ....replaces from value 5 to 25
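A short sketch of the carry-forward replace on toy data (id and myvar are made-up names). Because replace works down the data one observation at a time, a run of missing values is filled from the last non-missing value.

```stata
* Toy data with gaps
clear
input id myvar
1 5
2 .
3 .
4 8
5 .
end

* Fill each missing value with the value in the row above
replace myvar = myvar[_n-1] if missing(myvar)
list

* Rounding: second argument is the unit to round to
display round(3.14159, 0.1)
display round(3.6, 1)
```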
Stats: Descriptive
 summarize var1.........this gives n, mean, SD, min and max
 summarize var1, detail (this adds percentiles, variance, skewness and kurtosis)
 values are stored and can be retrieved with return command eg r(mean), r(p75), r(p25), r(sd) SEM = r(
 to see return lists type return list
 statsby mean=r(mean) sem=(r(sd)/sqrt(r(N))) size=r(N), by(group): summarize var1 gives grouped stats (summarize does not return r(sem), so it is computed from r(sd) and r(N))
Stats for all groups of categorical variable

 table x, contents(freq mean var1 sd var1)
 table x, contents(freq median var1 p25 var1 p75 var1) ....stats on categorical variable x including qu
 table x y, contents(freq mean var1) ................................stats on two categorical variables (x and y)

 tabi a b \ c d, column chi2 exact gives quick categorical stats from the four cell counts a b (row 1) and c d (row 2)
 table slow class ethnic, contents(mean days sd days) by(gender) format(%4.1f)
Stats on multiple variables grouped using by command

 tabstat age weight height, statistics(n mean sd) by(sex) format(%4.1f)
 tabstat x, by(group) stat(n)
 tabstat x if y==1, by(year) stats(median min max)
How to make stats per quartile of a variable

 xtile agequart = age, nq(4) ....new variable agequart holds the quartile (1 to 4) of age
 tabstat age, statistics(n mean sd) by(agequart)
Populate a new variable with stats by a grouped categorical variable

 bysort group: egen var_new = median(y)
 bysort id: egen median_weight = median(weight)
 bysort yr: egen x = rank(lov)

 bysort group: egen var_new = count(y)
 bysort group: egen var_new = total(y) (total() is the current name of the older egen sum())
 bysort group: egen var_new = pctile(y), p(75) ie 3rd quartile
 bysort yr: egen x = mean(y)
 bysort yr: gen z = x - y if w==3 & y==4 (gen, not egen, for plain arithmetic; w stands for whatever variable the condition tests, since z cannot appear in its own "if")
 bysort group: tabstat los, by(sex) statistics(count mean)
 bysort yr: gen ranked_set = _n (sequential case number by group eg year; _n needs gen, not egen)
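A sketch of the by-group egen pattern using the auto dataset that ships with Stata. The new variable names (med_price, p75_price, rank_in_group, dev) are made up for illustration.

```stata
sysuse auto, clear

* Group-level statistics copied onto every row of the group
bysort foreign: egen med_price = median(price)
bysort foreign: egen p75_price = pctile(price), p(75)

* Sequence number within group, ordered by price
bysort foreign (price): gen rank_in_group = _n

* Deviation of each car from its group median
gen dev = price - med_price

list foreign price med_price p75_price rank_in_group dev in 1/5
```

bysort sorts and groups in one step, so the data do not need to be sorted beforehand.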
Make a new dataset of descriptive stats using collapse function

 collapse (p25) x, by(group)
 collapse (mean) x y
 statistics that can be used include mean,median,p50,p75,p25,sd,semean,sum,count,rawsum,min,m
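A sketch of collapse on the shipped auto dataset. collapse replaces the data in memory, so this version wraps it in preserve and restore; the output names (mean_price, sd_price, n) are made up.

```stata
sysuse auto, clear

preserve
* One row per group, with several statistics of the same variable
collapse (mean) mean_price=price (sd) sd_price=price (count) n=price, by(foreign)
list
restore
```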
Cumulative sum by group

 bysort group: gen tot = sum(x)
Highest value record

 egen high = record(wage), by(group) order(yr) (record() needs egenmore from SSC)
Count distinct values in a variable
bysort x y: generate count = (_n==1)
by x: replace count = sum(count)
by x: replace count = count[_N]
Other method:
egen tag = tag(x y)
egen count = total(tag), by(x)
or egen count = nvals(y), by(x) (nvals() needs egenmore)
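A sketch of the tag method on toy data, counting the distinct values of y within each x. The variable name ndistinct and the values are made up.

```stata
* Toy data: x is the group, y the value to count
clear
input x y
1 10
1 10
1 20
2 30
2 30
end

* tag marks exactly one row per distinct (x, y) pair
egen tag = tag(x y)

* Summing the tags within x gives the number of distinct y values
egen ndistinct = total(tag), by(x)
list, sepby(x)
```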

Stats: Basic stuff
Basic stats: good reference: http://www.ats.ucla.edu/stat/stata/whatstat/whatstat.htm
 summarize x
 summarize x, detail (this gives details such as percentiles)
Stats by group:
 bysort yr: tabstat bedays, by(mo) statistics(median p25 p75)

 table mo yr, contents(median bedays)
 table x, contents(mean variableX sd variableX count variableX)

GENERATE median value for each SUBGROUP
 by subgrp, sort: egen medstay = median(los)
GENERATE the deviation from the median length of stay
 generate deltalos = los - medstay
SUBGROUP AGGREGATES (eg by month)
 by mnth, sort: egen monthmedn= median(daymax)

 by mnth, sort: egen monthmax= max(daymax)
 by mnth, sort: egen monthmin= min(daymax)
 by mnth, sort: egen month25= pctile(daymax),p(25)
 by mnth, sort: egen month75= pctile(daymax),p(75)
Frequency distributions
 tab x y
 tab x y, row gives % for row (can also use column)
 tab x y, chi2 for chi squared (use exact for fishers)
Fishers or CHI squared

use cci command:
cci 12 23 24 56, exact .....will give this output:
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        12          23  |         35       0.3429
        Controls |        24          56  |         80       0.3000
-----------------+------------------------+------------------------
           Total |        36          79  |        115       0.3130
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.217391       |    .4710021     3.04953 (exact)
 Attr. frac. ex. |         .1785714       |   -1.123133    .6720806 (exact)
 Attr. frac. pop |         .0612245       |
                 +-------------------------------------------------
                  1-sided Fisher's exact P = 0.4023
                  2-sided Fisher's exact P = 0.6669

 tab x y, chi2 nofreq column row (nofreq hides the frequencies; add exact for Fisher's exact test)
 tabi 54 43 \ 56 78, column chi2 is chi2 from cell counts typed in directly
 tab1 x y z produces oneway frequency for multiple variables
 tab1 varx-varz
 by group, sort: tab x y, nofreq col chi2
 tab3way is good ado file for multi-column
 Odds ratio

 cci 21 16 1 4 , exact
 exactcci 21 16 1 4, exact

Diagnostic tests (sensitivity, specificity, NPV, PPV) use module "diagt"

 diagti 80 17 11 44
Showing means and descriptive stats in tables

 tab x, summ(y) shows basic stats (mean sd freq) for groups in x
 tab x1 x2, summ(y) means two way table of x1 and x2 with means of y
 table x1 x2, contents(mean y1 median y2)
 in tabstat, stats(q) gives the quartiles (p25 p50 p75); stats(iqr) gives the interquartile range

for tab command:
 ,cell gives % for each cell
 ,expected gives expected frequencies
 ,generate(new) creates dummy variables eg. tab x1,gen(dummy) produces dummy1 dummy2 dummy3 for each
 ,lrchi2 likelihood ratio chi squared
 ,missing
 ,nofreq
 ,nolabel

Multi-table frequencies
 table y x2 x3, by(x5 x6) contents (freq)
 by x3, sort: tab x1 x2, exact (This is a 3 way table)

other stats for contents:
 freq,mean,sd,sum,rawsum,count,n,max,min,median,iqr,p1,p99,p75
 use format table x y, contents(mean z median z)
 tab died yr, summ(pim) means two-way table of means
 table x, contents(mean z)

Table function with contents
 tab group cs, sum(los) means
 table group cs, contents(count los median los)
 table inhouse yr, contents(count los median los)
 table band2 yr, contents(count los)
 table age yr, contents(count los)
 table x y, by(z)

Confidence intervals
 ci x, level(99) produces 99% confidence interval for the mean of x (Stata 14 and later: ci means x, level(99))
TABSTAT tabstat x,stats(….)

 tabstat pop, stat(mean) by(size)
 tabstat lov_days, by(yr) stat(mean sd min max) nototal long
 tabstat lov_days, by(yr) stat(n q) nototal long
 tabstat x, stat(mean count median) by(var2)
 tabstat x, stat(count mean q) by(y) q gives the quartiles p25 p50 p75
 by x: tabstat y if z==2 & q==0, statistics(n mean q)
 record(x) is highest value of x (egenmore function)

these are options between brackets:
· mean
· count (count of nonmissing observations)
· n same as count
· sum sum
· max maximum
· min minimum
· range range = max - min
· sd standard deviation
· semean standard error of mean = sd/sqrt(n)
· skewness skewness
· kurtosis kurtosis
· median median (same as p50)
· p1 1st percentile
· p5 5th percentile
· p50 50th percentile (same as median)
· iqr interquartile range = p75 - p25
· q equivalent to specifying "p25 p50 p75"
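A sketch that combines several of the statistics listed above in one tabstat call, using the auto dataset that ships with Stata.

```stata
sysuse auto, clear

* n, mean, SD, quartiles (q = p25 p50 p75) and IQR for two variables, by group
tabstat price mpg, statistics(n mean sd q iqr) by(foreign) format(%9.1f)
```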

OTHER BASIC STATS TESTS

Skewness test
 sktest x
 swilk x (shapiro wilks)
 ladder x (produces powers with skewness test for normality)
 gladder x (plots various distributions of x)
Parametric One sample t-test:
 ttest write=50 (does mean differ from 50)
Non Parametric One sample (Wilcoxon signed-rank test)
 signrank write=50 (eg. does median differ from 50)
Binomial test
 bitest female=.5 (eg. does proportion differ from 50%)
Parametric Two independent samples t-test
 ttest x, by(group)
Non Parametric Mann-Whitney test
 ranksum x, by(group)
Parametric Paired t test

 ttest x = y
Non parametric Paired (Wilcoxon)

 signrank x=y
Parametric One way anova: anova x y
Non Parametric Kruskal-Wallis : kwallis x, by(y)
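The tests above can be tried on the auto dataset that ships with Stata. The choice of variables (mpg, foreign, rep78) and the test value 20 are only for illustration.

```stata
sysuse auto, clear

* One sample: does mean / median mpg differ from 20?
ttest mpg == 20
signrank mpg = 20

* Two independent groups: parametric, then non-parametric
ttest mpg, by(foreign)
ranksum mpg, by(foreign)

* More than two groups: one-way ANOVA, then Kruskal-Wallis
oneway mpg rep78, tabulate
kwallis mpg, by(rep78)
```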
Date stuff
Import dates from excel in dd/mm/yyyy/ hh:mm format (eg 12/03/2009 12:33)
 gen double dt = clock(datevariable,"DMYhm")
 format dt %tc
 label variable dt "Date"
To convert clock date (dd/mm/yy hh:mm) to dd/mm/yyyy (eg 12/03/2001)
 gen new_date = dofc(dt)

 format new_date %td
To subtract dates with result in hours: (both dates must be %tc clock values, not %td)
 gen double hrs = hours(end_dt - start_dt)

 gen los_days = hrs/24
 format los_days %9.1f
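A sketch of the length-of-stay calculation on one made-up admission. The string variables start_str and end_str are hypothetical names.

```stata
* Toy data: one stay of 36 hours
clear
input str16 start_str str16 end_str
"12/03/2009 08:00" "13/03/2009 20:00"
end

* Clock values count milliseconds, so they must be stored as double
gen double start_dt = clock(start_str, "DMYhm")
gen double end_dt   = clock(end_str, "DMYhm")
format start_dt end_dt %tc

* hours() converts a millisecond difference to hours
gen double hrs = hours(end_dt - start_dt)
gen los_days = hrs/24
format los_days %9.1f
list
```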
To generate dd/mm/yyyy date:
 gen birth = date(dob,"DMY")

 format birth %td
 label variable birth "Date of Birth"
To convert STRING date in long format (12/03/2009 00:00:00) to short format (12/03/2009)
 gen x=substr(date,1,10) ....this keeps first 10 characters of string

 then use date(x,"DMY") to format the date as td format
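A sketch showing two routes from a long string date to a %td date, on made-up values. The names rawdate, dt, d and d2 are hypothetical. The assert confirms both routes give the same day.

```stata
* Toy data in dd/mm/yyyy hh:mm:ss form
clear
input str19 rawdate
"12/03/2009 00:00:00"
"25/12/2010 13:45:10"
end

* Route 1: parse the full clock value, then drop the time with dofc()
gen double dt = clock(rawdate, "DMYhms")
format dt %tc
gen d = dofc(dt)
format d %td

* Route 2: keep the first 10 characters and parse them as a date
gen d2 = date(substr(rawdate, 1, 10), "DMY")
format d2 %td

assert d == d2
list
```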
Script to convert clock date to multiple formats
gen before = cond(hired_on < td(15jun2004), 1, 0) if hired_on < .
drop if admitted_on < tc(15jun2004 12:00:00)
gen date_tc = clock(x,"DMYhm") // format structure is 12/03/2006 12:30

gen date_td =dofc(date_tc) //convert to 12/03/2006 DAY/MONTH/YEAR format
gen date_tm=mofd(date_td) //convert to 2006/03 YR/MONTH format
format date_tc %tc
format date_td %td
format date_tm %tm
label variable date_tc "Date Clock"
label variable date_td "Date"
label variable date_tm "Yr-Month"
gen yr=year(date_td) //year
gen mo=month(date_td) // month
gen day=day(date_td) //day
gen doy=doy(date_td) //day of year
Date stuff (stata 11)

Dates are set from jan 1 , 1960

Format Description
%td daily
%tw weekly
%tm monthly
%tq quarterly
%th half-yearly
%ty yearly

 generate y = date(doa, "DMY")
 format y %td
Date field in "12/03/2009 23:34" format , use clock function (1 unit = 1 millisecond)

gen double dt_clock = clock(datevariable,"DMY hm")
To convert clock format to date format use dofc
gen newdate = dofc(dt_clock) ie 12/03/2009; then format newdate %td (a %tc format on the clock variable shows hh:mm as well)
Date in "12/03/2009" format, use date function (1 unit = 1 day)

gen y = date(x,"DMY")
 if year is in format 09 instead of 2009, precede "Y" by the century eg "DM20Y"

 if two-digit years span centuries, give a top year: date(x,"DMY",2020) reads 98 as 1998 and 00 as 2000 (each two-digit year becomes the latest year that does not exceed the top year)

 generate birthday=mdy(month,day,year)
 generate m=month(birthday)
 generate d=day(birthday)
 generate y=year(birthday)
 dow(x) day of week (0 = Sunday ... 6 = Saturday)
 generate weeks = diff/7
 generate months = diff/30.5

 generate years = diff/365.25

Other functions:
 weekly(x,"wy")
 monthly(x,"my")
 quarterly(x,"qy")
 halfyearly(x,"hy")
 yearly(x,"y")

If three columns for each day,month and year use:
 gen y = mdy(month,day,year)
 gen x = mdy(x,y,z)

mdy(month,day,year) daily
yw(year, week) weekly
ym(year,month) monthly
yq(year,quarter) quarterly
yh(year,half-year) half-yearly

Translating to %td dates (DD/MM/YYYY)
dofw() weekly to daily
dofm() monthly to daily
dofq() quarterly to daily
dofy() yearly to daily

Translate from %td dates:

wofd() daily to weekly

mofd() daily to monthly
qofd() daily to quarterly
yofd() daily to yearly

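To reference dates (the pseudofunctions td(), tw(), tm(), tq() turn a date literal into its numeric value; wdate and qdate below stand for your own weekly and quarterly date variables):
 reg y x if wdate == tw(1999w10)
 sum salary if qdate == tq(1998q4)
 tab sex if yr == 2007
To reference range of dates use the tin() and twithin() functions (data must be tsset first):

 reg y x if tin(01feb1998,01jun1998)
 sum income if twithin(1990q1,1990q3)
tin() includes the beginning and end dates, twithin() excludes them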
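A sketch of tin() and twithin() on a made-up daily series. The variables date and y are hypothetical, and the data must be tsset before either function works.

```stata
* Toy daily series: 200 consecutive days starting 1 Jan 1998
clear
set obs 200
gen date = td(01jan1998) + _n - 1
format date %td
tsset date
gen y = _n

* tin() keeps both end dates, twithin() drops them,
* so the second summarize reports two fewer observations
summarize y if tin(01feb1998, 01jun1998)
summarize y if twithin(01feb1998, 01jun1998)
```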
Stata: Visual date displays
 %tc | mdyhms(M, D, Y, h, m, s)
 %tc | dhms(td, h, m, s)
 %tc | hms(h, m, s)
 %td | mdy(M, D, Y)
 %tw | yw(Y, W)
 %tm | ym(Y, M)
 %tq | yq(Y, Q)
 %th | yh(Y, H)
 %ty | Y
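clock values (%tc) for data in format "12/03/2009 23:34"
gen hr = hh(x) extracts the hour from a %tc value; mm(x) and ss(x) give minutes and seconds
gen x = mdy(m,d,y) for a daily date, or mdyhms(m,d,y,h,min,s) for a clock value (store as double)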

gen bdayday = day(bdaynew)
gen bdaydow = dow(bdaynew)
gen bdaymo = month(bdaynew)
gen bdayyr = year(bdaynew)
Convert date times (NOTE %tc counts milliseconds, %td counts days, %tm counts months, all from 1 Jan 1960)
 %tc to %td use dofc(x) ie from "12/03/2009 23:34" to "12/03/2009"
 %td to %tm use mofd( )
 %td to %tq use qofd( )
 then can apply year( ) month( ) day( ) doy( ) dow( ) to a %td value; doy( ) is day of year from 1-366, dow( ) is day of week
 also halfyear( ) quarter( ) week( ); dow( ) = 0 means Sunday
Conditional date arguments
 gen before=cond(adm<td(15jun2006),1,0)
 list if !inrange(x,2,10) lists if not in range 2 to 10
 list age if inrange(population,200,5000)
 gen byte flag = inlist(x,"one","two")
 egen x = rcount(v1-v4), cond(@>5 & @<15) (egenmore: counts how many of v1-v4 lie between 5 and 15)
 by year,sort:egen y=sum(died)
generate bdaynew=date(bday,"MDY", 2010) if data as 02/03/07
Date script
TYPICAL DATE SCRIPT (assume date in string format as dd/mm/yyyy hh:mm:ss) and you w
gen x =substr(date,1,10) //converts string date to dd/mm/yyyy string
drop date
gen date= date(x,"DMY") //converts date to %td format
format date %td
label variable date "Date"
gen yr = year(date)
gen mo = month(date)
gen day=day(date)
gen month_yr =mofd(date) //convert to month yr format
format month_yr %tm
FOR FULL CLOCK FORMAT use (assume date in string format as dd/mm/yyyy hh:mm:s
gen double date2 = clock(date,"DMYhms") // run on the original string date, before it is dropped above
format date2 %tc
Graph tips
Histograms
 histogram x, by(group, total) percent bin(10)
 histogram x, frequency title("Graph1") xlabel(15(10)30) ytick(1(2)10) start(10) width(2) norm gap(1
........ start is where the first bar begins, width is bin size, norm overlays a normal curve, gap
Scatterplots
 graph twoway scatter x y (can use xlabel, xtick, xtitle and also msymbol)
 msymbol() can be: O,o, D,d, T,t, S,s, +, smplus,X,x, (add "h" for hollow eg Oh = hollow big circl
 twoway scatter x y [fweight = age], by(group) msymbol(Oh) mlabel(id) allows bubble plot with size f
 to format axis numbers eg 1.33 to 1.3 use: ylabel(,format(%3.1f))
Lineplots
 graph twoway line x y year, legend (label(1 "label a") label(2 "label b") position(2) ring(0) rows(2)
 this plots line x and y against time (year). the legend will be placed in the graph (ring(0) in top right
 can use xtick eg (1960(2) 1980) for every 2 yrs

 ylabel(0(10)50, angle(horizontal)) will plot labels for y axis horizontally
 clpattern is type of line and can be: solid,dash,dot,dash_dot,shortdash,shortdash_dot,longdash,blank
 if two lines then specify each in the plot eg msymbol (T Oh) clpattern(dash solid)
Barplots
 Can do summaries eg graph bar (median) x,over(group) blabel(bar,size (medium)) bar(1, bcolor(gs1
 note: blabel puts the value on top of each bar
 bar labels can vary in size: size(small) or tiny medsmall medlarge large
 Stacked bar graph : graph bar(sum) x y z, over(group) stack
 graph hbar for horizontal bar graph
Other graphs:
 qqplot x y
 quantile x
 qnorm x,grid

Mean /SEM type graph:

 graph twoway rcap xlow xhigh year || connect z_mean year, legend (off) if z_mean is mean and ea
xhigh respectively (eg after collapse command).
 Can use ANOVA command to create mean/SEM graphs by using "predict" command after ANOVA:
 anova income year

 predict income_mean (generates mean value for income)
 predict SEincome, stdp (generates standard error for income) then use serrbar scale(2) to plot in
 serrbar income_mean SEincome year, scale(2) addplot (line income_mean year,clpatter
or do following: **from Statistics with Stata by Hamilton

anova income year gender*year
predict aggmean
predict SEagg, stdp
gen agghigh = aggmean +2 * SEagg
gen agglow = aggmean -2 * SEagg
graph twoway connected aggmean year || rcap agghigh agglow year, by (gender, legend(off) note(" ")
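A sketch of a mean with error bars built from collapse rather than anova, using the auto dataset that ships with Stata. This is a swapped-in route, not the Hamilton method quoted above, and the names m, se, hi and lo are made up.

```stata
sysuse auto, clear
drop if missing(rep78)

* One row per group holding the mean and its standard error
collapse (mean) m=mpg (semean) se=mpg, by(rep78)

* Approximate 95% limits: mean plus or minus 2 standard errors
gen hi = m + 2*se
gen lo = m - 2*se

* Capped spikes for the limits, connected line for the means
twoway (rcap hi lo rep78) (connected m rep78), legend(off)
```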
Overlapping graphs
 graph twoway lfitci x y || scatter x y , xlabel(10(2)20) ylabel(2(10)20, angle(horizontal)) legend(order
"regression line") rows(3) position(1) ring(0)
Combining graphs
graph twoway x y ............... saving(fig1)

graph twoway z x............... saving(fig2)
graph combine fig1.gph fig2.gph, imargin(vsmall) rows(2) rows is number of rows on graph
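A sketch of the save-then-combine steps on the auto dataset that ships with Stata. The file names fig1 and fig2 follow the notes above; both files are written to the current working directory.

```stata
sysuse auto, clear

* Save two separate graphs to disk (replace overwrites earlier copies)
twoway scatter mpg weight, saving(fig1, replace)
histogram mpg, percent saving(fig2, replace)

* Stack them in one figure, one graph per row
graph combine fig1.gph fig2.gph, rows(2) imargin(vsmall)
```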
Graph time-series
 tsset date_x, format(%td) to set data as daily where date_x is date variable
 tssmooth ma newvar = x, window(2 1 2) generates a 5 day moving average (2 lagged, current v
 graph twoway line admissions date_x plots line graph of admissions vs date but x-axis looks
 graph twoway tsline admissions, ylabel(10(10)100) ttitle(" ") tlabel(01jan1983 01mar1983, grid
clwidth(thin) clpattern(solid)
 using tsline, because data is tsset, one doesn't reference time; suppress the title by ttitle("")
Other way to generate moving average is egen:

egen moving_av = ma(x), nomiss t(3) gives 3 day moving average if data is tsset to daily formats
tssmooth nl command is better if outliers present
 tssmooth nl x_smooth = x, smoother(4253h,twice) ...smooths the running median by different s

moving average of span 3 according to Velleman
4/01/2010 19:22:24  Date: convert date to clock value. Must be in format 12/03/09 12:34 (Dates)
 gen double addate1 = clock(doa,"DMY hm") // for PICU data where date field is 12/03/2008 12:33
 format addate1 %tc
 *generate date format ie DDMMYYYY from clock value above
 gen addate = dofc(addate1)
 format addate %td
02/02/2010 19:46:28  String to time conversion. http://www.sealedenvelope.com/stata/time.php eg:tod (string) 0
 str2time tod, ge
02/02/2010 19:48:03  Convert decimal time to HH:MM format (24hrs =1): converts a numeric variable containing elapsed times to a string variable containing times in 24 hour clock format (HH:MM or HH:MM:SS). http://www.sealedenvelope.com/stata/time.php
 time2str etod, g
table x y z, cont
table x y z, cont
04/01/2010 stats on multiple tables of variables by group table x y z, cont
21:17:08 Can be count data or any other statistic by(group)
24/01/2010 _n manipuations: Sequentialy count of oservations in subgroups (eg count by yr, sort: gen
subgroup
by yr, sort: gen
19:19:51 from observation1,2,3,4 ... for every year, with each year starting at 1 again subgroup
i = identifier, j =
wide: devides d
eg: i=patient, j=t
reshape wide N
format to use as
eg patient with t
to move betwee
reshape long x,
reshape wide x,
24/01/2010
19:25:04 Reshape data between wide and long formats These steps “un
make file "profile
put it in root of S
24/01/2010 In this do-file pu
19:26:57 Start up do file
hl Hosmer-Leme
reformat conver
publication qual
time utilities to t
format to elapse
xcount count clu
02/02/2010 xfill fill in static v
19:43:21 Web page http://www.sealedenvelope.com/stata.php xtab tabulate lon
02/02/2010 web
20:01:20 http://www.cpc.unc.edu/research/tools/data_analysis/statatutorial/index.html University Caro
label v1mean "m
predict SEv, std
predicted mean
gen vhigh =v1m
gen vlow =v1me
graph twoway c
Plot ANOVA graph rcap vhigh vlow
24/01/2010 anova v1 gender year gender*year ")) ytitle("mean v
20:16:13 predict vmean
anova drink yea
predict drinkmea
label variable dr
predict SEdrink,
means
serrbar drinkme
24/01/2010 addplot(line drin
20:16:45 Plot ANOVA graphs with error bars (off)
24/01/2010 Plot ANOVA graphs across time if data in long fo
20:18:37 xtline info: http://www.ats.ucla.edu/stat/stata/faq/xtline.htm (id time y as var
http://www.ats.ucla.edu/stat/stata/faq/visualize_longitudinal.htm xtline y, t(time) i
if data in wide fo
variable eg id tim
profileplot time1
info:
http://www.ats.u
graph bar (sum)
graph bar (sum)
graph bar (sum)
Can use mean,

(or any function
xline(5) or y line
graph bar (med

bar(1,bcolor(gs1
blabel puts num
greyscale
can size up labe
For second bar
24/01/2010 start :
20:20:05 Bar graphs by Group Graph bar x y,
beamplot x, by(g
catplot bar rep7
24/01/2010 catplot hbar rep
20:21:25 CATPLOT, BEAMPLOT graphs
graph hbox x,ov
intensity(30) ma
24/01/2010 intensity shadin

20:22:00 Boxplot Xline in hbox ar
error bar plot:
sort group
gen high = dep
gen low = dep -
twoway (rarea lo
(connected dep
26/01/2010 How to "collapse" data and then plot summary values: by(group) legen
16:50:04 collapse (mean) dep (sd) sddep=dep (count) n=dep, by(visit group) depression"))
26/01/2010
17:01:01 How to refer to scalar in a graph title (CUSUM G
eretrun gives r2
reg x y
26/01/2010 local r2: display
17:05:10 How to show r2 regression result in title of regression graph twoway (lfitci x y
histogram x, fre
ylabel(0(2)10) ti
options:
bin(3)
percent
gap(5)
addlabel. Adds
26/01/2010 discrete onebar
17:08:32 Histograms norm. Overlies n
26/01/2010 if hisrogram of t
17:09:09 Histogram by group by(group,total):H
26/01/2010 Histogram and boxplot on same graph.
17:09:56 Install histbox.ado
02/02/2010 20:12:02  Add linear regression line to scatterplot
12/02/2010 16:07:03  graph descriptive data with one command: sixplot module
21/02/2010 08:45:17  ERROR BAR GRAPH: Draw graph with time on x-axis and mean/SE error bars or median and IQ
                     Needs module XTGRAPH. USEFUL
21/02/2010 10:28:39  Stacked bar graph
26/01/2010 17:11:07  Fill in missing gaps in variable in bulk
24/01/2010 19:30:11  Statistics by group (eg median, means, IQR)
26/01/2010 16:48:30  Decode all missing values to a fixed number (eg 999)
26/01/2010 16:53:22  Generate random numbers and mix up the numbers randomly after this
26/01/2010 16:55:59  Removing duplicated observations in an entire file
26/01/2010 17:00:18  Counting by groups
02/02/2010 19:52:45  Count cluster data. XCOUNT
                     net from http://www.sealedenvelope.com/
02/02/2010 19:59:04  tab at cluster level (for panel data)
12/02/2010 16:24:34  subtract the previous value in a running calculation
12/02/2010 16:27:40  bysort command for grouped stats on a variable. Does not need sorting before the command is run
12/02/2010 16:30:06  Sorting orders within groups (_N _n values)
12/02/2010 16:40:40  table of means/SD/count across groups: use oneway test
12/02/2010 16:51:36  Calculate difference in 1 value from next or sum of next value
12/02/2010 17:02:43  Counting variables by group
13/02/2010 16:09:00  Tabulate ranges of each quartile for a variable using "xtile" to split data in quartiles, and tabstat to
13/02/2010 16:26:08  Create a list with descriptive stats (mean, n, median etc) for a list of items in a group using EGEN c
                     which is percentile, sd, rank. USEFUL
13/02/2010 16:27:28  Setting highest value (record value) in a variable
12/02/2010 16:03:16  web: new modules from
24/01/2010 19:56:44  Goodness of fit
24/01/2010 19:57:10  T tests
24/01/2010 19:58:06  Mann Whitney, Kruskal-Wallis
26/01/2010 16:47:33  Adding formatting to decimal places after the tab command for basic stats
26/01/2010 16:51:22  How to tabulate summary values (eg median) by group
26/01/2010 16:52:37  Odds ratio with Fisher's exact test, knowing the numbers per group eg: 21 and 16 vs 1 and 4
26/01/2010 17:11:49  Diagnostic tests (specificity and sensitivity and predictive values)
12/02/2010 16:48:53  Chi squared with known proportions or Fisher's exact
13/02/2010 16:04:55  Grouped table of stats in multiple rows using tabstat (medians, means etc) VERY USEFUL
16/02/2010 21:28:59  Generate parameter data (intercept, slope, standard errors for both) for clustered longitudinal data (
                     Generates parameters for each cluster (id).
19/02/2010 13:09:05  XCONTRACT: great way to generate count/percentage , cumulative percentage stats on grouped d
                     Need to download module first from SSC
04/01/2010 21:12:56  Convert string to numeric
23/01/2010 23:11:48  cumulative count by variable (_n)
24/01/2010 10:29:00  Make dummy variables from a categorical variable. eg variable "size" has size =0, size =1, size =2. Useful in regression
24/01/2010 19:18:32  List and define missing variables: use "codebook"
24/01/2010 19:23:30  Conditional argument for string: if variable is string eg "alive" or "dead" then convert to 1,0
24/01/2010 19:54:37  Tabstat (for statistics by groups)
26/01/2010 16:57:06  Split characters off a string in different positions
26/01/2010 16:59:27  Split the first word of a string before a delimiter eg. "John:Smith" to John where delimiter is ":". De
26/01/2010 17:03:55  Summarise missing values for variables
02/02/2010 20:16:06  One to one merging (if each uniqueid has a matched uniqueid in another file
03/02/2010 14:56:41  Conditional arguments for strings
12/02/2010 15:57:11  Saving log and commands panel
12/02/2010 16:10:22  Replace contents of a string. FDTA module: fdta checks the string variables, searching for st1. Wh
12/02/2010 16:36:06  Coding variables into categorical groups
12/02/2010 16:38:51  Quick stats (T tests, Mann Whitney etc)
12/02/2010 16:43:55  Recode continuous variable into defined categories with cut command
12/02/2010 16:44:32  cumulative sum
12/02/2010 16:46:45  replace variable with previous value if a value is missing
12/02/2010 16:56:01  Missing value in if command
12/02/2010 16:58:32  Find variables in list
12/02/2010 17:00:08  recode variable
12/02/2010 17:01:02  cumulative sum by group
12/02/2010 17:07:03  number of distinct observations by group
13/02/2010 16:30:30  Separating lines in list
13/02/2010 16:33:39  List of TOTAL values by a group using EGEN command. USEFUL
13/02/2010 16:35:46  Current dates and times
21/02/2010 09:56:41  Generate a numerical sequence in an empty variable eg. 0,5,10...
21/02/2010 09:58:56  Convert a string to a number
21/02/2010 14:59:00  MISSING VALUES: Generate a variable to define if a list of variables have missing values
25/02/2010 11:17:15  GROUP COMMAND: Group string or continuous data into distinct groups (integers 1,2,3 etc). 1st distinct value coded as 1, next as 2 and so on
25/02/2010 11:19:23  Number lists examples
25/02/2010 15:37:55  REPLACE MISSING value with the preceding value
25/02/2010 15:47:03  LAG value by previous or next value
25/02/2010 15:55:18  RECODE
25/02/2010 17:17:04  Referencing SCALARS in variables or in graph titles
27/02/2010 23:11:58  DIAGNOSTIC TESTS: specificity, sensitivity, predictive values
28/02/2010 23:45:33  Bulk RENAMING VARIABLES: renvars module
03/03/2010 20:38:31  STATSBY: produce tables of descriptive stats. RE-writes data into new dta tables
04/03/2010 21:49:11  Fill a sequence (use egen)
07/03/2010 22:29:31  Convert TIME in HH:MM (eg 12:33) to minutes
13/04/2010 16:16:25  Count unique distinct values in a string where multiple duplicates may occur. Eg each patient has
                     admitted many times)
15/04/2010 15:12:36  Create variable containing the median length of stay for each diagnostic code.
15/04/2010 17:19:29  stats for multiple groups using SORT
21/04/2010 21:49:19  Profile.do template file
21/04/2010 22:01:32  Moving average
21/04/2010 22:21:17  Breaking a categorical variable into a set of binary variables (make each categorical value a sepera
collapse (mean) dep (sd) sdd
error bar plot of this:
sort group
gen high = dep + 2*sddep/sq
gen low = dep - 2*sddep/sqrt
twoway (rarea low high visit,
21/04/2010 */ (connected dep visit, mcolo
22:23:36 ERROR BAR AREA GRAPH after collapsing data to basic stats */ clcolor(black)), by(group) /*
29/04/2010 GROUP command to combine a numbe rof variables into a unique
13:24:14 group egen newid = group( var1 va
29/04/2010 Egen faminc =sum(income),b
13:25:46 Grouping code Egen faminc =max(income),b
twoway (line beds_AM day)(l
03/05/2010 Pyramid type plot for day vs night shifts. Shows how to change ylabel(0(5)30, angle(horizont
22:50:13 graph axis label size legend(label(1 Day) label(2 N
by patientid:gen x = _n
19/05/2010 _N count for only maximal value of a grouped variable list egen y = max(x), by patientid
15:06:29 (eg making avariable with maximal value per year per patient) gen var1 = yr if x = y
15/06/2010 sparl x y
19:14:49 Scatterplot with regression line and r2 coeficiant (SPARLmodule regress
08/07/2010
00:18:17 eperiod module TIME DIFFERENC calculatedifference between
Centering data
Return lists:
following the summarize command, the following list returns are available: r(N), r(sum), r(mean),r(sd),r(min
To standardise values:
summarize age
gen age_c = (age - r(mean))/r(sd)
Foreach loop to centre data: in this example the variables los and los1 are centered creating los_c and los1_
foreach var of varlist los los1 {

summarize `var', meanonly
generate `var'_c = `var' - r(mean)
label variable `var'_c "`var' (centered)"
}
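The same loop pattern can standardise rather than just centre. A sketch on the auto dataset that ships with Stata; the _z suffix is a made-up naming choice.

```stata
sysuse auto, clear

foreach var of varlist price mpg {
    * summarize without meanonly, because r(sd) is needed as well
    quietly summarize `var'
    generate `var'_z = (`var' - r(mean)) / r(sd)
    label variable `var'_z "`var' (standardised)"
}

* Each new variable should have mean close to 0 and SD close to 1
summarize price_z mpg_z
```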
Loops
Basic loops that can be modified to perform many tasks
foreach variable in v1 v2 v3 {

list `variable'
}
foreach number of numlist 1 2 3 {

display `number'
}
forvalues i=1/3 {
display `i'
}
The same output can also be produced using while as follows:

local i = 1
while `i'<=3 {
disp `i'
local i = `i' + 1
}
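A runnable sketch of the forvalues and while loops side by side. A local macro is always opened with a left quote (`) and closed with an apostrophe ('). The running total is an added illustration.

```stata
* forvalues: count from 1 to 3 and keep a running total
local total = 0
forvalues i = 1/3 {
    local total = `total' + `i'
    display "i = `i', running total = `total'"
}

* while: same count, with the counter incremented by hand
local i = 1
while `i' <= 3 {
    display `i'
    local ++i
}
```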
Looping through a variable list

local xlist na k cl
foreach v of varlist `xlist' {
summ `v'
correlate sbe `v'
scatter sbe `v'
}
Looping through a variable for duplicates

foreach var of varlist dupl_tag{
sort dupl day h_hour h_ph sample
by dupl: gen x=h_na[_n-1]- h_na
drop if x<0.05 & dupl_tag ==`var'
drop x
}
