Extract dates from strings
regexs(0) if regexm(date,"[0-9]*$") extracts year from a date held as a string (eg 23jan2009; regexm needs a string, not a numeric %td variable)
regexs(0) if regexm(date,"[a-zA-Z]+") extracts month from 23jan2009
regexs(0) if regexm(date,"^[0-9]+") extracts day from 23jan2009
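A minimal sketch of the three patterns applied to one string date. The variable names datestr, day, mon and yr are assumptions made for this example only, and the year pattern uses + rather than * so that it cannot match an empty string.

```stata
clear
set obs 1
gen datestr = "23jan2009"            // hypothetical string date

* regexm() returns 1 on a match; regexs(0) then returns the matched text
gen day = regexs(0) if regexm(datestr, "^[0-9]+")
gen mon = regexs(0) if regexm(datestr, "[a-zA-Z]+")
gen yr  = regexs(0) if regexm(datestr, "[0-9]+$")

destring day yr, replace             // turn the digit strings into numbers
list
```

destring is only needed if the day and year are to be used as numbers.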
Split a string into many variables according to defined split point eg: "Diagnosis:cardiac" to "Diagnosis" "cardia
split var1,p(:) limit(2) splits string "var1" at each colon ":" and creates at most 2 new variables (the cap is set by "limit")
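A short sketch of split on one made-up value. The example string is an assumption; the new variable names var11 and var12 come from split's default of using the original name as the stub.

```stata
clear
set obs 1
gen var1 = "Diagnosis:cardiac"       // hypothetical example value

* parse(:) sets the split point; limit(2) creates at most two new variables
split var1, parse(:) limit(2)

list var11 var12                     // pieces are stored in var11 and var12
```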
Extract letters from string
substr(var1,1,1) chooses first letter of string (substr always needs start and length)
substr(var1,1,2) chooses first 2 letters of string
substr(var1,-1,1) chooses last letter of string
substr(name,1,strpos(name,",")-1)…extracts from start of name up to (not including) the first comma
substr("abcdef",2,3) gives "bcd"
substr("abcdef",-3,2) gives "de"
substr("abcdef",2,.) gives "bcdef"
substr("abcdef",-3,.) gives "def"
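A small sketch of the substr() patterns on a made-up name. The variable names (name, first1, last1, surname) and the example value are assumptions for illustration.

```stata
clear
set obs 1
gen name = "Smith, John"             // hypothetical value

gen first1  = substr(name, 1, 1)     // first character
gen last1   = substr(name, -1, 1)    // last character
* strpos() gives the position of the first comma; stop one character before it
gen surname = substr(name, 1, strpos(name, ",") - 1)
list
```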
Isolate parts of string
gen x = strpos(diagnosis,"RSV") >0 codes 1 for any time the string "RSV" appears in the field (if it appear
regexm(var1,"RSV") codes 1 if word "RSV" appears in var1 string
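A sketch comparing the two ways of flagging a substring. The diagnosis values are invented for the example, and rsv_pos and rsv_rx are hypothetical variable names.

```stata
clear
input str30 diagnosis
"RSV bronchiolitis"
"asthma"
"possible RSV"
end

gen rsv_pos = strpos(diagnosis, "RSV") > 0   // 1 if "RSV" occurs anywhere
gen rsv_rx  = regexm(diagnosis, "RSV")       // same flag via regular expression
list
```

Both are case sensitive, so "rsv" in lower case would not be flagged.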
Join strings together: concatenate (these notes list egenmore.ado as a requirement, but concat() is part of official egen in current Stata)
egen initials = concat(x1 x2)
egen newvar = concat(x y z), punct(-) ..creates x-y-z, use any punctuation character in brackets of punct()
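A quick sketch of concat() with and without punct(). The one-row dataset is made up for the example.

```stata
clear
input str5 x str5 y str5 z
"a" "b" "c"
end

egen initials = concat(x y)              // joined with nothing in between
egen newvar   = concat(x y z), punct(-)  // joined with "-" between the pieces
list
```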
Replace
replace myvar = myvar[_n-1] if missing(myvar) ....replace missing value with previous value
replace x= round(x,0.1)......round a value to 1 decimal place (round(x,1) rounds to the nearest whole number)
replace x = 1 if _n<= 100 ...replace first 100 values
replace x = 1 if (_N - _n) < 100....replace last 100 values
replace x = 1 in 4 ....replaces 4th value in x with 1
replace x=1 in 5/25 ....replaces from value 5 to 25
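A sketch of the carry-forward replace on a toy variable. The data are invented; because replace works through the observations in order, a run of missing values is filled from the last non-missing value.

```stata
clear
input myvar
1
.
.
4
.
end

* observation 2 takes the value of 1, then observation 3 takes the new value of 2
replace myvar = myvar[_n-1] if missing(myvar)
list
```

A missing value in the first observation would stay missing, since there is no previous value to copy.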
Stats: Descriptive
summarize var1.........this gives mean (the command is spelled summarize; su and sum are accepted abbreviations)
summarize var1,detail (this gives full breakdown of stats such as mean, SD, min, max etc)
values are stored and can be retrieved with return command eg r(mean), r(p75), r(p25), r(sd) SEM = r(
to see return lists type return list
statsby mean=r(mean) sem=(r(sd)/sqrt(r(N))) size=r(N), by(group): summarize var1 gives grouped stats (summarize does not return r(sem), so it is computed from r(sd) and r(N))
Other method:
summarize x
summarize x,detail (this gives details such as percentiles)
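A sketch of retrieving stored results after summarize, using the auto dataset that ships with Stata. The scalar name sem is an assumption; the SEM is computed from r(sd) and r(N) because summarize does not store it directly.

```stata
sysuse auto, clear                   // example dataset shipped with Stata
quietly summarize price, detail
return list                          // shows every stored r() result

scalar sem = r(sd)/sqrt(r(N))        // standard error of the mean
display "mean   = " r(mean)
display "median = " r(p50)           // percentiles need the detail option
display "SEM    = " sem
```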
Stats by group:
Frequency distributions
tab x y
tab x y, row gives % for row (can also use column)
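diagti 80 17 11 44 (diagnostic test statistics from four cell counts; diagti comes with the user-written diagt module)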
Parametric One sample (
Binomial test bitest female=.5 (eg. does proportion differ from 50%)
ttest x, by(group)
Non Parametric Mann-Whitney test
ranksum x, by(group)
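A sketch running the three tests above on the auto dataset shipped with Stata. Using foreign as the group and mpg as the outcome is an assumption made only to have something runnable.

```stata
sysuse auto, clear
bitest foreign == 0.5                // does the proportion foreign differ from 50%?
ttest mpg, by(foreign)               // parametric: two-sample t test
ranksum mpg, by(foreign)             // non-parametric: Mann-Whitney (rank-sum) test
```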
Date stuff
Import dates from excel in dd/mm/yyyy hh:mm format (eg 12/03/2009 12:33)
gen double dt = clock(datevariable,"DMYhm")
format dt %tc
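A sketch of the clock() import on one made-up value, with dofc() added to strip the time and leave a daily date. The example string and the variable name d are assumptions.

```stata
clear
set obs 1
gen datevariable = "12/03/2009 12:33"          // hypothetical text from Excel

gen double dt = clock(datevariable, "DMYhm")   // %tc values must be stored as double
format dt %tc

gen d = dofc(dt)                               // date-time to daily date
format d %td
list
```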
To convert STRING date in long format (12/03/2009 00:00:00) to short format (12/03/2009)
Format Description
%td daily
%tw weekly
%tm monthly
%tq quarterly
%th half-yearly
%ty yearly
generate y = date(doa, "DMY")
format y %td
generate d=day(birthday)
generate y=year(birthday)
dow(x) day of week (0 = Sunday ... 6 = Saturday)
Other functions:
weekly(x,"wy")
monthly(x,"my")
quarterly(x,"qy")
halfyearly(x,"hy")
yearly(x,"y")
If three columns for each day,month and year use:
gen y = mdy(month,day,year)
gen x = mdy(x,y,z)
mdy(month,day,year) daily
ym(year,month) monthly
yq(year,quarter) quarterly
yh(year,half-year) half-yearly
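A sketch building daily, monthly and quarterly dates from separate day, month and year columns. The two rows of data are invented; d, m and q are hypothetical names.

```stata
clear
input day month year
23 1 2009
12 3 2009
end

gen d = mdy(month, day, year)        // daily date
format d %td
gen m = ym(year, month)              // monthly date
format m %tm
gen q = yq(year, quarter(d))         // quarterly date; quarter() reads the daily date
format q %tq
list
```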
Translating to %td dates (DD/MM/YYYY)
Translate from %td dates:
To reference dates :
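reg x y if wk==tw(1999w10) (wk stands for your own %tw variable; older Stata versions wrote w(1999w10))
sum salary if qtr==tq(1998q4) (qtr stands for your own %tq variable; older versions wrote q(1998-4))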
%tc | mdyhms(M, D, Y, h, m, s)
%tc | dhms(td, h, m, s)
%tc | hms(h, m, s)
%td | mdy(M, D, Y)
%tw | yw(Y, W)
%tm | ym(Y, M)
%tq | yq(Y, Q)
%th | yh(Y, H)
%ty | Y
%td to %tm use mofd( )
%td to %tq use qofd( )
then can apply year( ) month( ) day( ) doy( ) dow( ) to the %td date; doy( ) is day of year from 1-366 and dow( ) is day of week
gen before=cond(adm<td(15jun2006),1,0)
by year,sort:egen y=sum(died)
Date script
TYPICAL DATE SCRIPT (assume date in string format as dd/mm/yyyy hh:mm:ss) and you w
drop date
gen date= date(x,"DMY") //converts string to a Stata daily date (days since 01jan1960); follow with format date %td to display it
gen yr = year(date)
gen mo = month(date)
gen day=day(date)
FOR FULL CLOCK FORMAT use (assume date in string format as dd/mm/yyyy hh:mm:s
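A sketch of one way to handle a string that carries both date and time: read it with clock(), then derive the daily date and its parts. The example value and the names dt and hr are assumptions; this is not necessarily the script the truncated note went on to give.

```stata
clear
set obs 1
gen x = "12/03/2009 14:05:30"        // hypothetical dd/mm/yyyy hh:mm:ss string

gen double dt = clock(x, "DMYhms")   // full date-time, stored as double
format dt %tc

gen date = dofc(dt)                  // daily date without the time
format date %td
gen yr  = year(date)
gen mo  = month(date)
gen day = day(date)
gen hr  = hh(dt)                     // hour taken from the %tc value
list
```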
Graph tips
Histograms
histogram x, by(group, total) percent bin(10)
histogram x, frequency title("Graph1") xlabel(15(10)30) ytick(1(2)10) start(10) width(2) norm gap(1
........ start is where bar begins, width is bin size , norm overlies curve, gap
Scatterplots
graph twoway scatter x y (can use xlabel, xtick, xtitle and also msymbol)
msymbol() can be: O,o, D,d, T,t, S,s, +, smplus,X,x, (add "h" for hollow eg Oh = hollow big circl
twoway scatter x y [fweight = age], by(group) msymbol(oh) mlabel(id) allows bubble plot with size f
to format axis numbers eg 1.33 to 1.3 use: ylabel(,format(%3.1f))
Lineplots
graph twoway line x y year, legend (label(1 "label a") label(2 "label b") position(2) ring(0) rows(2)
this plots line x and y against time (year). the legend will be placed in the graph (ring(0) in top right
Barplots
Can do summaries eg graph bar (median) x,over(group) blabel(bar,size (medium)) bar(1, bcolor(gs1
note: blabel puts the value on top of each bar
bar labels can vary in size: size(small) or tiny medsmall medlarge large
Stacked bar graph : graph bar (sum) x y z, over(group) stack
graph hbar for horizontal bar graph
Other graphs:
qqplot x y
quantile x
qnorm x,grid
Overlapping graphs
graph twoway lfitci x y || scatter x y , xlabel(10(2)20) ylabel(2(10)20, angle(horizontal)) legend(order
"regression line") rows(3) position(1) ring(0)
Combining graphs
Graph time-series
tsset date_x,format(%td) to set data as daily where date_x is date variable
tssmooth ma newvar = x, window(2 1 2) generates a 5 day moving average (2 lagged, current v
graph twoway line admissions date_x plots line graph of admissions vs date but x-axis looks
graph twoway tsline admissions, ylabel(10(10)100) ttitle(" ") tlabel(01jan1983 01mar1983, grid
clwidth(thin) clpattern(solid)
using tsline, because data is tsset, one doesn't reference time; suppress the title with ttitle("")
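A sketch of tsset followed by a 5-point moving average on a toy daily series. The ten days of data and the name adm_ma are invented for the example.

```stata
clear
set obs 10
gen date_x = td(01jan1983) + _n - 1  // ten consecutive days (hypothetical)
format date_x %td
gen admissions = _n                  // toy series 1, 2, ..., 10

tsset date_x                         // declare daily time series

* window(2 1 2): 2 lagged values, the current value, 2 forward values
tssmooth ma adm_ma = admissions, window(2 1 2)
list
```

Once the data are tsset, tsline admissions adm_ma would plot both series without naming the time variable.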
24/01/2010 19:25:04 Reshape data between wide and long formats
These steps "un
wide: divides d
reshape wide N
format to use as
eg patient with t
to move betwee
reshape long x,
reshape wide x,
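A sketch of moving between wide and long with reshape. The toy data and the choice of id as i() and time as j() are assumptions, since the original commands are cut off after the comma.

```stata
clear
input id x1 x2
1 10 11
2 20 21
end

reshape long x, i(id) j(time)        // one row per id and time: id time x
list

reshape wide x, i(id) j(time)        // back to one row per id: id x1 x2
list
```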
24/01/2010 19:26:57 Start up do file
make file "profile
put it in root of S
In this do-file pu
02/02/2010 19:43:21 Web page http://www.sealedenvelope.com/stata.php
hl Hosmer-Leme
reformat conver
publication qual
time utilities to t
format to elapse
xcount count clu
xfill fill in static v
xtab tabulate lon
02/02/2010 20:01:20 web http://www.cpc.unc.edu/research/tools/data_analysis/statatutorial/index.html University Caro
24/01/2010 20:16:13 Plot ANOVA graph
label v1mean "m
predict SEv, std
predicted mean
gen vhigh =v1m
gen vlow =v1me
graph twoway c
rcap vhigh vlow
anova v1 gender year gender*year
")) ytitle("mean v
predict vmean
24/01/2010 20:16:45 Plot ANOVA graphs with error bars (off)
anova drink yea
predict drinkmea
label variable dr
predict SEdrink,
means
serrbar drinkme
addplot(line drin
24/01/2010 20:18:37 Plot ANOVA graphs across time
xtline info: http://www.ats.ucla.edu/stat/stata/faq/xtline.htm
http://www.ats.ucla.edu/stat/stata/faq/visualize_longitudinal.htm
if data in long fo
(id time y as var
xtline y, t(time) i
if data in wide fo
variable eg id tim
profileplot time1
info:
http://www.ats.u
graph bar (sum)
graph bar (sum)
graph bar (sum)
21/02/2010 08:45:17 ERROR BAR GRAPH: Draw graph with time on x-axis and mean/SE error bars or median and IQ
Needs module XTGRAPH. USEFUL
21/02/2010 10:28:39 Stacked bar graph
26/01/2010 17:11:07 Fill in missing gaps in variable in bulk
24/01/2010 19:30:11 Statistics by group (eg median, means, IQR)
26/01/2010 16:48:30 Decode all missing values to a fixed number (eg 999)
26/01/2010 16:53:22 Generate random numbers and mix up the numbers randomly after this
26/01/2010 16:55:59 Removing duplicated observations in an entire file
26/01/2010 17:00:18 Counting by groups
12/02/2010 16:30:06 Sorting orders within groups (_N _n values)
12/02/2010 16:40:40 table of means/SD/count across groups: use oneway test
12/02/2010 16:51:36 Calculate difference in 1 value from next or sum of next value
12/02/2010 17:02:43 Counting variables by group
13/02/2010 16:09:00 Tabulate ranges of each quartile for a variable using "xtile" to split data in quartiles, and tabstat to
13/02/2010 16:26:08 Create a list with descriptive stats (mean, n, median etc) for a list of items in a group using EGEN c
which is percentile, sd, rank. USEFUL
13/02/2010 16:27:28 Setting highest value (record value) in a variable
12/02/2010 16:03:16 web: new modules from
24/01/2010 19:56:44 Goodness of fit
24/01/2010 19:57:10 T tests
24/01/2010 19:58:06 Mann-Whitney, Kruskal-Wallis
26/01/2010 16:47:33 Adding formatting to decimal places after the tab command for basic stats
26/01/2010 16:51:22 How to tabulate summary values (eg median) by group
26/01/2010 16:52:37 Odds ratio with Fisher's exact test, knowing the numbers per group eg: 21 and 16 vs 1 and 4
26/01/2010 17:11:49 Diagnostic tests (specificity and sensitivity and predictive values)
12/02/2010 16:48:53 Chi squared with known proportions or Fisher's exact
13/02/2010 16:04:55 Grouped table of stats in multiple rows using tabstat (medians, means etc) VERY USEFUL
16/02/2010 21:28:59 Generate parameter data (intercept, slope, standard errors for both) for clustered longitudinal data (
Generates parameters for each cluster (id).
19/02/2010 13:09:05 XCONTRACT: great way to generate count/percentage, cumulative percentage stats on grouped d
Need to download module first from SSC
04/01/2010 21:12:56 Convert string to numeric
23/01/2010 23:11:48 cumulative count by variable (_n)
24/01/2010 10:29:00 Make dummy variables from a categorical variable. eg variable "size" has size =0, size =1, size =2. Useful in regression
24/01/2010 19:18:32 List and define missing variables: use "codebook"
24/01/2010 19:23:30 Conditional argument for string: if variable is string eg "alive" or "dead" then convert to 1,0
24/01/2010 19:54:37 Tabstat (for statistics by groups)
26/01/2010 16:57:06 Split characters off a string in different positions
26/01/2010 16:59:27 Split the first word of a string before a delimiter eg. "John:Smith" to John where delimiter is ":". De
26/01/2010 17:03:55 Summarise missing values for variables
02/02/2010 20:16:06 One to one merging (if each uniqueid has a matched uniqueid in another file)
03/02/2010 14:56:41 Conditional arguments for strings
12/02/2010 15:57:11 Saving log and commands panel
12/02/2010 16:10:22 Replace contents of a string. FDTA module: fdta checks the string variables, searching for st1. Wh
12/02/2010 16:36:06 Coding variables into categorical groups
12/02/2010 16:38:51 Quick stats (T tests, Mann-Whitney etc)
12/02/2010 16:43:55 Recode continuous variable into defined categories with cut command
12/02/2010 16:44:32 cumulative sum
12/02/2010 16:46:45 replace variable with previous value if a value is missing
12/02/2010 16:56:01 Missing value in if command
12/02/2010 16:58:32 Find variables in list
12/02/2010 17:00:08 recode variable
12/02/2010 17:01:02 cumulative sum by group
12/02/2010 17:07:03 number of distinct observations by group
13/02/2010 16:30:30 Separating lines in list
13/02/2010 16:33:39 List of TOTAL values by a group using EGEN command. USEFUL
13/02/2010 16:35:46 Current dates and times
21/02/2010 09:56:41 Generate a numerical sequence in an empty variable eg. 0,5,10...
21/02/2010 09:58:56 Convert a string to a number
21/02/2010 14:59:00 MISSING VALUES: Generate a variable to define if a list of variables have missing values
25/02/2010 11:17:15 GROUP COMMAND: Group string or continuous data into distinct groups (integers 1,2,3 etc). 1st distinct value coded as 1, next as 2 and so on
25/02/2010 11:19:23 Number lists examples
25/02/2010 15:37:55 REPLACE MISSING value with the preceding value
25/02/2010 15:47:03 LAG value by previous or next value
25/02/2010 15:55:18 RECODE
25/02/2010 17:17:04 Referencing SCALARS in variables or in graph titles
27/02/2010 23:11:58 DIAGNOSTIC TESTS: specificity, sensitivity, predictive values
28/02/2010 23:45:33 Bulk RENAMING VARIABLES: renvars module
03/03/2010 20:38:31 STATSBY: produce tables of descriptive stats. Rewrites data into new dta tables
04/03/2010 21:49:11
07/03/2010 22:29:31
13/04/2010 16:16:25 Count unique distinct values in a string where multiple duplicates may occur. Eg each patient has
admitted many times)
15/04/2010 15:12:36 Create variable containing the median length of stay for each diagnostic code.
15/04/2010 17:19:29 stats for multiple groups using SORT
21/04/2010 21:49:19 Profile.do template file
21/04/2010 22:01:32 Moving average
21/04/2010 22:21:17 Breaking a categorical variable into a set of binary variables (make each categorical value a sepera
21/04/2010 22:23:36 ERROR BAR AREA GRAPH after collapsing data to basic stats
collapse (mean) dep (sd) sdd
error bar plot of this:
sort group
gen high = dep + 2*sddep/sq
gen low = dep - 2*sddep/sqrt
twoway (rarea low high visit,
*/ (connected dep visit, mcolo
*/ clcolor(black)), by(group) /*
29/04/2010 13:24:14 GROUP command to combine a number of variables into a unique group
egen newid = group( var1 va
29/04/2010 13:25:46 Grouping code
egen faminc =sum(income),b
egen faminc =max(income),b
03/05/2010 22:50:13 Pyramid type plot for day vs night shifts. Shows how to change graph axis label size
twoway (line beds_AM day)(l
ylabel(0(5)30, angle(horizont
legend(label(1 Day) label(2 N
19/05/2010 15:06:29 _N count for only maximal value of a grouped variable list (eg making a variable with maximal value per year per patient)
bysort patientid: gen x = _n
egen y = max(x), by(patientid)
gen var1 = yr if x == y
15/06/2010 19:14:49 Scatterplot with regression line and r2 coefficient (SPARL module
sparl x y
regress
08/07/2010 00:18:17 eperiod module TIME DIFFERENC calculate difference between
Centering data
Return lists:
following the summarize command, the following list returns are available: r(N), r(sum), r(mean),r(sd),r(min
To standardise values:
summarize age
gen age_c = (age -r(mean))/r(sd)
Foreach loop to centre data: in this example the variables los and los1 are centered creating los_c and los1_
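A sketch of such a foreach loop, since the original code is not shown. The toy values are invented, and centring here means subtracting the mean only; divide by r(sd) as well to standardise.

```stata
clear
input los los1
2 10
4 20
9 30
end

foreach v of varlist los los1 {
    quietly summarize `v'
    gen `v'_c = `v' - r(mean)        // centred copy: los_c, los1_c
}
summarize los_c los1_c               // means should now be zero
```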
Loops
Basic loops that can be modified to perform many tasks
forvalues i=1/3 {
display `i'
}