

capture log close

use cstat.dta, clear
log using cstat.log, replace


summ datadate, format

tab indfmt

destring gvkey, replace

duplicates tag gvkey datadate, gen(dup)

tab dup
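
/* A hedged sanity check, not part of the original workflow: halving the
dup==1 count assumes that every duplicated gvkey-datadate appears exactly
twice. The count below should be 0 if that assumption holds. */
count if dup > 1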

* drop if dup==1   (disabled: this would remove both members of every duplicate pair)

/* There are 73,704 obs with one duplicate based on gvkey and datadate,
so we can get rid of 73,704/2= 36,852 obs */
order gvkey datadate dup

/* Identify and keep all obs for gvkeys that have a duplicate. */
gen keeper=0
levelsof gvkey if dup==1, local(levels)
foreach x of local levels {
    replace keeper=1 if `x' == gvkey
}

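/* Tabulate dup before and after restricting to gvkeys with a duplicate
as a logic check. preserve/restore brings the full sample back afterwards,
which the reconciliation at the end of the file requires. */
tab dup
preserve
keep if keeper==1
tab dup
restore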
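
/* Sketch of a loop-free cross-check of keeper, not part of the original
workflow. keeper_alt is a hypothetical helper name; the count should be 0 if
the loop flagged every gvkey that has at least one duplicated datadate.
Note that bysort re-sorts the data by gvkey. */
bysort gvkey: egen byte keeper_alt = max(dup==1)
count if keeper != keeper_alt
drop keeper_alt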

/* Duplicates are mostly for INDFMT values of INDL and FS for the
same firm. These are two different reporting formats that, per
COMPUSTAT, allow firms that are not financial services firms to also report
in a financial services format. Per an eyeball inspection of the data,
most of the FS line items are missing values. The goal is to
keep whichever obs has the fewest missing values. */

egen mc= rowmiss(bkvlps-prcc_f)

/* intermediate step to create a variable equal to the smallest
missing-value count within each gvkey and datadate */

bys gvkey datadate: egen minmc= min(mc)

gen del= 0
replace del=1 if (dup != 0 & gvkey[_n]== gvkey[_n+1] & minmc != mc) | ///
(dup != 0 & gvkey[_n]== gvkey[_n-1] & minmc != mc)

/* The first part of the OR statement above tags every obs that should be
deleted unless it is the last one for the firm. The part after the OR
operator marks the last obs within the firm for deletion if that is the one
that should be deleted. */
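
/* Hedged cross-check, assuming the data are still sorted by gvkey and
datadate from the bys above: an obs should be flagged exactly when it belongs
to a duplicate pair and has more missing values than its partner. del_alt is
a hypothetical helper name; the count should be 0 under that assumption. */
gen byte del_alt = (dup != 0 & mc != minmc)
count if del != del_alt
drop del_alt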

order gvkey datadate dup mc minmc del

tab indfmt if del==1

list gvkey sic if del==1 & indfmt=="INDL"

/* Per OSHA SIC codes, SIC codes 60-67 are Finance, Insurance,
and Real Estate. Makes sense that these firms have more missing values for
INDL than FS in the indfmt variable. */

destring sic, replace

count if sic > 5999 & sic < 6800

/* 137,605 obs in Division H: Finance, Insurance, and Real Estate */

tab indfmt if sic >5999 & sic < 6800

/* but looks like most report their results using INDL and not FS */

count if del==1 & indfmt=="INDL"

drop if del==1

/* Above we dropped 35,927 of the 36,852 duplicates. Based on our previous
calculation, 925 duplicates on gvkey and datadate remain. */

duplicates tag gvkey datadate, gen(dup2)

tab dup2

di 1850/2

/* Above is consistent with our expectation. */

gen tie=0
bys gvkey datadate: replace tie=1 if mc[_n]== mc[_n+1]
/* Within each gvkey and datadate, identify ties in the missing-value counts.
This tags one of the obs, and here we don't care which one we retain since they
have the same number of missing values (unless some are more important than
others, but there is no reason to think that in this case), so we can drop
either one. Note that this marks only one obs for deletion, not both. */
order gvkey datadate dup dup2 mc minmc del tie

drop dup2
drop if tie==1
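
/* Hedged check, not part of the original workflow, that gvkey and datadate
now identify each obs uniquely. capture noisily shows any error message
without stopping the .do file. */
capture noisily isid gvkey datadate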

/* Now that duplicate datadates have been taken care of, create
a variable equal to the year component of datadate so that an
annual xtset may be done */
gen year= year(datadate)
duplicates tag gvkey year, gen(ydup)
tab ydup

/* While gvkey datadate is now unique, gvkey year is not */

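/* An eyeball inspection shows that these duplicates are for firms
that changed their financial reporting period year-ends. preserve/restore
brings the full sample back afterwards, which the reconciliation at the end
of the file requires. */
preserve
keep if ydup==1
list gvkey datadate year fyear in 1/10
restore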

/* Note that fyear has been defined by COMPUSTAT to attempt to
compensate for this behavior */



count if fyear==.
/* The 723 missing values above will trigger an error if we try to xtset
based on gvkey and fyear now. We therefore replace missing fyear based
on the COMPUSTAT data definition */
replace fyear= year(datadate)-1 if fyear== . & month(datadate) <=5
replace fyear= year(datadate) if fyear==. & month(datadate) >5
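
/* Worked example of the rule above, using two illustrative dates that are
not from the data: a 31mar2005 year-end maps to fyear 2004 and a 30jun2005
year-end maps to fyear 2005. Assuming datadate is never missing, the final
count should be 0 once every missing fyear has been filled. */
di year(mdy(3,31,2005)) - 1
di year(mdy(6,30,2005))
count if fyear==.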

di 629 + 94

/* All 723 missing fyears have been taken care of above. Next, we check
for duplicates */

duplicates tag gvkey fyear, gen(dup4)

tab dup4

/* There are 36/2= 18 duplicates. This is explored below */

list gvkey datadate fyear if dup4==1
/* These appear to be firms that switched financial reporting periods
but ended up with the same fyear based on the old and new year-ends.
Eyeball inspection shows that, within each pair, one of the obs has many
missing values. We will drop whichever has the most missing values using a
modification of the minmc code used previously. */

/* The original minmc code needs to be modified since minmc was originally
calculated across observations that share the same DATADATE.
Here, these firms switched financial reporting period-ends, but not to the
extent that the FYEAR variable changed. We therefore re-calculate minmc
within gvkey and FYEAR. */

drop minmc
bys gvkey fyear: egen minmc= min(mc)
order gvkey datadate fyear dup mc minmc del tie
replace del=1 if (dup4==1 & gvkey[_n]== gvkey[_n+1] & minmc != mc) | ///
(dup4==1 & gvkey[_n]== gvkey[_n-1] & minmc != mc)
drop if del==1

/* similar to before, we also need to deal with any potential ties */

bys gvkey fyear: replace tie=1 if mc[_n]== mc[_n+1]

drop if tie==1

xtset gvkey fyear, yearly
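
/* Hedged check of the final sample size, not part of the original workflow:
the first number is the obs count in memory and the second is the starting
count less the three sets of drops documented in this file. */
count
di r(N)
di 433666 - 35927 - 925 - 18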


/* Reconciliation of starting to ending obs:

433,666 - 35,927 - 925 - 18 = 396,796.

Altogether, we dropped 36,870 obs rather than the 73,704 we would have
dropped had we simply run

drop if dup==1

at the beginning of the .do file. */


save c:\econometrics\scaling_monte_carlo\cstat\cstat_cleaned.dta, replace

log close