Vous êtes sur la page 1sur 8

NESUG 2006

Coders' Corner

Identifying Duplicate Values


Christopher J. Bost, MDRC, New York, NY

ABSTRACT
This paper describes techniques for identifying duplicate values across observations, including PROC SORT, PROC FREQ, PROC SQL, and FIRST./LAST. variables. Pros and cons of each technique are noted.

INTRODUCTION
Duplicate values may or may not be a problem, depending on the data source. Four techniques to identify duplicate values are detailed below. Each is followed by an optional technique to store observations with duplicate values in another SAS data set.

SAMPLE DATA SET


Temporary SAS data set TEST is used in all examples. It contains two variables and ten observations:

Data set TEST Obs 1 2 3 4 5 6 7 8 9 10 ID 111 222 333 444 444 555 555 666 666 666 N 1 1 1 1 2 1 2 1 2 3

Note that IDs 111, 222, and 333 each occur once. IDs 444 and 555 each occur twice. ID 666 occurs three times. In other words, there are three single values, two duplicate values, and one triplicate value. Variable N numbers the relative occurrence of each ID.

IDENTIFYING DUPLICATES WITH PROC SORT


Use PROC SORT to remove duplicate values: proc sort data=test nodupkey; by id; run; Observations in data set TEST are sorted by ID in ascending order. The NODUPKEY option deletes any observations with duplicate BY values (i.e., observations that duplicate a previously encountered value of ID). More than one variable may be specified on the BY statement; SAS will delete any observations that duplicate previously encountered combinations of BY values. Use the NODUPREC option to delete duplicate observations (i.e., where all variable values are repeated).

NESUG 2006

Coders' Corner

STORING DUPLICATES Use the DUPOUT= option with NODUPKEY (or NODUPREC) to output duplicates to the specified SAS data set: proc sort data=test nodupkey dupout=dups; by id; run; Observations in data set TEST are sorted by ID in ascending order. The NODUPKEY option deletes any observations with duplicate BY values from data set TEST. The DUPOUT= option outputs observations with duplicate BY values to data set DUPS. The resulting data sets look as follows: Data set TEST [NODUPKEY] Obs 1 2 3 4 5 6 ID 111 222 333 444 555 666 N 1 1 1 1 1 1 Data set DUPS Obs 1 2 3 4 ID 444 555 666 666 N 2 2 2 3

Data set TEST contains six observations, including the first occurrence of each value of ID. (Note that N equals 1 for each.) Data set DUPS contains four observations, including the subsequent occurrences of each value of ID. (Note that N equals 2 or 3 for each.) PROS AND CONS Pros: easy to use; can output duplicates to another SAS data set; can delete duplicate observations Cons: original and duplicates not stored together; no frequencies; noncontiguous duplicates may not be detected

IDENTIFYING DUPLICATES WITH PROC FREQ


Use PROC FREQ to count the number of times each ID occurs: proc freq data=test; tables id; run; PROC FREQ prints the following output:

The FREQ Procedure Cumulative Cumulative ID Frequency Percent Frequency Percent 111 1 10.00 1 10.00 222 1 10.00 2 20.00 333 1 10.00 3 30.00 444 2 20.00 5 50.00 555 2 20.00 7 70.00 666 3 30.00 10 100.00

IDs 111, 222, and 333 each have a Frequency of 1; IDs 444 and 555 each have a Frequency of 2; and ID 666 has a Frequency of 3. In other words, there are three single values, two duplicate values, and one triplicate value.

NESUG 2006

Coders' Corner

COUNTING FREQUENCIES The Frequency indicates how many times each ID occurs. This categorizes each ID as being a single occurrence (Frequency equals 1), a duplicate (Frequency equals 2), a triplicate (Frequency equals 3), and so on. PROC FREQ may produce voluminous output, however, depending on the number of IDs. Output the frequency counts to a SAS data set, and run PROC FREQ on the Frequency variable to summarize duplicates: proc freq data=test noprint; tables id/out=freqs; run; The NOPRINT option on the PROC FREQ statement suppresses printed output. Frequency-count information for ID is saved to temporary SAS data set FREQS, which looks as follows: Data set FREQS Obs 1 2 3 4 5 6 ID 111 222 333 444 555 666 COUNT 1 1 1 2 2 3 PERCENT 10 10 10 20 20 30

Data set FREQ contains six observations and three variables. There is one observation for each value of ID. The frequency of each ID is stored in variable COUNT. The percentage that each count represents of the total number of observations is stored in variable PERCENT. Run a second PROC FREQ to process data set FREQS: proc freq data=freqs; tables count; run; PROC FREQ prints the following output:

The FREQ Procedure Frequency Count Cumulative Cumulative COUNT Frequency Percent Frequency Percent 1 3 50.00 3 50.00 2 2 33.33 5 83.33 3 1 16.67 6 100.00

COUNT has the value 1 three times (i.e., has a Frequency of 3), meaning there are three single values. COUNT has the value 2 two times, meaning there are two duplicates. COUNT has the value 3 one time, meaning there is one triplicate. The two PROC FREQ steps count all occurrences (i.e., unique values, duplicates, triplicates, and so on, depending on the data).

NESUG 2006

Coders' Corner

STORING DUPLICATES Merge data set FREQS with data set TEST to subset observations that do not have unique values of ID: data dups; merge freqs(in=inF where=(count>1) keep=id count) test (in=inT); by id; if inF and inT; drop count; run; Use the IN= data set option to create variables that store information about the origin of each observation. SAS creates temporary variables INF and INT that have the value 1 when the respective data set contributes to the current observation and the value 0 when it does not. Use the WHERE= option to merge only those observations from data set FREQS where COUNT is greater than 1 (that is, a duplicate). Use the KEEP= option to keep only the variables ID and COUNT (used in the WHERE= option). Use an IF-THEN statement to output only those observations with an ID in both data sets FREQS and TEST to data set DUPS. Use a DROP statement to drop the COUNT variable. The resulting data set looks as follows:

Data set DUPS Obs 1 2 3 4 5 6 7 ID 444 444 555 555 666 666 666 N 1 2 1 2 1 2 3

Data set TEST must be in sort order, sorted with PROC SORT, or indexed by ID. Data set FREQS the output data set from the FREQ procedure is already sorted by ID in ascending order. PROS AND CONS Pros: easy to summarize counts and percentages; data-driven Cons: requires two steps; cannot count combinations of variable values

IDENTIFYING DUPLICATES WITH PROC SQL


Use PROC SQL to count the number of unique IDs and the number of observations: proc sql; select count(distinct id) as UniqueIDs, count(*) as NObs from test; quit; The COUNT function with the DISTINCT keyword counts nonmissing unique values in a column (variable). The COUNT function with an asterisk (*) counts the number of rows (observations) in the table (data set). The AS keyword names the variable to store the result. PROC SQL prints the following output: UniqueIDs NObs 6 10
4

NESUG 2006

Coders' Corner

PROC SQL counted six unique ID values (UNIQUEIDS) out of ten observations (NOBS). This indicates that one or more ID values are duplicated, but it does not indicate which ones. This query is perhaps best used to document that there are no duplicates (i.e., when the number of UNIQUEIDS equals NOBS). Use PROC SQL to count the number of times each ID occurs: proc sql; select id, count(*) as IDCount from test group by id; quit; This query treats observations with the same value of ID as a group and counts the number of times each occurs. PROC SQL prints the following output:

ID IDCount 111 1 222 1 333 1 444 2 555 2 666 3 Save the query results to a SAS data set. Run a second query to count the number of times each IDCOUNT occurs: proc sql; create table counts as select id, count(*) as IDCount from test group by id; select IDCount, count(*) as N from counts group by IDCount; quit; CREATE TABLE outputs the results of the first query to SAS data set COUNTS. The second query treats observations with the same value of IDCOUNT as a group and counts the number of times each occurs. PROC SQL prints the following output: IDCount N 1 3 2 2 3 1

IDCOUNT 1 has an N of 3, meaning there are three single values. IDCOUNT 2 has an N of 2, meaning there are two duplicates. IDCOUNT 3 has an N of 1, meaning there is one triplicate. Note that PROC FREQ could be run on data set COUNTS to produce these results plus percentages. This would require a second step (i.e., instead of a second query within one PROC SQL step). STORING DUPLICATES Perform an inner join of observations in data sets TEST and COUNTS where the ID occurs in COUNTS: proc sql; create table dups as select T.*
5

NESUG 2006

Coders' Corner

from test T, counts C where T.id = C.id and C.IDCount > 1 ; quit; Use CREATE TABLE to create data set DUPS. Use SELECT T.* to retrieve all variables from data set TEST (aliased as T; see below). Specify FROM TEST T, COUNTS C to read observations from data sets TEST and COUNTS. The T after TEST and the C after COUNTS are aliases (temporary, alternate names for tables) that are used to identify the data set from which a variable comes. Specify WHERE T.ID = C.ID to include observations with an ID in data set TEST that matches an ID in data set COUNTS where IDCOUNT is greater than 1 (that is, a duplicate). The resulting data set looks as follows: Data set DUPS Obs 1 2 3 4 5 6 7 ID 444 444 555 555 666 666 666 N 1 2 1 2 1 2 3

PROS AND CONS Pros: easy to run different queries; can do everything in one step Cons: not obvious or intuitive

IDENTIFYING DUPLICATES WITH FIRST./LAST. VARIABLES


Use FIRST./LAST. variables to identify observations as the first, last, or neither the first nor the last time that a value occurs. FIRST./LAST. variables are created by SAS when a data set is SET BY one or more variables. The data set must be in sort order, sorted with PROC SORT, or indexed by ID. FIRST.ID and LAST.ID can have only two values, 0 or 1: data test2; set test; by id; Fid = first.id ; Lid = last.id ; The resulting data set looks as follows: Data set TEST2 Obs 1 2 3 4 5 6 7 8 9 10 ID 111 222 333 444 444 555 555 666 666 666 N 1 1 1 1 2 1 2 1 2 3 Fid 1 1 1 1 0 1 0 1 0 0 Lid 1 1 1 0 1 0 1 0 0 1
6

NESUG 2006

Coders' Corner

FIRST./LAST. variables are temporary variables (i.e., do not exist outside the DATA step in which they are created). FIRST.ID and LAST.ID are saved as FID and LID, respectively, with assignment statements, and are printed for illustrative purposes. ID 111 occurs only once. FID and LID both equal 1 (i.e., are true; it is the first time and the last time that 111 occurs). ID 222 and ID 333 also occur only once; FID and LID both equal 1 for these observations as well. ID 444, however, occurs twice: In the first occurrence, FID equals 1, but LID equals 0 (i.e., it is the first, but not the last, occurrence of ID 444); in the second occurrence, FID equals 0, but LID equals 1 (i.e., it is not the first, but is the last, occurrence of ID 444). Similar logic applies to IDS 555 and 666. Use the values of FID and LID to identify duplicates: Any observation where both FID and LID do not equal 1 means it is not the first and the last time that the value occurs (i.e., there are duplicates). STORING DUPLICATES Use the values of FID and LID to output duplicates to another SAS data set: data unique dups; set test; by id; if not(first.id and last.id) then output dups; else output unique; run; The DATA statement creates two temporary SAS data sets: UNIQUE and DUPS. The SET statement reads observations from data set TEST. The BY statement tells SAS to process observations by ID. Variables FIRST.ID and LAST.ID are created. The IF-THEN statement outputs observations where both FID and LID do not equal 1 to data set DUPS. The ELSE statement outputs all other observations (i.e., where FID and LID equal 1) to data set UNIQUE. The resulting data sets look as follows:

Data set UNIQUE Obs 1 2 3 ID 111 222 333 N 1 1 1

Data set DUPS Obs 1 2 3 4 5 6 7 ID 444 444 555 555 666 666 666 N 1 2 1 2 1 2 3

PROS AND CONS Pros: easy to output duplicates; works on combinations of BY variables; full power of DATA step available Cons: data set must be in sort order, sorted with PROC SORT, or indexed
7

NESUG 2006

Coders' Corner

CONCLUSIONS
Multiple techniques may be used to identify duplicate values. Techniques vary in complexity, ease of use, and ability to output duplicates to another data set. PROC SORT may be the fastest way to eliminate duplicates. PROC FREQ is the simplest way to summarize duplicates. PROC SQL is the easiest way to query duplicates. FIRST./LAST. variables may be the most powerful tool for identifying duplicates.

ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration.

CONTACT INFORMATION
Christopher J. Bost Director, Research Technology Unit MDRC 16 East 34th Street, 19th Floor New York, NY 10016 (212) 340-8613 telephone (212) 684-0832 fax christopher.bost@mdrc.org www.mdrc.org