
Checking for Duplicates

On any version of SQL Server, you can identify duplicates using a simple query with GROUP BY and HAVING, as follows:

DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

SELECT data, COUNT(data) nr
FROM @table
GROUP BY data
HAVING COUNT(data) > 1

Removing Duplicate Rows in SQL Server

The following sections present a variety of techniques for removing duplicates from SQL Server database tables, depending on the nature of the table design.

Tables with no primary key

When you have duplicates in a table that has no primary key defined, and you are using an older version of SQL Server, such as SQL Server 2000, you do not have an easy way to identify a single row. Therefore, you cannot simply delete such a row by specifying a WHERE clause in a DELETE statement. You can, however, use the SET ROWCOUNT 1 command, which will restrict the subsequent DELETE statement to removing only one row. For example:

DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

SET ROWCOUNT 1
DELETE FROM @table WHERE data = 'duplicate row'
SET ROWCOUNT 0

In the above example, only one row is deleted. Consequently, there will be one remaining row with the content 'duplicate row'. If you have more than one duplicate of a particular row, you would simply adjust the ROWCOUNT accordingly. Note that after the delete, you should reset ROWCOUNT to 0 so that subsequent queries are not affected.

To remove all duplicates in a single pass, the following code will work, but it is likely to be horrendously slow if there are a large number of duplicates and table rows:

DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

SET NOCOUNT ON
SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
    DELETE FROM @table
    WHERE data IN (SELECT data
                   FROM @table
                   GROUP BY data
                   HAVING COUNT(*) > 1)
    IF @@ROWCOUNT = 0
        BREAK
END
SET ROWCOUNT 0

When cleaning up a table that has a large number of duplicate rows, a better approach is to select just a distinct list of the duplicates, delete all occurrences of those duplicate entries from the original table, and then insert the list back into the original table.

DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')

INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('second duplicate row')
INSERT INTO @table VALUES ('second duplicate row')

SELECT data
INTO #duplicates
FROM @table
GROUP BY data
HAVING COUNT(*) > 1

-- delete all rows that are duplicated
DELETE FROM @table
FROM @table o
    INNER JOIN #duplicates d ON d.data = o.data

-- insert one row for every duplicate set
INSERT INTO @table(data)
SELECT data
FROM #duplicates

As a variation of this technique, you could select all the data, without duplicates, into a new table, delete the old table, and then rename the new table to match the name of the original table:

CREATE TABLE duplicateTable3 (data VARCHAR(20))
INSERT INTO duplicateTable3 VALUES ('not duplicate row')
INSERT INTO duplicateTable3 VALUES ('duplicate row')
INSERT INTO duplicateTable3 VALUES ('duplicate row')
INSERT INTO duplicateTable3 VALUES ('second duplicate row')
INSERT INTO duplicateTable3 VALUES ('second duplicate row')

SELECT DISTINCT data
INTO tempTable

FROM duplicateTable3
GO
TRUNCATE TABLE duplicateTable3
DROP TABLE duplicateTable3
EXEC sp_rename 'tempTable', 'duplicateTable3'

In this solution, the SELECT DISTINCT will select all the rows from our table except for the duplicates. These rows are immediately inserted into a table named tempTable. This is a temporary table in the sense that we use it to temporarily store the unique rows. However, it is not a true temporary table (i.e. one that lives in the temporary database), because we need the table to exist in the current database, so that it can later be renamed, using sp_rename.

The sp_rename command is an absolutely horrible way of renaming textual objects, such as stored procedures, because it does not update all the system tables consistently. However, it works well for non-textual schema objects, such as tables.

Note that this solution is usually used on a table that has no primary key. If there is a key, and there are foreign keys referencing the rows that are identified as being duplicates, then the foreign key constraints need to be dropped and re-created during the table swap.

Tables with a primary key, but no foreign key constraints

If your table has a primary key, but no foreign key constraints, then the following solution offers a way to remove duplicates that is much quicker, as it entails less iteration:

DECLARE @table TABLE (
    id INT IDENTITY(1, 1),
    data VARCHAR(20)
)
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

WHILE 1 = 1
BEGIN
    DELETE FROM @table
    WHERE id IN (SELECT MAX(id)

                 FROM @table
                 GROUP BY data
                 HAVING COUNT(*) > 1)
    IF @@ROWCOUNT = 0
        BREAK
END

Unfortunately, this sort of technique does not scale well. If your table has a reliable primary key, for example a numeric column with the IDENTITY property enabled, then the following approach is probably the neatest and best. Essentially, it deletes all the duplicates except for the one with the highest value for the primary key. If a table has a unique column, such as a number or integer, that will reliably return just one value with MAX() or MIN(), then you can use this technique to identify the chosen survivor of each group of duplicates.

DECLARE @table TABLE (
    id INT IDENTITY(1, 1),
    data VARCHAR(20)
)
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('second duplicate row')
INSERT INTO @table VALUES ('second duplicate row')

DELETE FROM @table
FROM @table o
    INNER JOIN ( SELECT data
                 FROM @table
                 GROUP BY data
                 HAVING COUNT(*) > 1

               ) f ON o.data = f.data
    LEFT OUTER JOIN ( SELECT [id] = MAX(id)
                      FROM @table
                      GROUP BY data
                      HAVING COUNT(*) > 1
                    ) g ON o.id = g.id
WHERE g.id IS NULL

This can be simplified even further, though the logic is rather harder to follow:

DELETE FROM f
FROM @table AS f
    INNER JOIN @table AS g
        ON g.data = f.data
       AND f.id < g.id

Tables that are referenced by a Foreign Key

If you've set up your constraints properly, then you will be unable to delete duplicate rows from a table that is referenced by another table using the above techniques, unless you have specified cascading deletes in the foreign key constraints. You can alter existing foreign key constraints by adding a cascading delete on the foreign key constraint. This means that rows in other tables that refer to the duplicate row via a foreign key constraint will be deleted. Because you will lose the referenced data as well as the duplicate, you are more likely to want to save the duplicate data in its entirety first in a holding table. When you are dealing with real data, you are likely to need to identify the duplicate rows that are being referred to, and delete the duplicates that are not referenced, or merge duplicates and update the references. This task will probably have to be done manually in order to ensure data integrity.

Tables with columns that cannot have a UNIQUE constraint

Sometimes, of course, you may have columns on which you cannot define a unique constraint, or on which you cannot even use the DISTINCT keyword. Large object types, like NTEXT, TEXT and IMAGE in SQL Server 2000, are good examples of this. These are data types that cannot be compared, and so the above solutions would not work. In these situations, you will need to add an extra column to the table that you can use as a surrogate key. Such a surrogate key is not derived from the application data; its value may be automatically generated, similarly to the identity columns in our previous examples.
Unfortunately, in SQL Server, you cannot use ALTER TABLE to add the IDENTITY property to an existing column. One way to number the existing rows is to rebuild the table, using SELECT INTO and the IDENTITY() function, as follows:

CREATE TABLE duplicateTable4 (data NTEXT)

INSERT INTO duplicateTable4 VALUES ('not duplicate row')
INSERT INTO duplicateTable4 VALUES ('duplicate row')
INSERT INTO duplicateTable4 VALUES ('duplicate row')
INSERT INTO duplicateTable4 VALUES ('second duplicate row')
INSERT INTO duplicateTable4 VALUES ('second duplicate row')

SELECT IDENTITY(INT, 1, 1) AS id, data
INTO duplicateTable4_Copy
FROM duplicateTable4

The above will create the duplicateTable4_Copy table. This table will have an identity column named id, which will already contain unique numeric values. Note that although we are creating an identity column, uniqueness is not enforced in this case; you will need to add a unique index or define the id column as a primary key.

Using a cursor

People with an application development background might consider using a cursor to eliminate duplicates. The basic idea is to order the contents of the table, iterate through the ordered rows, and check if the current row is equal to the previous row. If it is, delete the row. This solution could look like the following in T-SQL:

CREATE TABLE duplicateTable5 (data VARCHAR(30))
INSERT INTO duplicateTable5 VALUES ('not duplicate row')
INSERT INTO duplicateTable5 VALUES ('duplicate row')
INSERT INTO duplicateTable5 VALUES ('duplicate row')
INSERT INTO duplicateTable5 VALUES ('second duplicate row')
INSERT INTO duplicateTable5 VALUES ('second duplicate row')

DECLARE @data VARCHAR(30), @previousData VARCHAR(30)

DECLARE cursor1 CURSOR SCROLL_LOCKS
FOR SELECT data
    FROM duplicateTable5

    ORDER BY data
FOR UPDATE

OPEN cursor1
FETCH NEXT FROM cursor1 INTO @data
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @previousData = @data
        DELETE FROM duplicateTable5 WHERE CURRENT OF cursor1
    SET @previousData = @data
    FETCH NEXT FROM cursor1 INTO @data
END
CLOSE cursor1
DEALLOCATE cursor1

The above script will not work, because once you apply the ORDER BY clause in the cursor declaration, the cursor becomes read-only. If you remove the ORDER BY clause, then there is no guarantee that the rows will be in order, and checking two subsequent rows would no longer be sufficient to identify duplicates. Interestingly, because the above example creates a small table where all the rows fit onto a single database page and the duplicate rows are inserted in groups, removing the ORDER BY clause does make the cursor solution work. It will fail, however, with any table that is larger and has seen some modifications.

Problem

In data warehousing applications, during ETL (Extraction, Transformation and Loading), or even in OLTP (On-Line Transaction Processing) applications, we often encounter duplicate records in our tables. To make the table data consistent and accurate, we need to get rid of these duplicate records, keeping only one of them in the table. In this tip I discuss different strategies which you can take for this, along with the pros and cons.

Solution

There are different methods for deleting duplicate (de-duplication) records from a table, each with its own pros and cons. I am going to discuss these methods, the prerequisites of each, and their pros and cons.

1. Using correlated subquery
2. Using temporary table
3. Creating new table with distinct records and renaming it
4. Using Common Table Expression (CTE)
5. Using Fuzzy Group Transformation in SSIS
6. Using MERGE Statement

1. Using correlated subquery

If you already have an identity column on your table, your work is half done: you can use a correlated subquery to get rid of the duplicates. First, let me briefly tell you how a correlated subquery works. In a correlated subquery, the outer query is evaluated first; the result from the outer query is used by the inner subquery for its evaluation, and the outcome of the inner subquery is in turn used by the outer query to produce the final result set.

In the example below, for the data deletion I am joining the inner query columns with the outer query to find the record with the maximum ID (you can use the minimum instead and change the predicate from "<" to ">"). Then I delete all the records which have an ID less than the one returned by the inner query. Please note, this approach can be taken only if you have an identity column on the target table, or if you are willing to alter your target table to add an identity column, which would require ALTER TABLE permission.
Script #1 - De-duplication with correlated subquery

CREATE TABLE Employee
(
    [ID] INT IDENTITY,
    [FirstName] VARCHAR(100),
    [LastName] VARCHAR(100),
    [Address] VARCHAR(100)
)
GO
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Linda', 'Mitchel', 'America')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Linda', 'Mitchel', 'America')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('John', 'Albert', 'Australia')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('John', 'Albert', 'Australia')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('John', 'Albert', 'Australia')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Arshad', 'Ali', 'India')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Arshad', 'Ali', 'India')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Arshad', 'Ali', 'India')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Arshad', 'Ali', 'India')
GO

SELECT * FROM Employee
GO

--Selecting distinct records
SELECT * FROM Employee E1
WHERE E1.ID = ( SELECT MAX(ID)
                FROM Employee E2
                WHERE E2.FirstName = E1.FirstName
                  AND E2.LastName = E1.LastName
                  AND E2.Address = E1.Address )
GO

--Deleting duplicates
DELETE Employee
WHERE ID < ( SELECT MAX(ID)
             FROM Employee E2
             WHERE E2.FirstName = Employee.FirstName
               AND E2.LastName = Employee.LastName
               AND E2.Address = Employee.Address )
GO
SELECT * FROM Employee
GO

2. Using temporary table

In this approach we pull the distinct records from the target table into a temporary table, truncate the target table, and finally insert the records from the temporary table back into the target table, as you can see in Script #3. There are three things you need to be aware of when using this approach. First, you need to make sure the tempdb database has enough space to hold all the distinct records, especially if the result set is very large. Second, you need to perform this operation in a transaction, at least the TRUNCATE and INSERT parts, so that you are not left with another problem if it fails part-way through for any reason. Third, you need to have the required permissions for object creation/truncation.

Script #2 creates a table and inserts some records, along with some duplicate records, which we will be using in all further examples.

Script #2 - Creating a table with duplicate records

CREATE TABLE Employee
(
    [FirstName] VARCHAR(100),
    [LastName] VARCHAR(100),
    [Address] VARCHAR(100)
)
GO
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Linda', 'Mitchel', 'America')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Linda', 'Mitchel', 'America')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('John', 'Albert', 'Australia')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('John', 'Albert', 'Australia')

INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('John', 'Albert', 'Australia')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Arshad', 'Ali', 'India')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Arshad', 'Ali', 'India')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Arshad', 'Ali', 'India')
INSERT INTO Employee([FirstName], [LastName], [Address]) VALUES ('Arshad', 'Ali', 'India')
GO
SELECT * FROM Employee
GO

Script #3 - Using temporary table

BEGIN TRAN
-- Pull distinct records into the temporary table
SELECT DISTINCT *
INTO #Employee
FROM Employee

--Truncate the target table
TRUNCATE TABLE Employee

--Insert the distinct records from the temporary table
--back into the target table
INSERT INTO Employee
SELECT * FROM #Employee

--Drop the temporary table
IF OBJECT_ID('tempdb..#Employee') IS NOT NULL
    DROP TABLE #Employee
COMMIT TRAN
GO
SELECT * FROM Employee
GO

3. Creating new table with distinct records and renaming it

In this approach we create a new table with all the distinct records, drop the existing target table, and rename the newly created table to the original target table name. Please note, with this approach the metadata about the target table will change (for example, object id, object creation date, etc.), so if you have any dependencies on these you have to take them into consideration. There are three things you need to be aware of when using this approach. First, you need to make sure you have enough space in your database in the default filegroup to hold all the distinct records, especially if the result set is very large (if you want your new table to be on a filegroup other than the default, you need to create the table first and then use INSERT INTO....SELECT * FROM). Second, you need to perform this operation in a transaction, at least the DROP and RENAME parts, so that you are not left with another problem if it fails part-way through for any reason. Third, you need to have the required permissions for object creation/drop.

Script #4 - New table with distinct records only

BEGIN TRAN
-- Pull distinct records into a new table
SELECT DISTINCT *
INTO EmployeeNew
FROM Employee

--Drop the old target table
DROP TABLE Employee

--Rename the new table
EXEC sp_rename 'EmployeeNew', 'Employee'
COMMIT TRAN
GO
SELECT * FROM Employee
GO

4. Using Common Table Expression (CTE)

SQL Server 2005 introduced the Common Table Expression (CTE), which acts as a temporary result set defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. In this example I am using a CTE for de-duplication. I am using the ROW_NUMBER function to return the sequential number of each row within a partition of the result set, where the partition is a grouping based on the [FirstName], [LastName] and [Address] columns, and then I am deleting all records except where the sequential number is 1. This means keeping one record from each group and deleting all the other similar/duplicate records. This is one of the most efficient methods, and I would suggest using it if you have SQL Server 2005 or 2008.

Script #5 - Using CTE for de-duplication

--example 1
WITH CTE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY [FirstName], [LastName], [Address]
                              ORDER BY [FirstName] DESC, [LastName] DESC, [Address] DESC) AS RowNumber
    FROM Employee tbl
    WHERE EXISTS ( SELECT TOP 1 1
                   FROM ( SELECT FirstName, LastName, Address
                          FROM Employee
                          GROUP BY [FirstName], [LastName], [Address]
                          HAVING COUNT(*) > 1 ) GrpTable
                   WHERE GrpTable.FirstName = tbl.FirstName
                     AND GrpTable.LastName = tbl.LastName
                     AND GrpTable.Address = tbl.Address )
)
DELETE FROM CTE WHERE RowNumber > 1
GO
SELECT * FROM Employee
GO

--A more simplified and faster example
WITH CTE AS
(
    SELECT ROW_NUMBER() OVER

        (PARTITION BY [FirstName], [LastName], [Address]
         ORDER BY [FirstName] DESC, [LastName] DESC, [Address] DESC) AS RowNumber,
        [FirstName], [LastName], [Address]
    FROM Employee tbl
)
DELETE FROM CTE WHERE RowNumber > 1
GO
SELECT * FROM Employee
GO
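Because deleting through a CTE is destructive, it can be worth previewing which rows the DELETE will remove by running the same CTE with a SELECT first. A minimal sketch, using the same Employee table from Script #2:

```sql
-- Preview only: same partitioning as the de-duplication CTE above,
-- but a SELECT instead of a DELETE, so nothing is modified.
WITH CTE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY [FirstName], [LastName], [Address]
                              ORDER BY [FirstName] DESC) AS RowNumber,
           [FirstName], [LastName], [Address]
    FROM Employee
)
SELECT * FROM CTE WHERE RowNumber > 1   -- the rows the DELETE would remove
```

If the preview looks right, swapping the final SELECT for DELETE FROM CTE performs the actual clean-up.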

5. Using Fuzzy Group Transformation in SSIS

If you are using SSIS to upload data to your target table, you can use a Fuzzy Grouping Transformation before inserting records into the destination table, so that only unique records are inserted and duplicate records are ignored. Here, in the image below, you can see 9 records coming from the source, but only 3 records being inserted into the target table; that's because only 3 of the 9 records are unique. Refer to Script #2 above to see more about these 9 records.

In the Fuzzy Grouping Transformation editor, on the Columns tab, you specify the columns that you want included in the grouping. As you can see in the image below, I have chosen all 3 columns for consideration in the grouping.

After the Fuzzy Grouping Transformation, you might add a Conditional Split transformation to direct unique rows and duplicate rows to two different destinations. In the example you can see I am routing all the unique rows to the destination table and ignoring the duplicate records. The Fuzzy Grouping Transformation produces a few additional columns, such as _key_in, which uniquely identifies each row, and _key_out, which identifies a group of duplicate records.
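As a sketch of that routing: in the Fuzzy Grouping output, _key_out holds the _key_in value of the canonical row of each group, so a Conditional Split condition like the following passes exactly one row per group (the output name "Unique Rows" is just an illustrative label):

```
-- SSIS Conditional Split expression for the "Unique Rows" output
_key_in == _key_out
```

Rows failing this condition are the duplicates; they can be routed to a second output or simply left unconnected so they are discarded.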

6. Using MERGE Statement

Beginning with SQL Server 2008, you can use the MERGE command to perform INSERT/UPDATE/DELETE operations in a single statement. This command is similar to the UPSERT (a fusion of the words UPDATE and INSERT) functionality found in other database systems: it inserts rows that don't exist and updates the rows that do exist. With the introduction of MERGE, developers can more effectively handle common data warehousing scenarios, such as checking whether a row exists and then executing an insert, update or delete. The MERGE statement merges data from a source result set into a target table based on a condition that you specify, according to whether the data from the source already exists in the target or not. It combines the sequence of conditional INSERT, UPDATE and DELETE commands into a single atomic statement, depending on the existence of a record. With this you can make sure no duplicate records are inserted into the target table: existing records are updated if there is any change, and only new records which do not already exist in the target are inserted.
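As a minimal sketch of how this looks for the Employee table from Script #2 (the staging table EmployeeStaging is hypothetical, standing in for whatever feed your load process uses):

```sql
-- Hypothetical scenario: EmployeeStaging holds the incoming feed,
-- Employee is the target table from Script #2.
-- The DISTINCT collapses duplicates within the feed itself, so two
-- identical staging rows cannot both be inserted by the same MERGE.
MERGE Employee AS tgt
USING ( SELECT DISTINCT [FirstName], [LastName], [Address]
        FROM EmployeeStaging ) AS src
   ON  tgt.[FirstName] = src.[FirstName]
   AND tgt.[LastName]  = src.[LastName]
   AND tgt.[Address]   = src.[Address]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([FirstName], [LastName], [Address])
    VALUES (src.[FirstName], src.[LastName], src.[Address]);
```

Because every column takes part in the ON condition here, there is nothing left to update; in a real table with a key plus non-key columns you would match on the key and add a WHEN MATCHED THEN UPDATE clause for the rest.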
