Data Management – Finding, Removing Duplicate Rows Using SQL and Some Prevention Tips

find and remove duplicate dataDuplicate data is a common problem and in my experience comes in two main flavours; exact duplicate data and similar data. Exact duplicate data is when a data source contains the same values across all fields in the record. Similar data is when some of the fields match and to the naked eye can be classed as duplicate data but in fact it isn’t – it’s similar data and therefore it is unique.

In the case of similar data, an example might be where a file has been supplied which contains names and addresses. The file has been maintained in a spreadsheet and by more than one person. For whatever reason, the same name and address has been entered twice on two rows, except one editor did not include the postal code for the address. As a result, the rows are similar but not exact duplicates of each other. Apart from this, they are duplicate data.

Typically when loading data into a database table with unique constraints, de-duping the incoming data is a necessary step. For the purposes of this article, we will cover identifying and addressing exact duplicate data.

Lets imagine a scenario where you have been asked to de-duplicate a list of item codes and the data looks like this:

ItemCode
123456789
123456789
223344789
223344789
987554321
987554321

CREATE TABLE #ItemCodes
(
  ItemCode VARCHAR(50)
);

INSERT INTO #ItemCodes
VALUES
('123456789'),
('123456789'),
('223344789'),
('223344789'),
('987554321'),
('987554321');

Identifying Duplicate Rows Using SQL

The first step is to find the duplicate data. This can be easily achieved using a simple select query in combination GROUP BY and HAVING. If we loaded this data into a staging table with no constraints, we could proceed to query the data like this:

SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM #ItemCodes
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1

The resulting output reduces our 6 records to 3 because the the HAVING clause only returns the records that have more than one instance of ItemCode occurring.

How to De-duplicate Duplicate Data

Now that we have identified the duplicate data, we can do something about it. We also know how many duplicates we are dealing with. In the instance of a 6 row table, it could be easy just to manually remove the rows :) However these tasks are typically more complicated involving many more rows than this.

The first step is to make the data unique and the easiest way to do this is add some unique identity column to the data. If the duplicate data is already in a table in production, adding a unique identifier is still the way to go but this might mean exporting the data first to a new table, deleting it from the source table and then re-importing after de-duplication has happened.

If the data is being loaded into a staging table then it is easy enough to add a column for the unique identity.

ALTER TABLE #ItemCodes 
ADD ID INT IDENTITY(1,1) NOT NULL;

Querying the data now, it is unique:

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
123456789                                          2
223344789                                          3
223344789                                          4
987554321                                          5
987554321                                          6

(6 row(s) affected)

Lets now delete that duplicate data!

We’re ready to remove the duplicate data from our table. This is done using a DELETE statement and a self join, like this:

DELETE t1
FROM #ItemCodes t1
JOIN #ItemCodes t2 
ON t1.ItemCode = t2.ItemCode 
AND t1.ID > t2.ID;

(3 row(s) affected)

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
223344789                                          3
987554321                                          5

(3 row(s) affected)

We’re now left with 3 rows, instead of 6. So our ItemCodes are now unique and can be loaded into our main database table. As you can see, ID’s 2, 4 and 6 were removed within the JOIN where t1.ID > t2.ID

Beware of lots of data and performance degeneration!

In an older post, I wrote about deleting rows from a table in a way which did not negatively effect performance too much. It’s worth a read as it illustrates what can happen if large transactions are issued in order to remove data. If you are dealing with large volumes of data, consider how to break the delete operation down into smaller parts to reduce the impact on your system if other users are querying the server.

Prevention is better than the cure!

Duplicate data can cost your business a lot of money in lost productivity time and incorrect results. Cleaning up the data is an expensive operation and can be ongoing until the source of the problem is addressed. Rather than continually repeating de-duplication tasks, communicate to the business the need to tackle the problem at the source.

As I referred to earlier, there are two forms of duplicate data, similar and exact. Preventing exact duplicate data means looking at your database and ensuring that the correct constraints are in place to prevent duplicate values. Such constraints being primary keys and unique column constraints.

When tackling similar duplicate data this can be a combination of ensuring that there are the appropriate constraints in place at both the database and application layers. For example if you have invested in a CRM system, whether written in-house of purchased off the shelf/cloud, make sure that it has sufficient duplicate detectors in place to at least warn the user that when entering a record, that there is a chance that the record is a duplicate.

If you have an application that has been written in-house for some function such as customer entry, ensure that there is address lookup functionality to help ensure that address information is entered in a clean way to help minimize the chances of mistyped address data, resulting in similar data.

Be proactive!

Similar duplicate data can creep into your system because it passes the application and database checks. I refer back to the example earlier in the post when a customer record existed twice, one with and one without a postal code. It’s important to realize early that a proactive approach to managing the duplicates can reap long term benefits, reducing overall cost to the business because the duplicates are addressed in a proactive way.

Some systems will provide reports which can help you identify duplicate data. Alternatively, you can run some SQL to identify duplicate rows in your tables and take the appropriate steps. Lets go back to our earlier example where we utilized the GROUP BY HAVING SQL but this time we will enhance it.

There is a table used in this query to contain items with descriptions, when they were added and by who. The SQL below shows the results of the GROUP BY and performs a JOIN back to the Items table to retrieve all fields for rows which have duplicate item codes. This helps provide a view of similar data with duplicate item codes.

CREATE TABLE Items
(
  ItemCode VARCHAR(50)
, DateAdded DATETIME
, Item VARCHAR(50)
, AddedByUser VARCHAR(10)
);

WITH Dupes
AS
(SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM Items
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1)

SELECT * 
FROM Items
JOIN Dupes 
ON Dupes.ItemCode = Items.ItemCode;

I hope you find this post useful. :)

T-SQL – Determine the Difference in Values Between Columns in Different Rows

Calculate Difference Between Columns in Different Rows

Have you ever needed to calculate the difference between columns in two or more rows in a table using a T-SQL query? For example you might want to run some trend analysis or audit data for changes. Thankfully using ROW_NUMBER() and Common Table … [Continue reading]

SQL – Using CASE with ORDER BY in Stored Procedures

When writing SQL it is possible to use ORDER BY with CASE conditions to help provide more flexibility over the returned order of a record set and this is especially useful when writing stored procedures which require a dynamic ORDER BY clause. This … [Continue reading]

How to Produce CSV Format Using T-SQL

There may be some requirements where you need to produce a csv output from a table. There are some good, efficient ways to do this and there are also some less efficient ways. Let us now look at one optimized approach and a couple of lesser optimized … [Continue reading]

T-SQL – Derived Tables Demonstrated and Explained

T-SQL is a very powerful language when dealing with sets and in this post, I am going to be demonstrating how to use derived tables.Derived tables are not the same as a temporary or physical tables where the fields and data types are declared. … [Continue reading]