DBA Diaries

Thoughts and experiences of a DBA working with SQL Server and MySQL

Data Management – Finding, Removing Duplicate Rows Using SQL and Some Prevention Tips

Posted on July 4, 2015 Written by Andy Hayes

Duplicate data is a common problem and in my experience comes in two main flavours: exact duplicate data and similar data. Exact duplicate data is when a data source contains the same values across all fields in the record. Similar data is when some of the fields match and, to the naked eye, the records look like duplicates, but in fact they differ in at least one field and are therefore technically unique.

In the case of similar data, an example might be a supplied file of names and addresses which has been maintained in a spreadsheet by more than one person. For whatever reason, the same name and address has been entered on two rows, except that one editor did not include the postal code for the address. As a result, the rows are similar but not exact duplicates of each other; apart from the missing postal code, they represent the same record.

Typically when loading data into a database table with unique constraints, de-duping the incoming data is a necessary step. For the purposes of this article, we will cover identifying and addressing exact duplicate data.

Let's imagine a scenario where you have been asked to de-duplicate a list of item codes and the data looks like this:

ItemCode
123456789
123456789
223344789
223344789
987554321
987554321

CREATE TABLE #ItemCodes
(
  ItemCode VARCHAR(50)
);

INSERT INTO #ItemCodes
VALUES
('123456789'),
('123456789'),
('223344789'),
('223344789'),
('987554321'),
('987554321');

Identifying Duplicate Rows Using SQL

The first step is to find the duplicate data. This can be easily achieved using a simple SELECT query in combination with GROUP BY and HAVING. If we load this data into a staging table with no constraints, we can query the data like this:

SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM #ItemCodes
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1
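
With the sample data loaded, the result set looks like this:

ItemCode                                           Duplicates
-------------------------------------------------- -----------
123456789                                          2
223344789                                          2
987554321                                          2

(3 row(s) affected)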

The resulting output reduces our 6 records to 3 because the HAVING clause only returns item codes that occur more than once.

How to De-duplicate Duplicate Data

Now that we have identified the duplicate data, we can do something about it. We also know how many duplicates we are dealing with. With a 6 row table, it would be easy just to remove the rows manually 🙂 However, these tasks are typically more complicated and involve many more rows than this.

The first step is to make each row unique and the easiest way to do this is to add a unique identity column to the data. If the duplicate data is already in a production table, adding a unique identifier is still the way to go, but this might mean exporting the data to a new table first, deleting it from the source table and then re-importing it after de-duplication has happened.

If the data is being loaded into a staging table then it is easy enough to add a column for the unique identity.

ALTER TABLE #ItemCodes 
ADD ID INT IDENTITY(1,1) NOT NULL;

Querying the data now, it is unique:

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
123456789                                          2
223344789                                          3
223344789                                          4
987554321                                          5
987554321                                          6

(6 row(s) affected)

Let's now delete that duplicate data!

We’re ready to remove the duplicate data from our table. This is done using a DELETE statement and a self join, like this:

DELETE t1
FROM #ItemCodes t1
JOIN #ItemCodes t2 
ON t1.ItemCode = t2.ItemCode 
AND t1.ID > t2.ID;

(3 row(s) affected)

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
223344789                                          3
987554321                                          5

(3 row(s) affected)

We’re now left with 3 rows instead of 6, so our item codes are now unique and can be loaded into our main database table. As you can see, IDs 2, 4 and 6 were removed because, for each pair of matching item codes, the join condition t1.ID > t2.ID matches only the row with the higher ID.
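
As an aside, on SQL Server 2005 and later the same result can be achieved without adding an identity column, by deleting through a CTE that numbers each row within its group of identical item codes. This is just a sketch of the alternative, not a change to the approach above:

WITH Ranked
AS
(SELECT 
  ROW_NUMBER() OVER(PARTITION BY ItemCode ORDER BY (SELECT NULL)) AS RowNum
FROM #ItemCodes)

--deleting from the CTE deletes the underlying rows in #ItemCodes
DELETE FROM Ranked
WHERE RowNum > 1; --removes every row after the first in each group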

Beware of lots of data and performance degradation!

In an older post, I wrote about deleting rows from a table in a way which did not negatively affect performance too much. It’s worth a read as it illustrates what can happen when large transactions are issued to remove data. If you are dealing with large volumes of data, consider breaking the delete operation down into smaller parts to reduce the impact on your system if other users are querying the server.
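
As a rough sketch of that idea (the table name and batch size here are hypothetical; tune them for your system), the duplicate delete can be capped per iteration so that each transaction stays small:

DECLARE @BatchSize INT = 10000; --hypothetical batch size

WHILE 1 = 1
BEGIN
  --remove at most @BatchSize duplicate rows per transaction
  DELETE TOP (@BatchSize) t1
  FROM dbo.BigItemCodes t1 --hypothetical large table
  JOIN dbo.BigItemCodes t2
    ON t1.ItemCode = t2.ItemCode
    AND t1.ID > t2.ID;

  IF @@ROWCOUNT = 0 BREAK; --nothing left to delete
END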

Prevention is better than the cure!

Duplicate data can cost your business a lot of money in lost productivity time and incorrect results. Cleaning up the data is an expensive operation and can be ongoing until the source of the problem is addressed. Rather than continually repeating de-duplication tasks, communicate to the business the need to tackle the problem at the source.

As I referred to earlier, there are two forms of duplicate data, similar and exact. Preventing exact duplicate data means looking at your database and ensuring that the correct constraints, namely primary keys and unique column constraints, are in place to prevent duplicate values.
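
For example, if the Items table introduced later in this post were required to hold unique item codes (it deliberately is not, so that the duplicate-detection demo works), a unique constraint would reject exact duplicates at the point of insertion. The constraint name here is just illustrative:

ALTER TABLE Items
ADD CONSTRAINT UQ_Items_ItemCode UNIQUE (ItemCode);

--any INSERT repeating an existing ItemCode now fails with error 2627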

Tackling similar duplicate data can mean a combination of appropriate checks at both the database and application layers. For example, if you have invested in a CRM system, whether written in-house or purchased off the shelf/cloud, make sure that it has sufficient duplicate detection in place to at least warn the user, when entering a record, that there is a chance the record is a duplicate.

If you have an application that has been written in-house for some function such as customer entry, make sure it includes address lookup functionality so that address information is entered in a clean way, minimizing the chances of mistyped addresses resulting in similar data.

Be proactive!

Similar duplicate data can creep into your system because it passes the application and database checks. I refer back to the example earlier in the post where a customer record existed twice, once with and once without a postal code. It’s important to realize early that a proactive approach to managing duplicates can reap long term benefits and reduce the overall cost to the business.

Some systems will provide reports which can help you identify duplicate data. Alternatively, you can run some SQL to identify duplicate rows in your tables and take the appropriate steps. Let's go back to our earlier GROUP BY ... HAVING query, but this time we will enhance it.

This query uses a table containing items with descriptions, when they were added and by whom. The SQL below takes the results of the GROUP BY and performs a JOIN back to the Items table to retrieve all fields for rows whose item codes are duplicated. This helps provide a view of similar data sharing the same item code.

CREATE TABLE Items
(
  ItemCode VARCHAR(50)
, DateAdded DATETIME
, Item VARCHAR(50)
, AddedByUser VARCHAR(10)
);

WITH Dupes
AS
(SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM Items
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1)

SELECT * 
FROM Items
JOIN Dupes 
ON Dupes.ItemCode = Items.ItemCode;
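
To illustrate with some hypothetical rows (these values are made up for demonstration):

INSERT INTO Items (ItemCode, DateAdded, Item, AddedByUser)
VALUES
('123456789', '20150701 09:15', 'Widget', 'andy'),
('123456789', '20150702 14:02', 'Widget A', 'sarah'),
('987554321', '20150701 10:30', 'Gadget', 'andy');

Against this data, the query returns only the two rows sharing item code 123456789, each carrying a Duplicates count of 2, giving a reviewer everything needed to decide which row to keep.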

I hope you find this post useful. 🙂

Filed Under: All Articles, SQL Tips and Tricks Tagged With: data management, sql, sql server, t-sql

T-SQL – Determine the Difference in Values Between Columns in Different Rows

Posted on June 21, 2015 Written by Andy Hayes

Have you ever needed to calculate the difference between column values in two or more rows of a table using a T-SQL query? For example, you might want to run some trend analysis or audit data for changes. Thankfully, using ROW_NUMBER() and Common Table Expressions (CTEs), this can easily be achieved.

--create our table
DECLARE @TableRows TABLE
(
DateAdded SMALLDATETIME,
Points TINYINT
);

--insert some data
INSERT INTO @TableRows
VALUES
('20150615 15:11:05',10),
('20150615 16:04:23',15),
('20150615 16:23:01',21),
('20150615 16:30:50',22),
('20150615 17:04:07',30);

--set up a CTE which we will perform a self join on
WITH ExampleCTE
AS
(SELECT 
	  ROW_NUMBER() OVER(ORDER BY DateAdded) AS RowNum
	, DateAdded
	, Points
FROM @TableRows)

--now query the CTE using the self join to get our result
SELECT 
	  t1.DateAdded
	, t2.DateAdded AS PreviousDateAdded
	, t1.Points
	, t2.Points AS PreviousPoints
	, t1.Points - t2.Points AS PointsDifference
FROM ExampleCTE t1
LEFT JOIN ExampleCTE t2 ON t1.RowNum = t2.RowNum + 1;
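
Running the whole batch gives a result along these lines (note that SMALLDATETIME rounds to the nearest minute, so '20150615 16:30:50' is stored as 16:31:00):

DateAdded           PreviousDateAdded   Points PreviousPoints PointsDifference
------------------- ------------------- ------ -------------- ----------------
2015-06-15 15:11:00 NULL                10     NULL           NULL
2015-06-15 16:04:00 2015-06-15 15:11:00 15     10             5
2015-06-15 16:23:00 2015-06-15 16:04:00 21     15             6
2015-06-15 16:31:00 2015-06-15 16:23:00 22     21             1
2015-06-15 17:04:00 2015-06-15 16:31:00 30     22             8

(5 row(s) affected)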

The resulting output above shows the calculated column “PointsDifference” containing the difference between the current row and the row previous to it.


The magic happens in the LEFT JOIN part of the query. By joining each row (t1) to the row whose RowNum is one less (t2.RowNum + 1 = t1.RowNum), every row is paired with the row before it, which allows the difference between the two “Points” columns to be accurately returned. The LEFT JOIN also ensures that the first row is still returned even though it has no predecessor.

So simple, so effective. I love T-SQL! 🙂

Filed Under: All Articles, SQL Tips and Tricks Tagged With: sql server, sql server 2008, sql server 2012, t-sql

SQL – Using CASE with ORDER BY in Stored Procedures

Posted on May 31, 2015 Written by Andy Hayes

When writing SQL, it is possible to use ORDER BY with CASE expressions to provide more flexibility over the returned order of a record set. This is especially useful when writing stored procedures which require a dynamic ORDER BY clause. This method does not require building up dynamic SQL strings and executing them with EXEC or sp_executesql.

Let's take this example using the AdventureWorks database that is downloadable for SQL Server.

The procedure below accepts three parameters and returns order records for a sales person, sorted on a specified column in a specified direction.

CREATE PROCEDURE usp_GetOrders @SalesPersonID INT, @OrderBy VARCHAR(20), @SortOrder CHAR(4)
AS
SELECT SalesOrderNumber, PurchaseOrderNumber, TotalDue, OrderDate, DueDate, AccountNumber  
FROM Sales.SalesOrderHeader
WHERE SalesPersonID = @SalesPersonID
ORDER BY 
  CASE WHEN @OrderBy = 'Due Date' AND @SortOrder = 'DESC' THEN DueDate END DESC,
  CASE WHEN @OrderBy = 'Due Date' THEN DueDate END,
  CASE WHEN @OrderBy = 'Order Date' AND @SortOrder = 'DESC' THEN OrderDate END DESC,
  CASE WHEN @OrderBy = 'Order Date' THEN OrderDate END,
  CASE WHEN @OrderBy = 'Total Due' AND @SortOrder = 'DESC' THEN TotalDue END DESC,
  CASE WHEN @OrderBy = 'Total Due' THEN TotalDue END
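
Calling the procedure might look like this (the sales person ID below is just a sample value; use any SalesPersonID present in Sales.SalesOrderHeader):

EXEC usp_GetOrders 
	  @SalesPersonID = 279
	, @OrderBy = 'Total Due'
	, @SortOrder = 'DESC';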

One thing you might notice is that I have supplied the DESC version of the CASE first for each column. There are two reasons for this. First, ASC is the default sort order, so I have not supplied an explicit ASC condition in the second CASE for each column. Second, and because of that, the order of the two lines matters: the unconditional CASE matches whenever @OrderBy names the column, so if it came first, the DESC CASE after it could never influence the ordering.

So this:

CASE WHEN @OrderBy = 'Due Date' THEN DueDate END,  
CASE WHEN @OrderBy = 'Due Date' AND @SortOrder = 'DESC' THEN DueDate END DESC,

will always return the data in ASC order regardless of whether I supply DESC in @SortOrder. In this case, either make DESC the first line, as written earlier, or re-write the pair as:

CASE WHEN @OrderBy = 'Due Date' AND @SortOrder = 'ASC' THEN DueDate END,  
CASE WHEN @OrderBy = 'Due Date' AND @SortOrder = 'DESC' THEN DueDate END DESC,

I hope you found this useful. If you know of better ways to achieve this, please leave a comment. I’d love to hear from you! 🙂

Filed Under: All Articles, SQL Tips and Tricks Tagged With: sql server, t-sql

How to Produce CSV Format Using T-SQL

Posted on May 30, 2015 Written by Andy Hayes

There may be requirements where you need to produce CSV output from a table. There are some good, efficient ways to do this and there are also some less efficient ways. Let us now look at one optimized approach and a couple of less optimized approaches 😉

SQL performs best working with sets

This is true and should always be the first thought when designing a query.

So instead of querying the database row by row, firing multiple SELECT statements or using loops or cursors, always try to think about how you can achieve the operation in a set-based way.

To help demonstrate this thought process, let's take this example table and its data.

T-SQL:

SET NOCOUNT ON;
DECLARE @Guid UNIQUEIDENTIFIER = NEWID()

CREATE TABLE CSVOutputDemo
(
ID INT PRIMARY KEY IDENTITY(1,1),
ReferenceId UNIQUEIDENTIFIER NOT NULL,
ImageName VARCHAR(500)
)

--add some data

INSERT INTO CSVOutputDemo(ReferenceId,ImageName)
VALUES
(@Guid, 'image1.jpg'),
(@Guid, 'image2.jpg'),
(@Guid, 'image3.jpg')

SELECT * FROM CSVOutputDemo;

ID          ReferenceId                          ImageName
----------- ------------------------------------ --------------------------------------------------
1           6083FFAF-8C29-489A-9486-07F2BABDC264 image1.jpg
2           6083FFAF-8C29-489A-9486-07F2BABDC264 image2.jpg
3           6083FFAF-8C29-489A-9486-07F2BABDC264 image3.jpg

The desired output is that we want all the values in the ImageName field to be returned from the table as one comma separated value string, aka CSV output:

image1.jpg,image2.jpg,image3.jpg

How to take three strings from three rows and turn them into one string on a single row?

You can take advantage of FOR XML which is used in conjunction with the SELECT statement. It’s really neat 🙂

By placing FOR XML PATH('') at the end of the query, you can see that the rows returned have already been reduced from 3 to 1:

SELECT ImageName FROM CSVOutputDemo FOR XML PATH('')

Returns…

XML_F52E2B61-18A1-11d1-B105-00805F49916B
<ImageName>image1.jpg</ImageName><ImageName>image2.jpg</ImageName><ImageName>image3.jpg</ImageName>

However, the output isn't quite as we want it yet, so we have to make some adjustments to add the commas:

SELECT ',' + ImageName FROM CSVOutputDemo FOR XML PATH('')

XML_F52E2B61-18A1-11d1-B105-00805F49916B
----------------------------------------
,image1.jpg,image2.jpg,image3.jpg

We now have the CSV, but there is a little tidying up to do to remove the comma at the start of the string. For this we will use STUFF, which here replaces the first character of the result (the leading comma) with an empty string.

SELECT STUFF ((SELECT ',' + ImageName FROM CSVOutputDemo FOR XML PATH('')), 1, 1, '') AS ImageCSV

ImageCSV
---------------------------------
image1.jpg,image2.jpg,image3.jpg
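
The same technique extends to grouped data. As a sketch (this goes beyond the original requirement, and note that FOR XML entitizes characters such as & into &amp; unless you use the FOR XML PATH(''), TYPE variant with .value()), you could produce one CSV string per ReferenceId using a correlated subquery:

SELECT 
	  o.ReferenceId
	, STUFF((SELECT ',' + i.ImageName
	         FROM CSVOutputDemo i
	         WHERE i.ReferenceId = o.ReferenceId
	         FOR XML PATH('')), 1, 1, '') AS ImageCSV
FROM CSVOutputDemo o
GROUP BY o.ReferenceId;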

FOR XML PATH('') is an elegant solution but what about the alternative row by row approach?

This example uses a WHILE loop

DECLARE @ID INT
DECLARE @ImageName VARCHAR(500) --sized to match the ImageName column and avoid truncation
DECLARE @ImageCSV VARCHAR(500) = ''

SET @ID = (SELECT MIN(ID) FROM CSVOutputDemo)
WHILE @ID IS NOT NULL
BEGIN
  SET @ImageName = (SELECT ImageName FROM CSVOutputDemo WHERE ID = @ID)
  SET @ImageCSV = @ImageCSV + ',' + @ImageName
  SET @ID = (SELECT MIN(ID) FROM CSVOutputDemo WHERE ID > @ID)
END

SELECT STUFF(@ImageCSV, 1, 1, '') AS ImageCSV

As you can see, this is a lot more code for the same output. There is an alternate row by row solution which uses a T-SQL cursor.

DECLARE @ID INT
DECLARE @ImageName VARCHAR(500) --sized to match the ImageName column and avoid truncation
DECLARE @ImageCSV VARCHAR(500) = ''

DECLARE CursorImage CURSOR FOR
SELECT ImageName FROM CSVOutputDemo
OPEN CursorImage
FETCH NEXT FROM CursorImage INTO @ImageName
WHILE @@FETCH_STATUS <> -1
BEGIN
	SET @ImageCSV = @ImageCSV + ',' + @ImageName
	FETCH NEXT FROM CursorImage INTO @ImageName
END
CLOSE CursorImage
DEALLOCATE CursorImage

SELECT STUFF(@ImageCSV, 1, 1, '') AS ImageCSV

The performance comparison is best seen when STATISTICS IO is enabled.
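
To reproduce the figures, enable the setting before running each version:

SET STATISTICS IO ON;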

Here are the results:

FOR XML PATH('')

Table 'CSVOutputDemo'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

T-SQL WHILE LOOP

Table 'CSVOutputDemo'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

T-SQL CURSOR

Table 'Worktable'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CSVOutputDemo'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The set based FOR XML PATH('') solution makes one trip to the database, however the other two solutions issue many more requests, which will have a negative effect on performance as data volumes grow.

Filed Under: All Articles, SQL Tips and Tricks Tagged With: performance, sql server, t-sql
