DBA Diaries

Thoughts and experiences of a DBA working with SQL Server and MySQL

T-SQL – How to Select Top N Rows for Each Group Using ROW_NUMBER()

Posted on July 11, 2015 Written by Andy Hayes 19 Comments

I’ve recently been working on a data migration project and have found myself using the ROW_NUMBER() function in SQL Server a lot. This function has been around since SQL Server 2005 and, at its core, provides sequential numbering for the rows returned by a query.

One of the requirements for an export process was to return the most recent two orders for each customer account and output them to a CSV file.

As well as sequential numbering, ROW_NUMBER() provides a way to partition the results returned so this suited me just fine 🙂

Let’s take a look at the problem in more detail. I’m going to use our good friend, the AdventureWorks database, to help demonstrate.

The first query returns the data by customer and with the most recent orders for each customer at the top.

SELECT SalesOrderID
      ,OrderDate
      ,SalesOrderNumber
      ,AccountNumber
      ,CustomerID
      ,SubTotal
      ,TaxAmt
      ,TotalDue
  FROM [AdventureWorks2012].[Sales].[SalesOrderHeader] 
ORDER BY CustomerID, OrderDate DESC

Here are the results of this query sampled.

[Screenshot: sample results ordered by CustomerID, with the most recent orders first]

How to return the top two rows for each group in the result set?

We will use ROW_NUMBER() to place a sequential id on the results, but also partition the results so that the sequential id resets for each CustomerID value.

SELECT SalesOrderID
      ,OrderDate
      ,SalesOrderNumber
      ,AccountNumber
      ,CustomerID
      ,SubTotal
      ,TaxAmt
      ,TotalDue
    ,ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY CustomerID, OrderDate DESC) AS RowNum
FROM [AdventureWorks2012].[Sales].[SalesOrderHeader]

This returns a new column called RowNum which provides the sequential numbering for each instance of CustomerID, as determined by the PARTITION BY and ORDER BY in the OVER clause.

[Screenshot: results with the RowNum column restarting at 1 for each CustomerID]

All that remains now is to select the first two rows for each CustomerID, so I am wrapping the query up in a CTE and returning the first two rows using the WHERE clause.

WITH MyRowSet
AS
(
SELECT SalesOrderID
      ,OrderDate
      ,SalesOrderNumber
      ,AccountNumber
      ,CustomerID
      ,SubTotal
      ,TaxAmt
      ,TotalDue
    ,ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY CustomerID, OrderDate DESC) AS RowNum
FROM [AdventureWorks2012].[Sales].[SalesOrderHeader] 
)

SELECT * FROM MyRowSet WHERE RowNum <= 2

Which returns our desired result…

[Screenshot: final results showing the two most recent orders per CustomerID]

I hope you found this post useful. 🙂

Filed Under: All Articles, SQL Tips and Tricks Tagged With: sql server, t-sql

Data Management – Finding, Removing Duplicate Rows Using SQL and Some Prevention Tips

Posted on July 4, 2015 Written by Andy Hayes 2 Comments

Duplicate data is a common problem and in my experience comes in two main flavours: exact duplicate data and similar data. Exact duplicate data is when a data source contains the same values across all fields in the record. Similar data is when some of the fields match; to the naked eye the records look like duplicates, but because at least one field differs, each row is technically unique.

In the case of similar data, an example might be where a file has been supplied which contains names and addresses. The file has been maintained in a spreadsheet by more than one person. For whatever reason, the same name and address has been entered twice on two rows, except one editor did not include the postal code for the address. As a result, the rows are similar but not exact duplicates of each other, even though they describe the same person.

Typically when loading data into a database table with unique constraints, de-duping the incoming data is a necessary step. For the purposes of this article, we will cover identifying and addressing exact duplicate data.

Let’s imagine a scenario where you have been asked to de-duplicate a list of item codes and the data looks like this:

ItemCode
123456789
123456789
223344789
223344789
987554321
987554321

CREATE TABLE #ItemCodes
(
  ItemCode VARCHAR(50)
);

INSERT INTO #ItemCodes
VALUES
('123456789'),
('123456789'),
('223344789'),
('223344789'),
('987554321'),
('987554321');

Identifying Duplicate Rows Using SQL

The first step is to find the duplicate data. This can be easily achieved using a simple SELECT query in combination with GROUP BY and HAVING. If we loaded this data into a staging table with no constraints, we could proceed to query the data like this:

SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM #ItemCodes
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1

The resulting output reduces our 6 records to 3 because the HAVING clause only returns the ItemCode values that occur more than once.

How to De-duplicate Duplicate Data

Now that we have identified the duplicate data, we can do something about it. We also know how many duplicates we are dealing with. With a 6 row table, it would be easy just to remove the rows manually 🙂 However, these tasks are typically more complicated, involving many more rows than this.

The first step is to make the data unique, and the easiest way to do this is to add a unique identity column to the data. If the duplicate data is already in a production table, adding a unique identifier is still the way to go, but this might mean exporting the data to a new table first, deleting it from the source table and then re-importing it after de-duplication has happened.
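As a rough sketch of that production workflow (the Production.ItemCodes table name is hypothetical), the duplicated rows could be copied out to a working table, removed from the source, and the survivors re-imported:

```sql
-- copy only the duplicated rows out to a working table (table names are illustrative)
SELECT ItemCode
INTO #DedupeWork
FROM Production.ItemCodes
WHERE ItemCode IN (SELECT ItemCode
                   FROM Production.ItemCodes
                   GROUP BY ItemCode
                   HAVING COUNT(ItemCode) > 1);

-- remove those rows from the source table
DELETE FROM Production.ItemCodes
WHERE ItemCode IN (SELECT ItemCode FROM #DedupeWork);

-- de-duplicate the working copy and re-import one row per code
INSERT INTO Production.ItemCodes (ItemCode)
SELECT DISTINCT ItemCode
FROM #DedupeWork;
```

Here DISTINCT is enough because the rows are exact duplicates; the identity-column approach below is the more general tool when you need to choose which copy survives.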

If the data is being loaded into a staging table then it is easy enough to add a column for the unique identity.

ALTER TABLE #ItemCodes 
ADD ID INT IDENTITY(1,1) NOT NULL;

Querying the data now, it is unique:

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
123456789                                          2
223344789                                          3
223344789                                          4
987554321                                          5
987554321                                          6

(6 row(s) affected)

Let’s now delete that duplicate data!

We’re ready to remove the duplicate data from our table. This is done using a DELETE statement and a self join, like this:

DELETE t1
FROM #ItemCodes t1
JOIN #ItemCodes t2 
ON t1.ItemCode = t2.ItemCode 
AND t1.ID > t2.ID;

(3 row(s) affected)

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
223344789                                          3
987554321                                          5

(3 row(s) affected)

We’re now left with 3 rows instead of 6, so our ItemCodes are now unique and can be loaded into our main database table. As you can see, IDs 2, 4 and 6 were removed by the join condition t1.ID > t2.ID.

Beware of large data volumes and performance degradation!

In an older post, I wrote about deleting rows from a table in a way which did not negatively affect performance too much. It’s worth a read, as it illustrates what can happen if large transactions are issued to remove data. If you are dealing with large volumes of data, consider breaking the delete operation down into smaller parts to reduce the impact on other users querying the server.
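As a minimal sketch of that batching idea (the batch size is illustrative), the duplicate-removal DELETE above can be limited with TOP so the work is committed in smaller transactions:

```sql
-- remove duplicates in small batches rather than one large transaction
-- (batch size of 10000 is an arbitrary example; tune it for your system)
DECLARE @BatchSize INT = 10000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize) t1
    FROM #ItemCodes t1
    JOIN #ItemCodes t2
    ON t1.ItemCode = t2.ItemCode
    AND t1.ID > t2.ID;

    -- stop once a pass deletes nothing, i.e. no duplicates remain
    IF @@ROWCOUNT = 0 BREAK;
END
```

Each iteration is its own transaction, so locks are held for less time and the transaction log grows in smaller increments.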

Prevention is better than the cure!

Duplicate data can cost your business a lot of money in lost productivity time and incorrect results. Cleaning up the data is an expensive operation and can be ongoing until the source of the problem is addressed. Rather than continually repeating de-duplication tasks, communicate to the business the need to tackle the problem at the source.

As I referred to earlier, there are two forms of duplicate data: similar and exact. Preventing exact duplicate data means looking at your database and ensuring that the correct constraints are in place to prevent duplicate values, such as primary keys and unique column constraints.
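As a minimal sketch (the table and constraint names are illustrative), a unique constraint makes the database itself reject exact duplicates at the source:

```sql
CREATE TABLE ItemCodes
(
  ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,          -- surrogate primary key
  ItemCode VARCHAR(50) NOT NULL,
  CONSTRAINT UQ_ItemCodes_ItemCode UNIQUE (ItemCode)  -- rejects exact duplicate codes
);

INSERT INTO ItemCodes (ItemCode) VALUES ('123456789');
-- a second insert of the same code now fails with a unique constraint violation
INSERT INTO ItemCodes (ItemCode) VALUES ('123456789');
```

With the constraint in place, de-duplication moves from a recurring clean-up task to a one-time error handled at load time.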

Tackling similar duplicate data can require a combination of constraints at both the database and application layers. For example, if you have invested in a CRM system, whether written in-house or purchased off the shelf or in the cloud, make sure that it has sufficient duplicate detection in place to at least warn the user, when entering a record, that the record may be a duplicate.

If you have an in-house application for some function such as customer entry, ensure that it has address lookup functionality so that address information is entered in a clean way, minimizing the chances of mistyped addresses resulting in similar data.

Be proactive!

Similar duplicate data can creep into your system because it passes the application and database checks. I refer back to the example earlier in the post where a customer record existed twice, once with and once without a postal code. It’s important to realize early that proactively managing duplicates can reap long term benefits, reducing the overall cost to the business.

Some systems will provide reports which can help you identify duplicate data. Alternatively, you can run some SQL to identify duplicate rows in your tables and take the appropriate steps. Let’s go back to our earlier example where we used GROUP BY with HAVING, but this time we will enhance it.

This query uses a table containing items with descriptions, when they were added and by whom. The SQL below takes the results of the GROUP BY and performs a JOIN back to the Items table to retrieve all fields for rows which have duplicate item codes. This helps provide a view of similar data sharing duplicate item codes.

CREATE TABLE Items
(
  ItemCode VARCHAR(50)
, DateAdded DATETIME
, Item VARCHAR(50)
, AddedByUser VARCHAR(10)
);

WITH Dupes
AS
(SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM Items
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1)

SELECT * 
FROM Items
JOIN Dupes 
ON Dupes.ItemCode = Items.ItemCode;

I hope you find this post useful. 🙂

Filed Under: All Articles, SQL Tips and Tricks Tagged With: data management, sql, sql server, t-sql

T-SQL – Determine the Difference in Values Between Columns in Different Rows

Posted on June 21, 2015 Written by Andy Hayes Leave a Comment

Have you ever needed to calculate the difference between columns in two or more rows of a table using a T-SQL query? For example, you might want to run some trend analysis or audit data for changes. Thankfully, using ROW_NUMBER() and Common Table Expressions (CTEs), this can easily be achieved.

--create our table
DECLARE @TableRows TABLE
(
DateAdded SMALLDATETIME,
Points TINYINT
);

--insert some data
INSERT INTO @TableRows
VALUES
('20150615 15:11:05',10),
('20150615 16:04:23',15),
('20150615 16:23:01',21),
('20150615 16:30:50',22),
('20150615 17:04:07',30);

--set up a CTE which we will perform a self join on
WITH ExampleCTE
AS
(SELECT 
	  ROW_NUMBER() OVER(ORDER BY DateAdded) AS RowNum
	, DateAdded
	, Points
FROM @TableRows)

--now query the CTE using the self join to get our result
SELECT 
	  t1.DateAdded
	, t2.DateAdded
	, t1.Points
	, t2.Points
	, t1.Points - t2.Points AS PointsDifference
FROM ExampleCTE t1
LEFT JOIN ExampleCTE t2 ON t1.RowNum = t2.RowNum + 1

The resulting output shows the calculated column “PointsDifference” containing the difference between the current row and the row previous to it.

[Screenshot: result set with the calculated PointsDifference column]

The magic happens in the LEFT JOIN part of the query. By adding 1 to the value of the RowNum column, the join matches each row to the row before it, allowing the difference between the “Points” columns to be accurately returned.
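For comparison, on SQL Server 2012 and later the LAG() window function can express the same previous-row lookup without a self join. This is an alternative sketch reusing the @TableRows variable from the example above, not what the query in this post uses:

```sql
SELECT
	  DateAdded
	, Points
	-- LAG(Points) returns the Points value from the previous row
	-- (NULL for the first row), ordered by DateAdded
	, Points - LAG(Points) OVER (ORDER BY DateAdded) AS PointsDifference
FROM @TableRows;
```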

So simple, so effective. I love T-SQL! 🙂

Filed Under: All Articles, SQL Tips and Tricks Tagged With: sql server, sql server 2008, sql server 2012, t-sql

SQL – Using CASE with ORDER BY in Stored Procedures

Posted on May 31, 2015 Written by Andy Hayes 2 Comments

When writing SQL, it is possible to use CASE conditions with ORDER BY to provide more flexibility over the returned order of a record set. This is especially useful when writing stored procedures which require a dynamic ORDER BY clause, and it does not require building up any dynamic SQL strings to be executed with EXEC or sp_executesql.

Let’s take this example using the AdventureWorks database that is downloadable for SQL Server.

This procedure is designed to accept three parameters which return records for a sales person, sorted by a specified column in a particular sort order.

CREATE PROCEDURE usp_GetOrders @SalesPersonID INT, @OrderBy VARCHAR(20), @SortOrder CHAR(4)
AS
SELECT SalesOrderNumber, PurchaseOrderNumber, TotalDue, OrderDate, DueDate, AccountNumber  
FROM Sales.SalesOrderHeader
WHERE SalesPersonID = @SalesPersonID
ORDER BY 
  CASE WHEN @OrderBy = 'Due Date' AND @SortOrder = 'DESC' THEN DueDate END DESC,
  CASE WHEN @OrderBy = 'Due Date' THEN DueDate END,
  CASE WHEN @OrderBy = 'Order Date' AND @SortOrder = 'DESC' THEN OrderDate END DESC,
  CASE WHEN @OrderBy = 'Order Date' THEN OrderDate END,
  CASE WHEN @OrderBy = 'Total Due' AND @SortOrder = 'DESC' THEN TotalDue END DESC,
  CASE WHEN @OrderBy = 'Total Due' THEN TotalDue END
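To illustrate usage (the SalesPersonID value is just an example), the procedure can then be called like this:

```sql
-- most recent due dates first for one sales person
EXEC usp_GetOrders @SalesPersonID = 279, @OrderBy = 'Due Date', @SortOrder = 'DESC';

-- total due, smallest first (anything other than 'DESC' falls through to ASC)
EXEC usp_GetOrders @SalesPersonID = 279, @OrderBy = 'Total Due', @SortOrder = 'ASC';
```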

One thing you might notice is that I have supplied the DESC version of the CASE first for each column. There are two reasons for this: the default sort order is ASC, so I have not supplied an explicit ASC condition in the second CASE for each column; but because of that, the sort will not work correctly if the DESC line comes second, as the unconditional first line will always produce a sort key and the ASC ordering will win.

So this:

CASE WHEN @OrderBy = 'Due Date' THEN DueDate END,  
CASE WHEN @OrderBy = 'Due Date' AND @SortOrder = 'DESC' THEN DueDate END DESC,

will always return the data in ASC order, regardless of whether I supply DESC in @SortOrder. In that case, either make DESC the first line as originally written, or rewrite it as:

CASE WHEN @OrderBy = 'Due Date' AND @SortOrder = 'ASC' THEN DueDate END,  
CASE WHEN @OrderBy = 'Due Date' AND @SortOrder = 'DESC' THEN DueDate END DESC,

I hope you found this useful. If you know of better ways to achieve this, please leave a comment. I’d love to hear from you! 🙂

Filed Under: All Articles, SQL Tips and Tricks Tagged With: sql server, t-sql
