GROUPING SETS performance versus UNION performance

GROUPING SETS is a way of aggregating data to produce a result set containing different sets of columns.

For the SQL Server platform, it was first introduced in SQL 2008. Prior to SQL Server 2008, achieving this type of result set was possible using UNION, WITH CUBE, WITH ROLLUP and a helper function called GROUPING to achieve similar results.

The issues with WITH CUBE and WITH ROLLUP were that they were both non standard and not flexible enough. GROUPING SETS is ISO standard, simplifies writing these queries and performs better in comparison to UNION, also an ISO standard.

To help show this, I will use the table “Staff” which I created in a previous post on views and modify the table to include a region identifier relating to a new table called “Region”. I will also add data to the Staff table as well as modifying some of the salary values.

Modification of the view “vwStaff” which I set up in that post is required to include some new columns. The data in the view now looks like this and I will be wanting to get some aggregations based on average salary from this data.

SELECT * FROM vwStaff

StaffID     FirstName  LastName   Department Region     Salary
----------- ---------- ---------- ---------- ---------- -----------
1           John       Smith      Sales      South East 45000
2           Raymond    Jones      Finance    South East 40000
3           Tracey     Carter     Technology South East 42000
4           Andy       Milward    Sales      South West 41000
5           Joanne     Baldock    Finance    South West 37000
6           Simon      Jackson    Technology South West 40000
7           Greg       Benson     Sales      Midlands   47500
8           Jack       Andrews    Finance    Midlands   39000
9           James      Michaels   Technology Midlands   38000
10          Mark       Taylor     Sales      North West 39000
11          Frances    Keagan     Finance    North West 42000
12          Jane       Coleson    Technology North West 35000
13          Jennifer   Seymour    Sales      North East 35500
14          Rebecca    Morgan     Finance    North East 38000
15          Alex       Parker     Technology North East 38500

(15 row(s) affected)

To aggregate the data using GROUPING SETS, I have built this query below and you can see from the list that there are multiple result sets in the one list as defined in the GROUPING SETS part of the query. The “()” produces an aggregate for the entire result set based on the Salary column.

SELECT
Department
, Region
, AVG(Salary) AS AverageSalary
FROM vwStaff
GROUP BY
GROUPING SETS
(
, (Department,Region)
, (Department)
, (Region)
()
)

Which returns the following results

Department      Region          AverageSalary
--------------- --------------- -----------
Finance         Midlands        39000
Sales           Midlands        47500
Technology      Midlands        38000
NULL            Midlands        41500
Finance         North East      38000
Sales           North East      35500
Technology      North East      38500
NULL            North East      37333
Finance         North West      42000
Sales           North West      39000
Technology      North West      35000
NULL            North West      38666
Finance         South East      40000
Sales           South East      45000
Technology      South East      42000
NULL            South East      42333
Finance         South West      37000
Sales           South West      41000
Technology      South West      40000
NULL            South West      39333
NULL            NULL            39833
Finance         NULL            39200
Sales           NULL            41600
Technology      NULL            38700

(24 row(s) affected)

Not to be confused with WITH CUBE, the same query could be written using GROUP BY CUBE (also ISO compliant)

SELECT
Department
, Region
, AVG(Salary) AS AverageSalary
FROM vwStaff
GROUP BY
CUBE
(
Region
, Department
)

GROUP BY CUBE works by accepting a list of elements and then defining all possible GROUPING SETS from that supplied list. Personally I use GROUPING SETS but it is a matter of preference.

Before SQL 2008, this query would have been written using UNION like this

SELECT
Department
, Region
, AVG(Salary) AS AverageSalary
FROM vwStaff
GROUP BY
Department
, Region
UNION
SELECT
Department
, NULL AS Region
, AVG(Salary) AS AverageSalary
FROM vwStaff
GROUP BY
Department
UNION
SELECT
NULL AS Department
, Region
, AVG(Salary) AS AverageSalary
FROM vwStaff
GROUP BY
Region
UNION
SELECT
NULL AS Department
, NULL AS Region
, AVG(Salary) AS AverageSalary
FROM vwStaff

In regard to how both queries perform and without even looking at any performance statistics or execution plans, the obvious thing to note about the query using UNION is that there multiple calls to the base data as opposed to one call in the GROUPING SETS query. So you could estimate that there would be a 75% improvement in IO performance when using GROUPING SETS but the reality is not quite that exact.

To get some idea of the differences between the two, there are a couple of useful ways to draw comparisons

SET STATISTICS IO ON – tells SQL Server to display the amount of disk activity generated which produced these results

---GROUPING SETS
Table 'Department'. Scan count 0, logical reads 60, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Region'. Scan count 0, logical reads 60, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Staff'. Scan count 2, logical reads 4, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

---UNION
Table 'Department'. Scan count 0, logical reads 60, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0
Table 'Region'. Scan count 0, logical reads 120, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Staff'. Scan count 4, logical reads 8, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

It is worth nothing that the number of scans (index or table scans) is greater in the UNION statement with 50% more scans required than the GROUPING SETS operation. 50% less scans? That’s a nice performance gain.

Click the following links to see that the execution plan of the UNION query is considerably larger than the GROUPING SETS plan.

GROUPING SETS versus UNION conclusion

I can’t think of a reason not to modify any existing UNION code across to GROUPING SETS. It will simplify your code and perform faster. For large datasets, considerable performance benefits could be seen.

GROUPING SETS versus UNION conclusion

Related Posts:

About Andy Hayes

Reader Interactions

Comments

Leave a Reply Cancel reply