GROUPING SETS is a way of aggregating data to produce a result set containing different sets of columns.
For the SQL Server platform, it was first introduced in SQL 2008. Prior to SQL Server 2008, achieving this type of result set was possible using UNION, WITH CUBE, WITH ROLLUP and a helper function called GROUPING to achieve similar results.
The issues with WITH CUBE and WITH ROLLUP were that they were both non standard and not flexible enough. GROUPING SETS is ISO standard, simplifies writing these queries and performs better in comparison to UNION, also an ISO standard.
To help show this, I will use the table “Staff” which I created in a previous post on views and modify the table to include a region identifier relating to a new table called “Region”. I will also add data to the Staff table as well as modifying some of the salary values.
Modification of the view “vwStaff” which I set up in that post is required to include some new columns. The data in the view now looks like this and I will be wanting to get some aggregations based on average salary from this data.
SELECT * FROM vwStaff
StaffID FirstName LastName Department Region Salary ----------- ---------- ---------- ---------- ---------- ----------- 1 John Smith Sales South East 45000 2 Raymond Jones Finance South East 40000 3 Tracey Carter Technology South East 42000 4 Andy Milward Sales South West 41000 5 Joanne Baldock Finance South West 37000 6 Simon Jackson Technology South West 40000 7 Greg Benson Sales Midlands 47500 8 Jack Andrews Finance Midlands 39000 9 James Michaels Technology Midlands 38000 10 Mark Taylor Sales North West 39000 11 Frances Keagan Finance North West 42000 12 Jane Coleson Technology North West 35000 13 Jennifer Seymour Sales North East 35500 14 Rebecca Morgan Finance North East 38000 15 Alex Parker Technology North East 38500 (15 row(s) affected)
To aggregate the data using GROUPING SETS, I have built this query below and you can see from the list that there are multiple result sets in the one list as defined in the GROUPING SETS part of the query. The “()” produces an aggregate for the entire result set based on the Salary column.
SELECT Department , Region , AVG(Salary) AS AverageSalary FROM vwStaff GROUP BY GROUPING SETS ( , (Department,Region) , (Department) , (Region) () )
Which returns the following results
Department Region AverageSalary --------------- --------------- ----------- Finance Midlands 39000 Sales Midlands 47500 Technology Midlands 38000 NULL Midlands 41500 Finance North East 38000 Sales North East 35500 Technology North East 38500 NULL North East 37333 Finance North West 42000 Sales North West 39000 Technology North West 35000 NULL North West 38666 Finance South East 40000 Sales South East 45000 Technology South East 42000 NULL South East 42333 Finance South West 37000 Sales South West 41000 Technology South West 40000 NULL South West 39333 NULL NULL 39833 Finance NULL 39200 Sales NULL 41600 Technology NULL 38700 (24 row(s) affected)
Not to be confused with WITH CUBE, the same query could be written using GROUP BY CUBE (also ISO compliant)
SELECT Department , Region , AVG(Salary) AS AverageSalary FROM vwStaff GROUP BY CUBE ( Region , Department )
GROUP BY CUBE works by accepting a list of elements and then defining all possible GROUPING SETS from that supplied list. Personally I use GROUPING SETS but it is a matter of preference.
Before SQL 2008, this query would have been written using UNION like this
SELECT Department , Region , AVG(Salary) AS AverageSalary FROM vwStaff GROUP BY Department , Region UNION SELECT Department , NULL AS Region , AVG(Salary) AS AverageSalary FROM vwStaff GROUP BY Department UNION SELECT NULL AS Department , Region , AVG(Salary) AS AverageSalary FROM vwStaff GROUP BY Region UNION SELECT NULL AS Department , NULL AS Region , AVG(Salary) AS AverageSalary FROM vwStaff
In regard to how both queries perform and without even looking at any performance statistics or execution plans, the obvious thing to note about the query using UNION is that there multiple calls to the base data as opposed to one call in the GROUPING SETS query. So you could estimate that there would be a 75% improvement in IO performance when using GROUPING SETS but the reality is not quite that exact.
To get some idea of the differences between the two, there are a couple of useful ways to draw comparisons
SET STATISTICS IO ON – tells SQL Server to display the amount of disk activity generated which produced these results
---GROUPING SETS Table 'Department'. Scan count 0, logical reads 60, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Region'. Scan count 0, logical reads 60, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Staff'. Scan count 2, logical reads 4, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. ---UNION Table 'Department'. Scan count 0, logical reads 60, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0 Table 'Region'. Scan count 0, logical reads 120, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Staff'. Scan count 4, logical reads 8, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
It is worth nothing that the number of scans (index or table scans) is greater in the UNION statement with 50% more scans required than the GROUPING SETS operation. 50% less scans? That’s a nice performance gain.
Click the following links to see that the execution plan of the UNION query is considerably larger than the GROUPING SETS plan.
GROUPING SETS versus UNION conclusion
I can’t think of a reason not to modify any existing UNION code across to GROUPING SETS. It will simplify your code and perform faster. For large datasets, considerable performance benefits could be seen.
Filipe says
Very nice, congrats!!