DBA Diaries

Thoughts and experiences of a DBA working with SQL Server and MySQL

Using SQL COALESCE to Find the First Non-NULL Value

Posted on May 10, 2017 Written by Andy Hayes Leave a Comment

You are looking to find a way to find the first non-null value from a list of fields. In this post, we look at SQL COALESCE – a wonderfully useful tool that helps to solve that problem.

To help demonstrate this, we will talk about a hypothetical scenario where a number of customer leads have been added to a table from different sources and the consistency of data found in some of the key fields is sparse.

Here is the data, the table is called “Leads”

SQL COALESCE example

For the purposes of this example, the task is to pull out a single column list of phone numbers for the leads in the table and that involves finding the first non-null value.

Here is what you do:

SELECT 
  COALESCE(HomePhone, BusinessPhone, MobilePhone) AS Phone 
FROM Leads;

This produces a list:

SQL COALESCE result

You can see that it has taken the first phone number (non-null value) found based on the order of columns passed into COALESCE.

It is possible to write this in another way using SELECT CASE

SELECT 
 CASE WHEN HomePhone IS NOT NULL THEN HomePhone
      WHEN BusinessPhone IS NOT NULL THEN BusinessPhone
      WHEN MobilePhone IS NOT NULL THEN MobilePhone
   END AS Phone
FROM Leads;

The result is the same but it takes longer to write the code so using COALESCE in your SQL statement is certainly more efficient in this case – less is more! 🙂

Filed Under: All Articles, SQL Tips and Tricks Tagged With: mysql, sql, sql server, t-sql

Data Management – Finding, Removing Duplicate Rows Using SQL and Some Prevention Tips

Posted on July 4, 2015 Written by Andy Hayes 2 Comments

find and remove duplicate dataDuplicate data is a common problem and in my experience comes in two main flavours; exact duplicate data and similar data. Exact duplicate data is when a data source contains the same values across all fields in the record. Similar data is when some of the fields match and to the naked eye can be classed as duplicate data but in fact it isn’t – it’s similar data and therefore it is unique.

In the case of similar data, an example might be where a file has been supplied which contains names and addresses. The file has been maintained in a spreadsheet and by more than one person. For whatever reason, the same name and address has been entered twice on two rows, except one editor did not include the postal code for the address. As a result, the rows are similar but not exact duplicates of each other. Apart from this, they are duplicate data.

Typically when loading data into a database table with unique constraints, de-duping the incoming data is a necessary step. For the purposes of this article, we will cover identifying and addressing exact duplicate data.

Lets imagine a scenario where you have been asked to de-duplicate a list of item codes and the data looks like this:

ItemCode
123456789
123456789
223344789
223344789
987554321
987554321

CREATE TABLE #ItemCodes
(
  ItemCode VARCHAR(50)
);

INSERT INTO #ItemCodes
VALUES
('123456789'),
('123456789'),
('223344789'),
('223344789'),
('987554321'),
('987554321');

Identifying Duplicate Rows Using SQL

The first step is to find the duplicate data. This can be easily achieved using a simple select query in combination GROUP BY and HAVING. If we loaded this data into a staging table with no constraints, we could proceed to query the data like this:

SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM #ItemCodes
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1

The resulting output reduces our 6 records to 3 because the the HAVING clause only returns the records that have more than one instance of ItemCode occurring.

How to De-duplicate Duplicate Data

Now that we have identified the duplicate data, we can do something about it. We also know how many duplicates we are dealing with. In the instance of a 6 row table, it could be easy just to manually remove the rows 🙂 However these tasks are typically more complicated involving many more rows than this.

The first step is to make the data unique and the easiest way to do this is add some unique identity column to the data. If the duplicate data is already in a table in production, adding a unique identifier is still the way to go but this might mean exporting the data first to a new table, deleting it from the source table and then re-importing after de-duplication has happened.

If the data is being loaded into a staging table then it is easy enough to add a column for the unique identity.

ALTER TABLE #ItemCodes 
ADD ID INT IDENTITY(1,1) NOT NULL;

Querying the data now, it is unique:

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
123456789                                          2
223344789                                          3
223344789                                          4
987554321                                          5
987554321                                          6

(6 row(s) affected)

Lets now delete that duplicate data!

We’re ready to remove the duplicate data from our table. This is done using a DELETE statement and a self join, like this:

DELETE t1
FROM #ItemCodes t1
JOIN #ItemCodes t2 
ON t1.ItemCode = t2.ItemCode 
AND t1.ID > t2.ID;

(3 row(s) affected)

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
223344789                                          3
987554321                                          5

(3 row(s) affected)

We’re now left with 3 rows, instead of 6. So our ItemCodes are now unique and can be loaded into our main database table. As you can see, ID’s 2, 4 and 6 were removed within the JOIN where t1.ID > t2.ID

Beware of lots of data and performance degeneration!

In an older post, I wrote about deleting rows from a table in a way which did not negatively effect performance too much. It’s worth a read as it illustrates what can happen if large transactions are issued in order to remove data. If you are dealing with large volumes of data, consider how to break the delete operation down into smaller parts to reduce the impact on your system if other users are querying the server.

Prevention is better than the cure!

Duplicate data can cost your business a lot of money in lost productivity time and incorrect results. Cleaning up the data is an expensive operation and can be ongoing until the source of the problem is addressed. Rather than continually repeating de-duplication tasks, communicate to the business the need to tackle the problem at the source.

As I referred to earlier, there are two forms of duplicate data, similar and exact. Preventing exact duplicate data means looking at your database and ensuring that the correct constraints are in place to prevent duplicate values. Such constraints being primary keys and unique column constraints.

When tackling similar duplicate data this can be a combination of ensuring that there are the appropriate constraints in place at both the database and application layers. For example if you have invested in a CRM system, whether written in-house of purchased off the shelf/cloud, make sure that it has sufficient duplicate detectors in place to at least warn the user that when entering a record, that there is a chance that the record is a duplicate.

If you have an application that has been written in-house for some function such as customer entry, ensure that there is address lookup functionality to help ensure that address information is entered in a clean way to help minimize the chances of mistyped address data, resulting in similar data.

Be proactive!

Similar duplicate data can creep into your system because it passes the application and database checks. I refer back to the example earlier in the post when a customer record existed twice, one with and one without a postal code. It’s important to realize early that a proactive approach to managing the duplicates can reap long term benefits, reducing overall cost to the business because the duplicates are addressed in a proactive way.

Some systems will provide reports which can help you identify duplicate data. Alternatively, you can run some SQL to identify duplicate rows in your tables and take the appropriate steps. Lets go back to our earlier example where we utilized the GROUP BY HAVING SQL but this time we will enhance it.

There is a table used in this query to contain items with descriptions, when they were added and by who. The SQL below shows the results of the GROUP BY and performs a JOIN back to the Items table to retrieve all fields for rows which have duplicate item codes. This helps provide a view of similar data with duplicate item codes.

CREATE TABLE Items
(
  ItemCode VARCHAR(50)
, DateAdded DATETIME
, Item VARCHAR(50)
, AddedByUser VARCHAR(10)
);

WITH Dupes
AS
(SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM Items
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1)

SELECT * 
FROM Items
JOIN Dupes 
ON Dupes.ItemCode = Items.ItemCode;

I hope you find this post useful. 🙂

Filed Under: All Articles, SQL Tips and Tricks Tagged With: data management, sql, sql server, t-sql

T-SQL – Derived Tables Demonstrated and Explained

Posted on October 18, 2014 Written by Andy Hayes Leave a Comment

T-SQL is a very powerful language when dealing with sets and in this post, I am going to be demonstrating how to use derived tables.

Derived tables are not the same as a temporary or physical tables where the fields and data types are declared. Nor are they subqueries. Instead, as the name implies, the table is derived from an existing table. I liken them to views but a view is an object that is created and is re-usable. A derived table is only available at execution time and not a re-usable object within the query. In other words, you cannot reference it multiple times like you can with common table expressions (CTE’s).

Basic syntax of the derived table

SELECT Field1, Field2
FROM
(SELECT *
FROM Table) Table1

So lets put the syntax to something more real:

USE AdventureWorks2012
GO
SELECT TOP 10 FirstName, LastName
FROM
(SELECT * FROM Person.Person) AS PersonDerivedTable
WHERE LastName = 'Smith';
FirstName       LastName
--------------- ---------------
Abigail         Smith
Adriana         Smith
Alexander       Smith
Alexandra       Smith
Alexis          Smith
Allen           Smith
Alyssa          Smith
Andre           Smith
Andrew          Smith
Arthur          Smith

(10 row(s) affected)

You might be thinking that you can achieve the same thing without the derived table and you would be correct. However where derived tables start to become really useful are when you want to start using aggregated statements.

Derived tables used with aggregated queries

For example, the following query aggregates total sales orders by year as one query and combines the results of the total sales orders per sales rep, per year of the second query. The purpose is to get the percentage of sales orders by sales rep for each year.

USE AdventureWorks2012
GO

SELECT YearTotalsBySalesPerson.*
, YearTotals.TotalOrdersForYear
, CONVERT(DECIMAL(10,2), 100 * CONVERT(DECIMAL(10,2), YearTotalsBySalesPerson.SalesPersonTotalOrders)
/ CONVERT(DECIMAL(10,2), YearTotals.TotalOrdersForYear)) AS PercentageOfSalesForYear
FROM
--total sales orders per year
(SELECT YEAR(OrderDate) AS [Year]
, COUNT(SalesOrderID) AS TotalOrdersForYear
FROM Sales.SalesOrderHeader SOH
WHERE SalesPersonID IS NOT NULL
GROUP BY YEAR(OrderDate)) YearTotals

JOIN
--total sales orders by sales rep per year
(SELECT YEAR(OrderDate) AS [Year]
, SalesPersonID
, COUNT(SalesOrderID) AS SalesPersonTotalOrders
FROM Sales.SalesOrderHeader
WHERE SalesPersonID IS NOT NULL
GROUP BY YEAR(OrderDate), SalesPersonID) YearTotalsBySalesPerson

ON YearTotals.[Year] = YearTotalsBySalesPerson.[Year]

Returning the results (sample)

Year        SalesPersonID SalesPersonTotalOrders TotalOrdersForYear PercentageOfSalesForYear
----------- ------------- ---------------------- ------------------ ---------------------------------------
2008        283           41                     901                4.55
2008        277           97                     901                10.77
2006        289           84                     1015               8.28
2007        279           158                    1524               10.37
2006        278           69                     1015               6.80
2007        285           9                      1524               0.59
2006        275           139                    1015               13.69
2008        286           56                     901                6.22
2005        274           4                      366                1.09
2008        280           12                     901                1.33
2005        282           51                     366                13.93

Derived tables can simplify code writing

Take the following example query which uses a CASE statement combined with GROUP BY

SELECT YEAR(OrderDate) AS [Year]
,CASE WHEN MONTH(OrderDate) IN (1,2,3) THEN 1
WHEN MONTH(OrderDate) IN (4,5,6) THEN 2
WHEN MONTH(OrderDate) IN (7,8,9) THEN 3
WHEN MONTH(OrderDate) IN (10,11,12) THEN 4 END AS [Quarter]
,COUNT(SalesOrderID) AS Orders
FROM Sales.SalesOrderHeader
GROUP BY
YEAR(OrderDate)
,CASE WHEN MONTH(OrderDate) IN (1,2,3) THEN 1
WHEN MONTH(OrderDate) IN (4,5,6) THEN 2
WHEN MONTH(OrderDate) IN (7,8,9) THEN 3
WHEN MONTH(OrderDate) IN (10,11,12) THEN 4 END
ORDER BY [Year], [Quarter]

This can simplified with a derived table.

SELECT [Year], [Quarter], COUNT(SalesOrderID) AS Orders
FROM
(SELECT YEAR(OrderDate) AS [Year]
,CASE WHEN MONTH(OrderDate) IN (1,2,3) THEN 1
WHEN MONTH(OrderDate) IN (4,5,6) THEN 2
WHEN MONTH(OrderDate) IN (7,8,9) THEN 3
WHEN MONTH(OrderDate) IN (10,11,12) THEN 4 END AS [Quarter]
,SalesOrderID
FROM Sales.SalesOrderHeader) YearQuarter
GROUP BY [Year], [Quarter]
ORDER BY [Year], [Quarter]

You can see that there is no need to write the CASE statement twice for the second version.

Summary

  • A derived table cannot be re-used within a query. Consider using a CTE or view in that instance.
  • A derived table is not a physical object within the database, it only exists at execution time.
  • Derived tables are not exclusive to SQL Server, other RDBMS systems support them, MySQL, Oracle, etc.
  • They are great for performing extra calculations on aggregated queries by enabling you to join sets of data together.
  • Can be used to help simply code writing, for example GROUP BY CASE WHEN statements as above.

Filed Under: All Articles, SQL Tips and Tricks Tagged With: sql, sql server, t-sql

How to Determine SQL Server Version Number In Various Ways

Posted on September 1, 2012 Written by Andy Hayes 1 Comment

determine sql server versionIt’s important to be able to determine SQL Server version number. DBA’s need to know what SQL Server version numbers they are supporting so that during patching windows, they can source and apply the correct patches or service packs.

This information is useful as a reference to what issues may be contained in a certain build that could affect production. The DBA can quickly reference the latest build list with their current SQL Server version number and work out where their build falls on the list and which patches to apply.

Here is the latest SQL Server 2012 build list which also contains a link to lists for previous versions. Please note that you have to register on SQL Server Central if you are not already a member to view this. (I’d recommend that you do register as it is a brilliant forum 🙂 )

Ways to determine SQL Server version number

All of these methods are straightforward so lets take a look.

Method 1 – “SELECT @@VERSION”

Simply open up a new query and run the following:

SELECT @@VERSION

The output will look something like:

Microsoft SQL Server 2012 - 11.0.2100.60 (X64) 
	Feb 10 2012 19:39:15 
	Copyright (c) Microsoft Corporation
	Express Edition (64-bit) on Windows NT 6.1  (Build 7601: Service Pack 1)

Method 2 – “xp_msver”

I like this one as it provides quite a lot of useful information about your SQL Server installation and server configuration

EXEC xp_msver

Sample output from my SQL Server 2012 install…
determine sql server version number

Method 3 – “SELECT SERVERPROPERTY”

You run SELECT SERVERPROPERTY against your server and pass in some parameters to get back the information you want.

SELECT
SERVERPROPERTY('productversion') AS ProductVersion
, SERVERPROPERTY ('productlevel') AS ProductLevel
, SERVERPROPERTY ('edition') AS ProductEdition

Which produces the following output…

ProductVersion       ProductLevel         ProductEdition
-------------------- -------------------- ------------------------------
11.0.2100.60         RTM                  Express Edition (64-bit)

(1 row(s) affected)

Method 4 – Use Management Studio (Client Tools)

You can access the Help menu in Management Studio and goto “About” to ascertain which SQL Server versions of the client tools you have installed on your workstation or SQL Server.

Summary – How to determine SQL Server version

So there are some ways to find out which versions of your client tools and server engine you are running. There may be others and if you know of them, please do leave a comment.

Filed Under: All Articles, SQL Server Administration Tagged With: sql, sql server

Categories

  • All Articles (82)
  • Career Development (8)
  • MySQL Administration (18)
  • MySQL Performance (2)
  • SQL Server Administration (24)
  • SQL Server News (3)
  • SQL Server Performance (14)
  • SQL Server Security (3)
  • SQL Tips and Tricks (19)

Top 10 Popular Posts

  • Using sp_change_users_login to fix SQL Server orphaned users
  • How to shrink tempdb
  • MySQL SHOW USERS? – How to List All MySQL Users and Privileges
  • How to Transfer Logins to Another SQL Server or Instance
  • How to Delete Millions of Rows using T-SQL with Reduced Impact
  • T-SQL – How to Select Top N Rows for Each Group Using ROW_NUMBER()
  • New T-SQL features in SQL Server 2012 – OFFSET and FETCH
  • How to Kill All MySQL Processes For a Specific User
  • Using exec sp_who2 to help with SQL Server troubleshooting
  • How to move tempdb

Recent Posts

  • How to Setup MySQL Master Master Replication
  • How To Use SQL to Convert a STRING to an INT
  • How to set up MySQL Replication Tutorial
  • How to Use SQL CASE for Conditional Logic in Your SQL Queries
  • Using ISNULL in SQL Server to Replace NULL Values

Search

Connect

  • Twitter
  • Facebook
  • Google+
  • RSS

About

  • Cookie Policy
  • Disclaimer
  • About
Copyright © ‘2021’ DBA Diaries built on the Genesis Framework

This site uses cookies. We assume you are happy with cookies but click the link if you are not. Close