DBA Diaries

Thoughts and experiences of a DBA working with SQL Server and MySQL

Andy Hayes is a DBA who has worked with SQL Server since version 7.0. He has a wonderful wife and two beautiful children. He loves database technology, playing cricket, and blogging, and he is passionate about sharing his experiences as a DBA and learning more to further his understanding and knowledge.

MySQL Replication Use Cases

Posted on February 6, 2016 Written by Andy Hayes

In my last post, I summarized when MySQL replication is typically implemented. I now want to expand on that and talk about MySQL replication use cases.

Scaling out using MySQL Replication

Let’s take a typical example: the application is read-heavy versus writes. The server is under heavy read load and you want to reduce it. You have optimized indexes and queries as far as you can, but the database is still struggling to keep up with demand. To alleviate the load, more read requests per second need to be processed.

You now have to make a decision: do you scale vertically with a larger server and hope the problem gets fixed that way, or do you scale horizontally and add extra servers?

Scaling vertically with a bigger server may address the problem. More CPUs, faster disks and so on could well remove the bottleneck. However, this involves a migration exercise from the old server to the new, which can be a lengthy operation.

If the application is supporting a business which is experiencing growth, and therefore ever-increasing load, repeating the migration each time (assuming hardware faster than what you already have even exists) is not a viable long-term solution.

By scaling horizontally and implementing MySQL replication, you can add additional servers and load balance read requests across them. Want more read throughput? Add more replicas.

I’ve used a read-heavy application as an example here. If you want more write throughput, MySQL replication is not the solution; some form of sharding or HA cluster solution would be needed instead.

Reducing the Impact of Backups Taken by mysqldump

If you use the out-of-the-box tool for taking backups of your databases – mysqldump – you may have noticed that it can cause locking problems while it takes the backup.

One way to get around this is to add --single-transaction to the command, but if you have a replica of your databases available on another server, you may choose to take backups of your MySQL databases from the slave, taking some of the load away from the master server.

After all, in a MySQL master/slave setup there can be multiple slaves but only one master, so backing up a slave rather than the master makes logical sense.
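
To make this concrete, here is a hedged sketch of what a backup from the slave might look like – the host name "replica1" and the "backup" user are made-up example values:

mysqldump -h replica1 -u backup -p --single-transaction --all-databases > backup.sql

For InnoDB tables, --single-transaction gives a consistent snapshot without locking tables for the duration of the dump. If the slave holds non-transactional tables, another option is to run STOP SLAVE SQL_THREAD; before the dump and START SLAVE SQL_THREAD; afterwards, so the slave's data is frozen while the backup runs.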

Facilitating Reporting Using MySQL Replication

The business wants information from its data, and the data is in the production databases. Running reporting queries against those databases can be heavy and can affect the performance of the main application, due to the need to join many tables and aggregate lots of rows.

You could schedule your reports to run off-peak via cron or a Windows scheduled task. But wait – the business wants these reports sooner; it doesn’t want to wait many hours for this information. The data analysts need them today!

This is another great reason to set up MySQL replication. The replica can be used for heavy reporting tasks – whether these are reports within the application, ad-hoc queries from data analysts or ETL jobs – and because the replica is not used by the production application, these tasks have no impact on it. Everyone is happy 🙂

Providing Failover/High Availability

It’s worth noting that MySQL replication is not a complete high availability solution due to the master/slave configuration. If the master goes down, for example, there is a bit of work needed to promote an existing slave to become the master. For more information on a complete HA solution for MySQL, check out MySQL Cluster.
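
To give a feel for the work involved, a manual promotion might look roughly like this – a sketch only, assuming a single slave which has already received everything the master wrote:

-- on the slave being promoted
STOP SLAVE IO_THREAD;
-- wait until the SQL thread has applied everything in the relay log
-- (SHOW PROCESSLIST should report 'Slave has read all relay log')
STOP SLAVE;
RESET SLAVE ALL;  -- discard the old replication configuration
-- finally, repoint the application (and any other slaves) at this server

In practice, careful checks and tooling are needed around each of these steps, which is exactly why this is not a hands-off high availability solution.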

In my mind, when I think high availability, database services are not interrupted at any level. Instead, we could say that MySQL replication provides increased availability, or higher availability, because of the replicas that exist and can be used in the event of a failover.

For businesses which can suffer some brief interruption whilst the master is promoted, MySQL replication increases database availability and fulfills a need. With more investment, availability increases. The passage I am quoting below summarizes this perfectly:

Each architecture used to achieve highly available database services is differentiated by the levels of uptime it offers. These architectures can be grouped into three main categories:

  • Data Replication.
  • Clustered & Virtualized Systems.
  • Shared-Nothing, Geographically-Replicated Clusters.

In my next post, I am going to look more closely at the replication event types.


What is MySQL Replication and How Does It Work?

Posted on January 31, 2016 Written by Andy Hayes

MySQL replication (a.k.a. MySQL database replication) provides the facility to make replicas of databases. The ability to make exact copies of databases and keep them in near real-time sync as changes are made at the “master” provides a number of advantages.

In summary these are:

  • Scaling out a database application
  • Reducing database backup impact
  • Facilitating reporting without affecting production load
  • Failover/High Availability

I’m going to go into detail about these in my next post.

How does MySQL Replication Work?

Firstly we have to understand the architecture. At a minimum, in order to establish MySQL database replication, we have a master and a slave.

Typically, these would be two different servers.

The master is where all the changes happen. All database updates occur here, from adding, updating or deleting table records to creating functions, stored procedures or making table changes.

The slave server receives a copy of the changes applied at the master server. This all happens very quickly, so that the slave stays closely in sync with the master (replication is asynchronous by default, so the slave can lag slightly behind).

But how does this happen?

Replication events are written to a special log called the binary log on the master. The events are then read by the slave (a.k.a. the replica) and applied to the databases on the slave server.
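
To make the moving parts concrete, here is a minimal, hedged sketch of how a master and slave get wired together – the server IDs, host name, user and password are all made-up example values:

# on the master, in my.cnf (the slave needs its own unique server-id)
[mysqld]
server-id = 1
log-bin   = mysql-bin

-- on the master, create a user for the slave to connect as
CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

-- on the slave, point it at the master's binary log and start replicating
CHANGE MASTER TO
    MASTER_HOST = 'master.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS = 4;
START SLAVE;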

Replication Events

There are two types:

Statement based – the write operations themselves; they are written to the binary log as the SQL statements that were executed, and replayed on the slave.

Row based – the resulting row changes; they are written to the binary log as complete copies of the changed rows.

I will cover the differences between these, and their advantages and disadvantages, in a later post. When replication is set up, there is a choice between statement based, row based or mixed for the binlog_format variable, which determines how the events are written to the binary log.
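
The format in use can be checked, and changed, at runtime – for example:

-- see how events are currently being written to the binary log
SHOW VARIABLES LIKE 'binlog_format';

-- switch the server to row-based logging (do this with care on a busy master)
SET GLOBAL binlog_format = 'ROW';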

On the master

As updates occur, the events are written sequentially to the binary log for the slave to read later. These writes are buffered on the master, which improves performance both when writing them to the binary log and when a slave that is keeping up reads the recent events back.

The binlog dump thread – when the slave connects to the master, the master opens up a thread for that connection. This thread is just like any other connection thread, except that it is typically opened by a dedicated user set up for the replication process, and its job is simply to notify the slave of changes and send it the contents of the binary log as they arrive.
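
You can see which binary log file the master is currently writing to, and at what position, by running SHOW MASTER STATUS. The output will look something along these lines (values illustrative):

SHOW MASTER STATUS;

-- File              Position
-- mysql-bin.000042  107245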

On the slave

When replication is started, two threads are created on the slave:

The IO thread – this thread connects to the master and reads the events as they come in. The replication events are then written to a local log file called the relay log, which will be read by the second thread created on the slave.

The SQL thread – this thread reads the relay log (written by the IO thread) and applies the changes to the database as quickly as possible.

If you want to check the status of the IO thread, run “show slave status\G” on the slave:

  • Master_Log_File – the binary log file on the master that the IO thread is currently reading from.
  • Read_Master_Log_Pos – the position within that binary log file up to which the IO thread has read and copied into the relay log on the slave.

If you want to see the status of the SQL thread on the slave, again using “show slave status\G” , check the following:

  • Relay_Master_Log_File – the binary log file from the master containing the most recent event that the SQL thread has executed.
  • Exec_Master_Log_Pos – the position in that binary log up to which the SQL thread has executed.
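
Putting that together, the relevant section of the output from a healthy slave will look something along these lines (trimmed to the fields discussed, values illustrative):

SHOW SLAVE STATUS\G

--      Slave_IO_Running: Yes
--     Slave_SQL_Running: Yes
--       Master_Log_File: mysql-bin.000042
--   Read_Master_Log_Pos: 107245
-- Relay_Master_Log_File: mysql-bin.000042
--   Exec_Master_Log_Pos: 107245

When the slave has fully caught up, Exec_Master_Log_Pos matches Read_Master_Log_Pos.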

I hope you found this post useful. Please share if you liked it. Thanks 🙂


T-SQL – How to Select Top N Rows for Each Group Using ROW_NUMBER()

Posted on July 11, 2015 Written by Andy Hayes

I’ve recently been working on a data migration project and have found myself utilizing the ROW_NUMBER() function in SQL Server a lot. This function has been around since SQL Server 2005 and, at its core, provides sequential numbering for the rows returned by a query.

One of the requirements for an export process was to return the most recent two orders for each customer account and output them to a CSV file.

As well as sequential numbering, ROW_NUMBER() provides a way to partition the results returned, so this suited me just fine 🙂

Let’s take a look at the problem in more detail. I’m going to use our good friend the AdventureWorks database to help demonstrate.

The first query returns the data by customer and with the most recent orders for each customer at the top.

SELECT SalesOrderID
      ,OrderDate
      ,SalesOrderNumber
      ,AccountNumber
      ,CustomerID
      ,SubTotal
      ,TaxAmt
      ,TotalDue
  FROM [AdventureWorks2012].[Sales].[SalesOrderHeader] 
ORDER BY CustomerID, OrderDate DESC

Here are the results of this query sampled.

[Screenshot: sample results – rows ordered by CustomerID with the most recent orders first]

How to return the top two rows for each group in the result set?

We will use ROW_NUMBER() to place a sequential id on the results, but also partition them so that the sequential id resets for each new CustomerID value.

SELECT SalesOrderID
      ,OrderDate
      ,SalesOrderNumber
      ,AccountNumber
      ,CustomerID
      ,SubTotal
      ,TaxAmt
      ,TotalDue
    ,ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowNum
FROM [AdventureWorks2012].[Sales].[SalesOrderHeader]

This returns a new column called RowNum, which provides the sequential numbering for each instance of CustomerID, based on the PARTITION BY and ORDER BY in the OVER clause.

[Screenshot: sample results with the RowNum column restarting at 1 for each CustomerID]

All that remains now is to select the first two rows for each CustomerID, so I am wrapping the query up in a CTE and returning the first two rows using the WHERE clause.

WITH MyRowSet
AS
(
SELECT SalesOrderID
      ,OrderDate
      ,SalesOrderNumber
      ,AccountNumber
      ,CustomerID
      ,SubTotal
      ,TaxAmt
      ,TotalDue
    ,ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowNum
FROM [AdventureWorks2012].[Sales].[SalesOrderHeader] 
)

SELECT * FROM MyRowSet WHERE RowNum <= 2

Which returns our desired result…

[Screenshot: final results – the two most recent orders per CustomerID]
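
One caveat worth noting: ROW_NUMBER() breaks ties arbitrarily, so if two orders for a customer shared the same OrderDate, only one of them would make the cut. If you wanted ties to be included, you could swap in RANK(), which gives tied rows the same number – a sketch:

WITH MyRowSet
AS
(
SELECT SalesOrderID
      ,OrderDate
      ,CustomerID
    ,RANK() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowNum
FROM [AdventureWorks2012].[Sales].[SalesOrderHeader]
)

SELECT * FROM MyRowSet WHERE RowNum <= 2

With RANK() you may get more than two rows per customer when ties exist, so choose whichever behaviour fits the requirement.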

I hope you found this post useful. 🙂


Data Management – Finding, Removing Duplicate Rows Using SQL and Some Prevention Tips

Posted on July 4, 2015 Written by Andy Hayes

Duplicate data is a common problem and in my experience comes in two main flavours: exact duplicate data and similar data. Exact duplicate data is when a data source contains the same values across all fields in a record. Similar data is when some of the fields match and, to the naked eye, the records can be classed as duplicates, but in fact they aren’t – the data is merely similar, and therefore each record is unique.

In the case of similar data, an example might be a file which has been supplied containing names and addresses. The file has been maintained in a spreadsheet and by more than one person. For whatever reason, the same name and address has been entered on two rows, except one editor did not include the postal code for the address. As a result, the rows are similar rather than exact duplicates of each other – but for all practical purposes, they are duplicate data.

Typically when loading data into a database table with unique constraints, de-duping the incoming data is a necessary step. For the purposes of this article, we will cover identifying and addressing exact duplicate data.

Let’s imagine a scenario where you have been asked to de-duplicate a list of item codes and the data looks like this:

ItemCode
123456789
123456789
223344789
223344789
987554321
987554321

CREATE TABLE #ItemCodes
(
  ItemCode VARCHAR(50)
);

INSERT INTO #ItemCodes
VALUES
('123456789'),
('123456789'),
('223344789'),
('223344789'),
('987554321'),
('987554321');

Identifying Duplicate Rows Using SQL

The first step is to find the duplicate data. This can be easily achieved using a simple SELECT query in combination with GROUP BY and HAVING. If we loaded this data into a staging table with no constraints, we could proceed to query the data like this:

SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM #ItemCodes
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1

The resulting output reduces our 6 records to 3, because the HAVING clause only returns records where ItemCode occurs more than once.

How to De-duplicate Duplicate Data

Now that we have identified the duplicate data, we can do something about it. We also know how many duplicates we are dealing with. With a 6 row table it would be easy just to remove the rows manually 🙂 However, these tasks are typically more complicated, involving many more rows than this.

The first step is to make each row unique, and the easiest way to do this is to add an identity column to the data. If the duplicate data is already in a production table, adding a unique identifier is still the way to go, but this might mean exporting the data to a new table first, deleting it from the source table, and re-importing it after de-duplication has happened.

If the data is being loaded into a staging table then it is easy enough to add a column for the unique identity.

ALTER TABLE #ItemCodes 
ADD ID INT IDENTITY(1,1) NOT NULL;

Querying the data now, it is unique:

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
123456789                                          2
223344789                                          3
223344789                                          4
987554321                                          5
987554321                                          6

(6 row(s) affected)

Let’s now delete that duplicate data!

We’re ready to remove the duplicate data from our table. This is done using a DELETE statement and a self join, like this:

DELETE t1
FROM #ItemCodes t1
JOIN #ItemCodes t2 
ON t1.ItemCode = t2.ItemCode 
AND t1.ID > t2.ID;

(3 row(s) affected)

SELECT * FROM #ItemCodes;

ItemCode                                           ID
-------------------------------------------------- -----------
123456789                                          1
223344789                                          3
987554321                                          5

(3 row(s) affected)

We’re now left with 3 rows instead of 6, so our item codes are now unique and can be loaded into our main database table. As you can see, IDs 2, 4 and 6 were removed by the join condition t1.ID > t2.ID.
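
As an aside, the same de-duplication can be achieved without adding an identity column at all, by deleting through a CTE that numbers the rows within each group using ROW_NUMBER() – a sketch:

WITH Numbered
AS
(
SELECT ItemCode
      ,ROW_NUMBER() OVER (PARTITION BY ItemCode ORDER BY ItemCode) AS RowNum
FROM #ItemCodes
)

DELETE FROM Numbered WHERE RowNum > 1;

Deleting through the CTE removes the underlying rows in #ItemCodes, keeping only the first row of each ItemCode group.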

Beware of large data volumes and performance degradation!

In an older post, I wrote about deleting rows from a table in a way which did not negatively affect performance too much. It’s worth a read, as it illustrates what can happen when large transactions are issued to remove data. If you are dealing with large volumes of data, consider breaking the delete operation down into smaller parts to reduce the impact on your system while other users are querying the server.

Prevention is better than the cure!

Duplicate data can cost your business a lot of money in lost productivity time and incorrect results. Cleaning up the data is an expensive operation and can be ongoing until the source of the problem is addressed. Rather than continually repeating de-duplication tasks, communicate to the business the need to tackle the problem at the source.

As I referred to earlier, there are two forms of duplicate data: similar and exact. Preventing exact duplicate data means looking at your database and ensuring that the correct constraints are in place to prevent duplicate values – namely primary keys and unique column constraints.

Tackling similar duplicate data requires a combination of appropriate constraints at both the database and application layers. For example, if you have invested in a CRM system – whether written in-house or purchased off the shelf or in the cloud – make sure that it has sufficient duplicate detection in place to at least warn the user, when entering a record, that it may be a duplicate.

If you have an application that has been written in-house for some function such as customer entry, ensure that it includes address lookup functionality, so that address information is entered cleanly and the chance of mistyped addresses – and hence similar data – is minimized.

Be proactive!

Similar duplicate data can creep into your system because it passes the application and database checks. I refer back to the example earlier in the post where a customer record existed twice, once with and once without a postal code. It’s important to realize early that a proactive approach to managing duplicates can reap long-term benefits, reducing the overall cost to the business.

Some systems provide reports which can help you identify duplicate data. Alternatively, you can run some SQL to identify duplicate rows in your tables and take the appropriate steps. Let’s go back to our earlier example, where we used GROUP BY with HAVING, and enhance it.

This query uses an Items table holding item codes and descriptions, along with when each item was added and by whom. The SQL below wraps the GROUP BY in a CTE and performs a JOIN back to the Items table to retrieve all fields for rows which share a duplicate item code. This helps provide a view of similar data with duplicate item codes.

CREATE TABLE Items
(
  ItemCode VARCHAR(50)
, DateAdded DATETIME
, Item VARCHAR(50)
, AddedByUser VARCHAR(10)
);

WITH Dupes
AS
(SELECT 
 ItemCode
 , COUNT(ItemCode) AS Duplicates
FROM Items
GROUP BY ItemCode
HAVING COUNT(ItemCode) > 1)

SELECT * 
FROM Items
JOIN Dupes 
ON Dupes.ItemCode = Items.ItemCode;

I hope you find this post useful. 🙂

