Monday, July 24, 2017

SQL - Delete Duplicate Rows


This blog will demonstrates how to delete duplicate rows by using SQL Row_Number() function in sql server.

Here is a Student table which have few duplicate records (in yellow color)

ID
Code
Address
Created Date
1
CN01
Memphis, TN
7/1/2017
2
CN02
Nashville, TN
7/9/2017
3
CN03
Cordova, TN
7/10/2017
4
CN01
Memphis, TN
7/11/2017


First we need to decide, which row we are going to keep and which one we are going to delete.

Here is an example to delete all duplicate records except the oldest record.

Step 1: Get duplicate student records – here is a sub query which group by code and select only those code which are having more than 1 records (duplicate records)

Select Code, count(Code) [Ct] from vr.Student group by Code
having count(Code) > 1

Step 2: Generate Row Number by using SQL ROW_NUMBER() function for each duplicate records

With CTE AS
(
Select y.* , ROW_NUMBER() OVER(PARTITION BY y.Code ORDER BY y.Code, y.ID) AS RowRank from 
(
Select Code, count(Code) [Ct] from dbo.Student group by Code
having count(Code) > 1
) y
)
Select * from CTE

Output: 

ID
Code
Address
Created Date
RowRank 
1
CN01
Memphis, TN
7/1/2017
1
4
CN01
Memphis, TN
7/11/2017
2

In above output, the oldest duplicate row has low row rank and now we can delete all duplicate records except the oldest record (row rank = 1)

Delete from dbo.Student where ID IN (Select ID from CTE where CTE.RowRank <> 1)

Here is a complete SQL Query


With CTE AS
(
  Select y.* , ROW_NUMBER() OVER(PARTITION BY y.Code ORDER BY y.Code, y.ID) AS RowRank  from 
(
Select Code, count(Code) [Ct] from dbo.Student group by Code
having count(Code) > 1

)

Delete from dbo.Student where ID IN (Select ID from CTE where CTE.RowRank <> 1)

No comments:

SQL Server - Identify unused indexes

 In this blog, we learn about the index usage information (SYS.DM_DB_INDEX_USAGE_STATS) and analyze the index usage data (USER_SEEKS, USER_S...