Useful SQL
For finding Duplicates

Discussion:

While this can be slow to run as it usually has to read in the entire table, then sort it, then find duplicates, as an occasional run it is useful.

So please don't use it in a KPI! But do use it in a debugging, problem solving report.

Why is this hard for beginners?

If all you cared about was finding duplicates, you could enter

SELECT assetname
   FROM asset
   WHERE assetName like 'C%'
   GROUP BY assetname
   HAVING COUNT(*) > 1

This will return you a set showing the duplicates. But it will only show you the value of the duplicate column(s), not any other columns, and usually you need to know more about each column to do anything useful with the information of the duplicates. Usually, but not always. If just knowing which values are duplicates, then the above is perfect and you can stop at easy.

But usually you want some context, some of the other values. The 1st 'natural' thought most people have is to try this:

-- note: This code sample does NOT work, see explaination
SELECT assetname, assetID
    FROM asset
WHERE assetName like 'C%'
GROUP BY assetname
    HAVING COUNT(*) > 1

And SQL will return the error:

image.png

Which is pretty obvious what the problem is, but doesn't naturally lead to a solution.

Example of duplicates with context

Here's an example, looking for duplicate names in assets. It is perfectly acceptable to have duplicate names in assets, but sometimes it may cause problems.

After the example, there is a template that you can edit and use, or you can edit and use the example of course.

SELECT a.assetName, a.assetpk, a.assetid
FROM asset a
INNER JOIN (
    SELECT assetname
    FROM asset
   GROUP BY assetname
    HAVING COUNT(*) > 1
) d ON a.assetname = d.assetname
WHERE assetName like "CKX-%" -- so only show duplicates that start with CKX-
ORDER BY a.assetname;
 

Template

SELECT -- a.<column1>, a.<column2> etc..,
FROM <table> a
INNER JOIN (
    SELECT == <column(s) to look for dups>
    FROM  <table>
    GROUP BY -- <same column(s)
    HAVING COUNT(*) > 1
) d ON a.<columnFordup> = d.<columnFordup> -- both should be the same since they are the same table
-- WHERE <condition>
ORDER BY a.assetname;