MySQL: Find Duplicate Entries in a Table

Here is a little background for this issue. The database used to run on ancient copy of MySQL until very recently we upgraded it to 5.0. Not without some headaches of migrating the database, but it worked. Common problem we had with this database was duplicate entries in printed reports. Why did these happen? Mostly because of user error combined with a lack of foreign key constraints.

The two MyIsam tables that were causing the issue were: report and proforma. Both were used for tracking documents through the review process. Report was the actual field audit report that would be submitted, reviewed and sent to the clients, while proforma was the documentation of expenses that our employees were supposed to submit for each assignment. Both would store dates, comments, notes and other info regarding said documents. The proforma had a 1-1 relationship with report (each proforma was tied to a single report entry). But since the old MySQL version did not support Foreign Key constraints this was not enforced (or rather only enforced by the PHP front end).

However users would continuously find ways to submit multiple proformas for the same report – for example by clicking the “submit” button 7 times. Other fun trick for re-submitting proforma was to send it in as TBA (to be annouced). Upon seeing the TBA the clerical staff in the office would manually associate it with appropriate report, usually without checking if another copy is already in the database. And when we locked that down, they would just continuously send bug reports about it forcing us to sort this out at the DB level.

What happened if the non-existent FK constraint was violated? One of the main reporting sections of the site used a complex join across these two tables. If there were two or more proformas per report, that report would show up on the list multiple times. What do people do when they see duplicates on the list? They start deleting them. Problem is – these were not real duplicates, but the same entry repeated several times. Deleting one copy would hose all of them, causing yet more support request for restoring the entry from the nightly DB dump.

Fun times. So after we switched to a DB engine that was actually developed in this century I immediately switched the tables to InnoDB and put a freaking foreign key constraint on that relationship. This way if someone finds a new loophole in the PHP code they will end up seeing a nice MySQL error instead of creating new duplicates.

But guess what – I just got more bug reports about duplicate entries. Apparently the FK constraint is not retroactive, and MySQL will gleefully allow duplicate keys to exist in your FK column when you apply the constraint. I figured that it would start trashing about and keep giving me errors if that was the case, but no – all it cared about was an index on the key column. So I was left with a task to track down all the duplicate entries in the foreign key column in my proforma table. How do I do this? This problem actually got me scratching my head for a bit, until I had a sudden epiphany:

SELECT 
	id, report_id, COUNT(*)
FROM
	proforma
HAVING
	COUNT(*)>1

I found around 17 separate instances of the foreign key (here report_id) being repeated anywhere from 2 to 8 times. All of these were for older entries that no longer show up on the first page of results, so naturally no one noticed. Still, it shows that the problem was more prevalent than we initially suspected.

What I’m really trying to say here is this: use database constraints to enforce table relationships. This is the only proper and effective way to do it. Trying to enforce constraints in software is just asking for trouble as you create multiple points of failure. Different parts of the code will update any given table at different times, and all of them must check this. Failing to implement proper checks in every piece of code that touches that table may lead to duplicate entries in FK columns and similar undesired side effects. A simple constraint on a column will stop this from happening much more effectively, with much less effort.

[tags]mysql, sql, database, database design, foreign key, key, constraint, php[/tags]

This entry was posted in programming and tagged , . Bookmark the permalink.



4 Responses to MySQL: Find Duplicate Entries in a Table

  1. Ricardo INDIA Mozilla Firefox Windows says:

    Isn’t it also the case of having multiple of the same proforma with different ids?

    In that case, you would need to identify columns that would have the value and compare them, like this:

    SELECT
    id, report_id, title, date
    FROM
    proforma p1, proforma p2
    where
    p1.title = p2.title
    and p1.date = p2.date
    and p1.report_id = p2.report_id
    and p1.id p2.id

    Since you said this is a 1-1 relationship, there shouldn’t be no results from this query.

    Reply  |  Quote
  2. Luke Maciak UNITED STATES Mozilla Firefox Windows says:

    Actually, this is a whole separate problem since all proformas have unique filenames associated with them. So checking for duplicate entries submitted under wrong report ID is always a wild card search.

    And this is actually one of the problems that is a 100% end user problem, and can be easily fixed by the office staff without involving the IT.

    Reply  |  Quote
  3. Nadab INDIA Mozilla Firefox Windows says:

    hi all, Few days back i was looking for mysql administrative tool. i came across sqlyog which has super cool features like data sync, migration from various like access, sql server with much more powerful tools. I was very impressed with the GUI and also their multiple databases connections. Just have a look into it !!!

    Reply  |  Quote
  4. Derek Mozilla Firefox Windows says:

    I think the code should be:

    SELECT 
    	id, report_id, count(*)
    FROM
    	proforma
    GROUP BY
    	id, report_id
    HAVING
    	count(*)>1
    Reply  |  Quote

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>