Deleting Duplicate Records in FileMaker

Leave a comment

7-13-2007 by Jesse Antunes

Duplicate data always seems to creep it’s way into your system no matter how hard you try. Unfortunately, in FileMaker there is no easy way to deal with junk. Or is there?

This process depends on the fact that you have a unique id in your table. Ideally, this is an auto-enter serial number. If you have one of these then we are ready to go… if not, shame on you! You should pretty much always have a unique id of some type in your table and quite frankly you might as well make it a serial number. Go ahead and create one now… I’ll Wait.

Alright, at this point you are ready to go. Now you just need to decide what fields determine if the data is duplicate. If you want to weed out all products with the same sku or if you want to axe all customers with the same first name, last name and zip this solution is going to work for you.

Now that you have your fields ready, we are going to create a self-relationship. Create a new table occurrence of the table you want to remove duplicates from and connect it to the original one. Using my two examples from above, here is what your relationships should look like:

This relationship links a product record to any other duplicates.

or

This relationship links a customer record to any other duplicates.

This relationship says: “Find me all the records that don’t have the same unique id as this one but have matches in all the fields I want.” You can add as many predicates to this list as you want and it will still work.

Note: Predi-whats? Predicates. That’s the nerd word for the conditions you add in the Edit Relationship dialog box when you tell FileMaker how two tables relate.

Now that you have your relationship setup you have to make yourself a looping script. For this example, the layout containing all the records is called record_list and the relationship we are using to determine if there are duplicates is named duplicate_records:

 Go To Layout[ `record_list` ]
 Show All Records
 Go to Record/Request/Page [ First ]
 Loop
    If[ Not IsEmpty( duplicate_records::ID ) ]
       Delete Record/Request [ No Dialog ]
    Else
       Go to Record/Request/Page [ Next, Exit After Last]
    End If
 End Loop

So you are basically just looping through all the records and checking to see if there is a related duplicate…. If there is you don’t need it, so we just remove the record.

Note: This method of deleting records will remove the oldest record in the table in terms of creation order. So if you were to create a record today and one tomorrow, the record you created today would be deleted. If you would like to keep the oldest version of the duplicate record, just sort your ID field in descending order before executing the above loop

Instead of showing all the records, you first to narrow your found set by simply finding the records that have duplicates. To do this just execute a find on the id field of the duplicate_record table occurrence while you are on the record_list layout. Whatever you do… your table is now free of pesky, garbage data.

38 Comments

  1. Robert Jackson

    H there – I tried this and thought it was a pretty good idea. However I had a problem because the deleted duplicate was the original and this meant that any related tables using that id key would now be broken.
    e.g
    id: 10001, firstname: Robert, lastname:Jackson, employer: FileMaker
    id: 10002, firstname: Robert lastname:Jackson, employer: FileMaker
    The deletion of the duplicate would be id:10001. This key may be used in tables such as addresses or tasks or communications etc. Deleting id 10001 would orphan all records that had used that ID and wouldn’t pick up on the id now being 10002!
    Better to be able to delete the most recent record created by altering the script to go the last record and then loop to prev before deleting most recently created record/s.
    Whatdya think?

  2. Jesse Antunes

    @Robert – Thank you for this excellent comment. You could definitely achieve the desired results by going to the last record and looping backwards or by just doing a descending sort on the ID field first. The only reason why I hesitate to do the loop starting from the bottom is because I have never done a reverse loop before. Maybe it’s time to expand my horizons! I will edit the post so as to warn people about which record they are deleting

  3. Geoff Coffey

    @jesse, @robert: Also, it could conceivable be argued in some cases that you should keep the last record since it was the one most recently added. But different situations may call for different priorities. And Robert, you’re absolutely right to point out that (as is always the case when deleting) you have to be careful about related data.

    Jesse, it would be fun some day to do another version of this that keeps a list of IDs as it goes, and fixes related records once it gets to the final (keeper) record.

  4. David Head

    Just as a matter of interest, what is the purpose of the first script step:

    Go To Layout[ original ]

    Curious ;-)

  5. Jesse Antunes

    @David – it’s to align the stars appropriate so the nexus… of er… Saturn is peaked…. Um…. Yea it’s a little confusing. I actually named my layout original… as in it displayed original records. It’s definitely misleading so I changed it. Thanks David!

  6. Steve Wright

    I have been using this technique for quite some time, Im just planning to create a “syncronise feature” regarding which one to delete is a bit of a pain.

    I have a unique id throughout my tables, I also have a created and modified field along with a document id.

    In theory I should be able to check if doc id’s match, unique id’s dont and sort by the last modified date, however…

    On importing filemaker changes the modified date to the date of import and as perform auto enter options is required for creating a new unique id this creates a problem.

    Do you have any thoughts on this o a workaround to save my sore head?

  7. Bob Walker

    I work for an elementary school in Orlando. Is it possible to use the same scrip to evaluate duplicate addresses and then omit it? Is this even usable in FMP6? After I created the script and just substituted the Delete for Omit it omitted all records from the file.

    Thank you for your help.

  8. Geoff Coffey

    @bob: Unfortunately, you said “FileMaker 6″ which makes smoke start coming out of Jesse’s ears. He is, like, 12 or something. So I’ll answer this one:

    This script is made simple because the records are deleted as you go. This means that once you’ve removed all the duplicates but one, there will be no more duplicates, so the last won’t be deleted. If you Omit instead, it won’t work right (as you discovered). If you want to Omit you’ll have to get a little trickier. First, remove the ID ≠ ID predicate from your duplicates relationship. Then change this line:

     If[ Not IsEmpty( duplicate_records::ID ) ]

    to this instead:

     If[ Customers::ID <> duplicate_records::ID ]

    (Only, instead of Customers, you’ll use the main table occurrence for the table you’re removing duplicates from.) This code will cause the stuff in the If to run on all but the first of the duplicate records. Then you can switch the Delete step to a Duplicate and it should work.

    As for FileMaker 6, this technique won’t translate to 6 directly because 6 can’t do multi-predicate relationships. If you must do it, you can create a concatenated key — in other words, create a stored calculation field where you string together all the values you want to match on. Then create a new relationship matching this key to itself. Bear in mind, though, that this will only work with a relatively short set of data (approximately 80 characters, although FileMaker 6′s maximum index entry length rules were esoteric, so that’s not exactly right).

    Your best bet, of course, is to use 9 :)

  9. Douglas Alder

    One approach to preserving related records is to add some logic to the loop that checks to see if there are related records in various linked records and then use script variables to reassign these records.

    I did this for a solution where new records were being imported from a web form. I have a check to see if the ‘first name’ and ‘last name’ and ‘birthdate’ match. If so, then I offer to update the info.

    Still some records slip through and I have duplicates occasionally.

    So I made a very wide layout and put portals to important related records (invoices, payment info and correspondence, etc.)

    I made a script that searches for duplicates (I didn’t go the self-join portal route)

    Then I made two buttons, one that are labeled ‘Keep this Record, Delete Below’ and the other says ‘Keep this Record, Delete Above’ Depending on which button you push, the related records are assigned either to the one you are on, or the one below it and the record that has been stripped of it’s related records is deleted.

    There will likely be some fields contained on the record to be deleted that I want to keep the values, so I set the values of those fields to a batch of script variables, then switch records and set those fields to the newly created script variables.

    Then I run through the portals and do something similar.

    Here is an example of a loop I use:

    # Check to see if there are any related Payments
    If[Count(People_PAYMENTS~ID_People)>0]
       # If there are, then set a variable with the number of related Payments.
       Set Variable ($PaymentsPortalCount; Value:Count(People_PAYMENTS_ID_People:ID People)
       # Run a loop to reset these to the new ID People
       Loop
          # Exit the loop if the current Portal Row Number exceeds the variable set in the previous step.
          Exit Loop If [get(PortalRowNumber)>$PaymentsPortalCount]
          # Set the field to the ID People from the previous record (we set this variable before when on the record that we want to keep).
          Set field [People_PAYMENTS~ID_People; ID People: $IDPeople]
          Go to Portal Row [Select; Next]
       End Loop
    End If
    

    It is a little complicated, but basically, I am selecting the record I want to keep, storing it’s ID in a variable, then going to the other record and reassigning all the ‘soon-to-be-deleted’ record’s related records to the script variable’s ID.

    Clear as mud?

  10. Geoff Coffey

    @douglas: Great comment. Jesse and I talked about the related records question and figured we would use a similar approach. The only thing I would add is that I tend to avoid scripting portals. You may have to add an extra layout or two, and you need a few extra script steps, but in the end, I think it is safer to Go to Related Records and loop through the records instead.

    The trouble with portal looping is twofold: First, the portal has to remain on the layout. I try hard to avoid scripts that will break if someone “cleans up” a layout. Second, it is too easy to lose your place on a portal. Many things can cause FileMaker to switch back to the first row (like committing the record, switching modes, switching layouts, etc…). Sometimes a script grows beyond our original expectations, and it becomes very hard to remember that you need to be careful about portal context. (For example, imagine you later call a subscript from inside your loop to do something shared by other scripts. Now you just have to know not to commit in that new subscript).

    So if it were me, I’d revise your script to switch layouts for each relationship. But otherwise, I love it :)

  11. Keriann K

    Thanks so much for this. It has been a great solution for me, even as someone who is new to using scripts.

    One question- is there anyone to bypass that nagging dialog box that asks if I’m sure I want to permanently delete each record? I’m not looking forward to spending this afternoon clicking delete several thousand times…

    Thanks again!

  12. Geoff Coffey

    @keriann: If you look closely, you’ll see the No Dialog option on the Delete Record script step. That means that in FileMaker, when you select that script step, you should turn on its “Perform without dialog” option. This tells FileMaker not to show a dialog box before deleting the record. Hope this helps!

  13. Erik Cayré

    I just thought of a way to solve the orphaned related records problem:
    Add a text field to the table where the dupes will be deleted, and populate it with the unique ids already present in the table’s primary key field.
    Then change all relationships from using the unique key field to using the new field. Everything will still be related exactly as before.

    Now, in the find/delete dupes loop, just before deleting: add the key of the record being deleted to the new key field of all self related records, separated by a newline.

    The end result will be that after all dupes are gone, all records formerly related to any now deleted record, will now be related to the ‘surviving’ record effectively consolidating all related redords to the kept records…

    Remember to add an auto-enter calc to the new key field which adds the real unique id, in case it is missing, so that the mechanism stays functional).

    Hope this helps!

  14. Paul

    A duplicate record in my book is a record where ALL fields are identical to the original record (with the exception of the record number or record ID). If you want to remove duplicate records ALL fields have to be checked – save the ID field. You can do that by integrating the values of all fields (except record ID) of one record into one variable x – like a checksum over the record would do, and then insert that value into a new “checksum” field created for the purpose:

    x = field_1 & field_2 …field_n

    Then sort all records based on the “checksum” field, so that identical records are next to each other.

    Then simply compare the value of x of record 1 with the value of
    x of record 2 – and if they are identical you can delete one of the two records.

    If you have more than two duplicate records you simply run the script again – until the number of records stays constant.

    As my Filemaker sadly does not allow me to export scripts I cannot show you the actual text of the script.

  15. Geoff Coffey

    @Paul: That will work in many cases, but it has a pretty significant drawback. If you make your calculation field stored, you are effectively doubling the amount of data FileMaker has to store and manage. That can have a significant impact on backup systems, increase the cost of disk systems, and slow things down if you’re dealing with a large database.

    If you make the calculation unstored, then the search performance will be very poor in a medium-to-large table.

    The method in this article will perform well on large tables, and although it might introduce some indexing overhead, the impact will be much less than a full duplication of all data.

  16. Dave Bertram

    I just want to compare last names and selectively delete the ones I don’t want. Do I need a script or is there a feature, like in itunes that shows duplicate records ?

  17. Geoff Coffey

    @dave: Switch to Find mode, put ! in the Last Name field, then sort by Last Name. The ! find operator finds all records with duplicates in that field.

  18. The solution seems elegant since it doesn´t require the definition of new fields (except for the ID that almost every DB has already) but…. I don´t get one thing:

    If[ Not IsEmpty( duplicate_records::ID ) ]

    What is duplicate_records? a table? relations can have a name?
    The tables in your example are named products and Products_dup, so the notation duplicate_records::ID don´t make sense to me).

  19. Jesse Antunes

    @Javier: I can see how it’s a little bit confusing. Just before that code block I switched my example to a more generic look. So `duplicate_records` is the table occurrence that is used to determine if there is a duplicate record.

    So if you were looking at the first two examples, the fields you would check for are Products_dup::ID or Customers_dup::ID. If you find a value in those fields, then you know you have a duplicate record and you can just delete it.

  20. Søren Dyhr

    I’ve just learned that the approach showed here, not is most potent algoritm one could come up with, using the scriptmaker steps availiable.

    Following script using a tecnique introdced to me by Agnès Barouh of Tic Tac:

    [code]Sort Records [ Speci?ed Sort Order: Contacts::FistName; ascending
    Contacts::LastName; ascending
    Contacts::ZipCode; ascending ]
    [ Restore; No dialog ]
    Replace Field Contents [ Contacts::gStrainedList; Replace with calculation: Case(
    Get( RecordNumber ) ? 1 and Contacts::FistName = $t and
    Contacts::LastName = $u and Contacts::ZipCode = $w;
    Let( $theIDs = $theIDs & ¶ & Contacts::RecordID; "" );
    Let(
    [
    $t = Contacts::FistName;
    $u = Contacts::LastName;
    $w = Contacts::ZipCode
    ];
    ""
    )
    ) ]
    [ No dialog ]
    Set Field [ Contacts::gStrainedList; $theIDs ]
    Go to Related Record [ From table: “Contacts 2”; Using layout: “Contacts” (Contacts) ] [ Show only related records ]
    Delete All Records [ No dialog ]
    [/code]

    –sd

  21. DaveMill

    Hi, all,

    I’ve been playing around with the built-in ! operator in find mode, which finds all records with non-unique values in a field.

    If I create a calculation field to concatenate multiple fields, I can use the ! operator to find non-unique values in multiple fields.

    And since my databases (and I suspect those of many of you) are so complex, I don’t want to let a script automatically delete first, last or any other duplicate value. I need to make that decision myself.

    So my deduplicate solution is to use the ! operator, sort the results, then click a check box field to identify records I want to delete. This works great for me.

    Later!

  22. Craig

    If you are unsure about running a loop on a large DB you could also export all data into Excel and use remove duplicate filter.

    Google: excel remove duplicates

  23. Kevin Lincicome

    In my DB I have a list of names/ID’s in one table, let’s call this “Old Customers”. I then have a different table with another set of names/ID’s, I call this “Current Customers”. I want to email an offer to all “Old Customers” as long as they are not Current Customers. How can I delete from “Old Customers” anyone that is also a “Current Customer”?
    This script looks like it should work, but seems to delete the entire list of “Old Customers”. Should there be a 3rd table “Duplicate Customers” or is the relationship Old Customer::ID = Current Customer::ID correct?
    Sorry… I have been trying to wrap by brain around this and may be confused on something quite simple!
    Thanks for any help.

  24. Kevin Lincicome

    Relationship does not appear above, it should read;
    “or is the relationship Old Customer::ID = Current Customer::ID correct?”

  25. Awesome! many would say the holy Grail of filemaker to be able to remove duplicate records! Worked like a charm.

  26. Dave

    If two field records are blank, the comparison does not work and the records are not deleted. What should I do?

  27. Wayne

    I’m self taught using FM9 and am trying to remove dupes as indicated. Am I right in assuming that the duplicate table populates automatically from the record list once the relationships have been defined, or do I have to do something else?

    Thank you.

  28. Atif Khan

    I created this script and it worked fine. $ID will copy the value for which I am looking a duplicate record. Sort the records first and run this script.

    I don’t know how to copy the script form FileMaker to here so this is just my typing here.

    Show all records
    Sort Records
    Go to First Record
    Set Variable $Name=cFirstNameLastName (assuming I am checking on this field)
    Go to Next Record– to match the $name
    Loop
    IF( cFirstNameLastName=$NAME
    DELETE RECORD — it is same as previous record delete it
    ELSE
    Set Variable $Name=$NAME –if not reset the variable to current records value
    GO to Next Recrd (next, exit after last)
    End if

    END LOOP

  29. Atif Khan

    I created this script and it worked fine. $NAME will copy the value for which I am looking a duplicate record. Sort the records first and run this script.

    I don’t know how to copy the script form FileMaker to here so this is just my typing here.

    Show all records
    Sort Records
    Go to First Record
    Set Variable $Name=cFirstNameLastName (assuming I am checking on this field)
    Go to Next Record– to match the $name
    Loop
    IF( cFirstNameLastName=$NAME
    DELETE RECORD — If it is same as previous record delete it
    ELSE
    Set Variable $Name=cFirstNameLastName –if not reset the variable to current records value
    GO to Next Recrd (next, exit after last)
    End if

    END LOOP

  30. Don Walker

    I’m trying to work a variation on this theme. I’m working with a sports league which requires each player to be individually registered. I can’t delete the “duplicate” records but want to create a mailing list that has each address once with the first names concatenated into a “names” field to go along with the last name and address fields.
    thanks,
    Don

  31. Nunz

    thnx Jesse and Geoff

    As always, invaluable tips and tricks. FYI, i am using this technique in place of the ‘validation’ rules in the field definitions. So I am allowing a user to create a record and enter all the details (in the case a ‘company record’). Then, before the record becomes ‘active’, i use this tip as part of the ‘error check routine’ to ensure they haven’t created a duplicate name or company code.

    The reason i’m doing this is so that they can fill in a record (and leave it in complete if they need to…while they hunt down phone numbers and stuff we REQUIRE to setup an account). Once they have all the info in it, they ‘activate’ the record (meaning work can be done with that record) and then i use this as part of the error checking step…to ensure we don’t GET duplicates!

    So with a bit of luck, using this tip HERE, will stop the duplicates occurring…which was the reason for the article in the first place!

    Once again. Thnx

  32. Emma

    Many thanks indeed, that saved a good few hours for me!

  33. Gerardo Verrone

    Hi Atif Khan,
    Yesterday and without looking here I’ve performed the same script as you. The reasson why I’ve started to research for other way was because if we do it like you mentioned in a big table it takes so long.
    For example I want to filter duplicated records in a table with more than 16.000 records and the result was 27 minutes to clean it in a Dual Xeon brand new computer.
    I imagine two possible ways to sort this but I did not tried yet any of this scenarios.
    First Case
    1) Sort the table by the desired field I want to filter duplicates.
    2) Introduce later a Field called something like Id_Number as serial non repetitive value so each record has a different number.
    3) Inside a Loop structure grab the first record and store his Id_Number in one variable and the field you want to clean in other variable.
    4) Perform a find looking for records equal to the one you want to delete and loop while you delete but bypassing the first one. This can be achieved by ommiting the first from the found set or simple jumping two records in this internal loop only the first time.
    5) Delete all the found set.
    6) Show all records
    6) Perform a search to return to your record_id stored at 3, so the pointer will jump to that record
    7) No you only need to go to next record (until last) and repeat alll the steps from 3 in advance.

    Basically this structure ios the same as yours but istead of looking with each record for a repetition along all your table in a looping way you are grabing apart all the repetitive stuffs to delete it in just one step. I think this will be quicker.

    The second method I have found is the one ilustrated in the help of filemaker using twice the same table in a relationship. Look in the help of FM11 for something like “duplicate” or look at here http://forums.filemaker.com/posts/690cd1e356

    Good Luck
    G.

  34. Gerardo Verrone

    It’s Me again and with good news !!!
    Mi first idea works pretty quick with a big table.

    I will explain my fields and type including the full script as with this I can handle a clean process in just .17 seconds compared to 27 minutes using the record by record way.

    Table name: People

    Record_ID (Serial)
    Full Name (Text)
    Soc_Security_Number (Number)
    FCount (calculation with the formula Get ( FoundCount ) and the tick Do not evaluate if all… at the bottom uncheked)

    Start_Time (Time) Global
    End_Time (Time) Global
    Total_Time (Time) Global

    The table has 16.000 records and most of them have the social number and the Full name repeated several times but.

    Inside the table is arranged with all the repetitive Full Names twogether one after the other like This:

    Id F. Name Social Number

    1 Jhon Smith 232323
    2 Jhon Smith 232323
    3 Maxwell Hunt 122343
    4 Maxwell Hunt 122343
    5 Maxwell Hunt 122343
    6 Maxwell Hunt 122343
    7 Karen Wright 698956
    8 Karen Wright 698956

    and so on until 16.000

    The Script

    Set Error Capture [on]

    #Set the Time when the script start in a global (Time type) field. Use time not time stamp as with time stamp is not possible to substract time values.

    Set Field [Start_Time; Get(CurrentHostTimeStamp)]
    Show all records
    Sort Records [by Record_ID] Just in case when you come to this layout or table it is unsorted.
    Go to record/Request/Page [First]

    Loop
    Set Variable [$Social; Value:Soc_Security_Number]
    Set Variable [$Record_Num; Value:Record_ID]
    Perform Find [looking for the equivalence Table:Soc_Security_Number = $Social. With this all the other records are omited]
    #If More than one record found, delete but omiting the first one.
    If [People:FCount > 1] #Meaning you have more than one record in the last found set
    Go to Record/Request/Page[First]
    Omit Record #This will keep the first one
    Delete all Records [No Dialog] # This delete all duplicated
    End if

    # Return to the position of the Record omited to jump to the next one and continue cleaning.

    Perform Find[$Record_Num]
    Show All Records
    Go to Record/Request/Page [Next; Exit after last]
    End Loop

    Set Field [End_Time; Get(CurrentHostTimeStamp)]
    Set Field [Total_Time ; Start_Time - End_Time]

    # Show a Message stating when the scipt has finished comparing actual time and start time from a global fields.

    Show Custom Dialog [“Process Finished” & “Start Time: ” & Start_Time & “End Time: ” & End_Time & “Total Time: ” & Total_Time

  35. Gerardo Verrone

    Finally Guys I’ve tested the Filemaker Suggested Method of two related tables, counter, calculation fields, etc and the result shoked me.

    Import Time From Excel 18 Sec.
    Amount of Start Records to Clean: 16592
    Cleaning of Duplicated Records 6 Sec.
    Amount of records Unique: 2387

    With a couple of test, imagination and the amazing suggested FM instructions I have decreased a 27 min duplicated search to just 6 sec.

    All the instructions are posted here: by the user: “bodhibrian”

    http://forums.filemaker.com/posts/690cd1e356

    Cheers bodhibrian to give all of us a great and fresh idea to speed up the process.

    Gerardo Verrone.
    animatorstudio@hotmail.com

  36. Fay

    worked a CHARM, thanks!

  37. Goetch

    One tip for large databases is to insert a freeze window before you loop. This drastically reduced the time it took for this simple solution to work. My database of 25000 records originally took about 20 minutes to run, after the freeze window it only took 1 to 2 minutes to run.

  38. Erica

    In order to decrease duplicates, is there a method for altering the Command-D shortcut to insert a message – “Do you really want to duplicate this record?” and a cancel or go forth?

Tell Us What You Think

*
* (will not be published)