
Identifying potential duplicates

  1. #1
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Identifying potential duplicates

    Hi everyone,

    Something that has been on my mind for a while now, but I'm unsure how to approach it. I have data in a column (words, numbers, punctuation) that I have to remove potential duplicates from. If the data is exactly the same, Excel 2007 can easily identify the duplicates using conditional formatting, but when the data is slightly different I don't know of any way to identify them. A simple example would be the use of American vs. British English spelling, or the presence of an extra space between two words. Does anyone know of a way to sort data according to the similarity of the contents of the cells in a column?

    Thanks.

    abousetta
    Last edited by abousetta; 08-03-2010 at 01:31 AM.

  2. #2
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Further random thoughts...

    Word-level:
    Could we compile an array of all the words present in a cell, with the number of times each appears, and compare the arrays from several cells in the same column? [I have absolutely no idea if this is feasible, but it sounds like it could be possible.] I know shg and others have produced macros for counting the words in a text. Could this be extended to allow the comparison of two strings of words?
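
    A minimal sketch of that word-level idea (the function names and scoring are my own assumptions, not code from any attachment): tally the words in each of two strings with a Dictionary, then score how much the tallies overlap.

    ' Count how many times each word occurs in a string (case-insensitive).
    Function WordCounts(ByVal s As String) As Object
        Dim d As Object, w As Variant
        Set d = CreateObject("Scripting.Dictionary")
        d.CompareMode = vbTextCompare
        For Each w In Split(Application.Trim(s), " ")
            d(w) = d(w) + 1
        Next w
        Set WordCounts = d
    End Function

    ' Dice-style overlap: 1 = identical word counts, 0 = nothing in common.
    Function WordOverlap(ByVal s1 As String, ByVal s2 As String) As Double
        Dim d1 As Object, d2 As Object, k As Variant
        Dim nShared As Long, nTotal As Long
        Set d1 = WordCounts(s1): Set d2 = WordCounts(s2)
        For Each k In d1.Keys
            If d2.Exists(k) Then nShared = nShared + WorksheetFunction.Min(d1(k), d2(k))
            nTotal = nTotal + d1(k)
        Next k
        For Each k In d2.Keys: nTotal = nTotal + d2(k): Next k
        If nTotal = 0 Then Exit Function
        WordOverlap = 2 * nShared / nTotal
    End Function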

    Character-level:
    Could we compile (dare I say "array" again) a list of all the characters used in a cell, with the number of times each character is used, and compare this list with other cells?
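
    The character-level variant is the same pattern one level down (again a sketch with assumed names):

    ' Tally each letter/digit in a string, ignoring case, spaces and punctuation.
    Function CharCounts(ByVal s As String) As Object
        Dim d As Object, i As Long, c As String
        Set d = CreateObject("Scripting.Dictionary")
        For i = 1 To Len(s)
            c = LCase$(Mid$(s, i, 1))
            If c Like "[a-z0-9]" Then d(c) = d(c) + 1
        Next i
        Set CharCounts = d
    End Function

    Two cells whose character tallies (nearly) agree would then be flagged as potential duplicates.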

    I have found code written by Excel MVP Rick Rothstein, presented on John Walkenbach's site, that determines whether a particular word is contained in a text string. Could this be modified to bring the word-level idea to life?

    [VBA code omitted: visible to logged-in forum members only.]
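
    Since the snippet is behind the forum login, here is a sketch of a whole-word containment test in the same spirit (an approximation from memory, not necessarily the original):

    ' True if Word appears in Text as a whole word, not as part of a longer word.
    Function ExactWordInString(ByVal Text As String, ByVal Word As String) As Boolean
        ExactWordInString = " " & UCase$(Text) & " " Like "*[!A-Z]" & UCase$(Word) & "[!A-Z]*"
    End Function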
    Any thoughts or ideas are welcome.

    abousetta

  3. #3
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    Using the worksheet TRIM function on the list would eliminate the problem of leading / trailing / multiple interior spaces.
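
    For example, a sketch that assumes the list is in column A of the active sheet:

    ' Worksheet TRIM (unlike VBA's Trim) also collapses runs of interior spaces.
    Sub TrimColumnA()
        Dim c As Range
        For Each c In Range("A1", Cells(Rows.Count, "A").End(xlUp))
            c.Value = Application.Trim(c.Value)
        Next c
    End Sub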

    How long is the list?

    Can you post a large example?
    Entia non sunt multiplicanda sine necessitate ("entities must not be multiplied beyond necessity")

  4. #4
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    The list can be pretty long (from a couple of rows of data to several thousand rows). I have attached an example of data that I got for one journal from PubMed. I would only be comparing the data within each column, not across columns.

    Thanks again.

    abousetta
    Attached Files

  5. #5
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    So give an example of a near-duplicate. In order to be considered, would columns B, C, and D ALL have to be 'close' to one another?

  6. #6
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    I would run the duplicate search on A, B, or C separately: first on column A, checking all its duplicates, then on column B, and so on.

    I have attached a mock sample workbook that shows small differences. I have three sets of examples and have highlighted the differences between the two rows in each set.

    Thanks again.

    abousetta
    Attached Files

  7. #7
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    OK, I'll look at this in the next few days.

  8. #8
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    As always a million thanks shg.

    abousetta

  9. #9
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    Any thoughts on how to proceed? If my thinking is illogical or less intuitive than another approach, I am happy to re-evaluate the proposed ideas.

    Thanks again.

    abousetta

  10. #10
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    Run sub x. It takes a while once it gets to the long entries in col D.
    Attached Files
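
    For readers who can't open the attachment, here is a sketch of the kind of edit-distance (Levenshtein) function a comparison macro like this is typically built on. The name EditDist comes up later in the thread; the body below is my reconstruction, not shg's actual code.

    ' Classic two-row Levenshtein distance: the minimum number of single-
    ' character insertions, deletions and substitutions turning s1 into s2.
    Function EditDist(ByVal s1 As String, ByVal s2 As String) As Long
        Dim prev() As Long, curr() As Long
        Dim i As Long, j As Long, cost As Long
        ReDim prev(0 To Len(s2)): ReDim curr(0 To Len(s2))
        For j = 0 To Len(s2): prev(j) = j: Next j
        For i = 1 To Len(s1)
            curr(0) = i
            For j = 1 To Len(s2)
                cost = IIf(Mid$(s1, i, 1) = Mid$(s2, j, 1), 0, 1)
                curr(j) = WorksheetFunction.Min(prev(j) + 1, curr(j - 1) + 1, prev(j - 1) + cost)
            Next j
            For j = 0 To Len(s2): prev(j) = curr(j): Next j
        Next i
        EditDist = prev(Len(s2))
    End Function

    A driver such as sub x would then compare every pair of cells in a column and report the pairs whose distance falls within MAXDIST.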

  11. #11
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Thanks shg... I will give it a test run and let you know how it works.

    Have a great weekend!!!

    abousetta

  12. #12
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    Thanks for working on this. I noticed one problem with the macro: it seems to skip comparisons of cells within the same column. I tested it with a new dummy set of letters, numbers, and punctuation, and noticed that punctuation (e.g. commas, periods, etc.) completely throws it off; I don't know how to work around this.

    The example below works perfectly:

    Number Col_1
    1 a
    2 ab
    3 abc
    4 abcd
    5 abcde

    Then I started to add punctuation, spaces, etc., and things fell apart. The macro seems to compare each cell with the next cell only and to ignore all other cells in the same column.

    I have an idea to bounce off you that might be a workaround. Please let me know what you think...

    We could create a copy of the worksheet on a temporary sheet and remove all spaces and punctuation, so that each cell holds only a single continuous string of letters and numbers. Then have the macro compare the cells in the same column and present the results in a new worksheet (which it does now), and finally delete the temporary worksheet. What do you think? Or is there a way to leave everything as is, but direct the macro to ignore spaces and punctuation when comparing the strings?
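
    The "ignore spaces and punctuation" route would not even need a temporary sheet; a small normalising function (a sketch, the name is my own) could be applied to both strings just before they are compared:

    ' Keep only letters and digits, lower-cased, so spacing and punctuation
    ' differences disappear before two strings are compared.
    Function NormalizeText(ByVal s As String) As String
        Dim i As Long, c As String, out As String
        For i = 1 To Len(s)
            c = LCase$(Mid$(s, i, 1))
            If c Like "[a-z0-9]" Then out = out & c
        Next i
        NormalizeText = out
    End Function

    The macro would then call EditDist(NormalizeText(a), NormalizeText(b)) instead of EditDist(a, b).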

    Thanks again shg!!!

    abousetta

    CORRECTION: There is no error. Check my later post to see what I was doing wrong.
    Attached Files
    Last edited by abousetta; 08-02-2010 at 02:41 AM. Reason: There was in fact no bug in the macro.

  13. #13
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    I don't know why you say it's wrong. It reports the edit distances accurately.

    Quote: "The macro seems to compare the cell with the next cell only and ignores all other cells in the same column."
    Change the constant MAXDIST to something larger than its current value, which is 4.
    Last edited by shg; 08-01-2010 at 02:23 PM.

  14. #14
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    I apologize for wrongly identifying an error when in fact there was none. You are correct: the problem was not the punctuation, as I had thought, but simply the MAXDIST value, as you pointed out. When I increased it and re-ran the macro, it worked correctly. You were thinking one step ahead of me when you wrote the macro: once two strings are more than a few edits apart, the probability that they are duplicates is minute, so there is no reason for the macro to keep comparing strings that are obviously not the same.

    Once again a thousand apologies and a million thanks for a job well done.

    abousetta

  15. #15
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    Just a question regarding efficiency. Would there be a way to make the function EditDist stop once MAXDIST has been reached? I tested the macro with large strings, and it takes roughly the same amount of time to compare two strings whether they differ by one deletion or by 100 deletions. In the former case the macro prints the result; in the latter it just discards the pair and moves on to the next comparison. Is there a way to stop a comparison from running once it reaches MAXDIST, to speed up the process?

    Thanks for your insight.

    abousetta

  16. #16
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    Replace the code with this:
    [VBA code omitted: visible to logged-in forum members only.]

  17. #17
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    Thanks for all your help. I have run it and compared the results, but I must be doing something wrong. The modified macro works just like the original when MAXDIST is a very large number (e.g. 10000000), so that all changes are captured. When I drop it down to a realistic number, it does not return anything.

    Example: in the attachment I have two nearly identical strings to compare, with only one difference (the first letter of the second string is deleted). If I use the large MAXDIST, it returns the value 1, which is correct. When I change MAXDIST to, say, 6, it returns a blank, which as I understand it means the number of changes is above 6; but that isn't actually the case.

    Could you please help me understand what I am doing wrong?

    Thanks again.

    abousetta
    Attached Files

  18. #18
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    I think it is the case that I don't understand the EditDist algorithm well enough to short-circuit it as I tried to do. I don't have another suggestion ATM, abousetta, sorry.

  19. #19
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    Actually, try this:
    [VBA code omitted: visible to logged-in forum members only.]
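
    That replacement is also behind the forum login, so here is a sketch of a short-circuiting edit distance that is safe to cut off early (the name EditDistCapped and the -1 sentinel are my own choices): after each row of the dynamic-programming table is filled in, it checks the row minimum, which can never decrease from one row to the next.

    ' Levenshtein distance that gives up early: returns -1 as soon as the
    ' distance provably exceeds maxDist, otherwise the exact distance.
    Function EditDistCapped(ByVal s1 As String, ByVal s2 As String, _
                            ByVal maxDist As Long) As Long
        Dim prev() As Long, curr() As Long
        Dim i As Long, j As Long, cost As Long, rowMin As Long
        ' The length difference alone is a lower bound on the distance.
        If Abs(Len(s1) - Len(s2)) > maxDist Then EditDistCapped = -1: Exit Function
        ReDim prev(0 To Len(s2)): ReDim curr(0 To Len(s2))
        For j = 0 To Len(s2): prev(j) = j: Next j
        For i = 1 To Len(s1)
            curr(0) = i: rowMin = i
            For j = 1 To Len(s2)
                cost = IIf(Mid$(s1, i, 1) = Mid$(s2, j, 1), 0, 1)
                curr(j) = WorksheetFunction.Min(prev(j) + 1, curr(j - 1) + 1, prev(j - 1) + cost)
                If curr(j) < rowMin Then rowMin = curr(j)
            Next j
            ' Row minima never decrease, so once the whole row is past the
            ' cutoff the final distance must be too.
            If rowMin > maxDist Then EditDistCapped = -1: Exit Function
            For j = 0 To Len(s2): prev(j) = curr(j): Next j
        Next i
        EditDistCapped = IIf(prev(Len(s2)) <= maxDist, prev(Len(s2)), -1)
    End Function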

  20. #20
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Thanks shg!!!! Excellent... It seems to work now. I will keep testing different scenarios but am confident that everything works as planned.

    Thanks again.

    abousetta
