
Identifying potential duplicates

  1. #1
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Identifying potential duplicates

    Hi everyone,

    Something that has been on my mind for a while now, but I'm unsure how to approach it. I have data in a column (words, numbers, punctuation) that I have to remove potential duplicates from. If the data is exactly the same, Excel 2007 can easily identify the duplicates using conditional formatting, but when the data is slightly different I don't know of any way to identify them. A simple example would be the use of American vs. British English spelling, or the presence of an extra space between two words. Does anyone know of a way to sort data according to the similarity of the contents of the cells in a column?

    Thanks.

    abousetta
    Last edited by abousetta; 08-03-2010 at 01:31 AM.

  2. #2
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Further random thoughts...

    Word-level:
    Could we compile an array of all the words present in a cell, with the number of times each appears, and compare the arrays from several cells in the same column? [I have absolutely no idea if this is feasible, but it sounds like it could be possible.] I know shg and others have produced macros for counting the words in a text. Could this be extended to allow the comparison of two strings of words?
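
    A minimal sketch of that word-level idea (the function names and scoring are my own assumptions, not code from any attachment): tally the words in each of two strings with a Dictionary, then score how much the tallies overlap.

    ' Count how many times each word occurs in a string (case-insensitive).
    Function WordCounts(ByVal s As String) As Object
        Dim d As Object, w As Variant
        Set d = CreateObject("Scripting.Dictionary")
        d.CompareMode = vbTextCompare
        For Each w In Split(Application.Trim(s), " ")
            d(w) = d(w) + 1
        Next w
        Set WordCounts = d
    End Function

    ' Dice-style overlap: 1 = identical word counts, 0 = nothing in common.
    Function WordOverlap(ByVal s1 As String, ByVal s2 As String) As Double
        Dim d1 As Object, d2 As Object, k As Variant
        Dim nShared As Long, nTotal As Long
        Set d1 = WordCounts(s1): Set d2 = WordCounts(s2)
        For Each k In d1.Keys
            If d2.Exists(k) Then nShared = nShared + WorksheetFunction.Min(d1(k), d2(k))
            nTotal = nTotal + d1(k)
        Next k
        For Each k In d2.Keys: nTotal = nTotal + d2(k): Next k
        If nTotal = 0 Then Exit Function
        WordOverlap = 2 * nShared / nTotal
    End Function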

    Character-level:
    Could we compile (dare I say "array" again) a list of all the characters used in a cell, with the number of times each character is used, and compare this list with other cells?
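
    The character-level variant is the same pattern one level down (again a sketch with assumed names):

    ' Tally each letter/digit in a string, ignoring case, spaces and punctuation.
    Function CharCounts(ByVal s As String) As Object
        Dim d As Object, i As Long, c As String
        Set d = CreateObject("Scripting.Dictionary")
        For i = 1 To Len(s)
            c = LCase$(Mid$(s, i, 1))
            If c Like "[a-z0-9]" Then d(c) = d(c) + 1
        Next i
        Set CharCounts = d
    End Function

    Two cells whose character tallies (nearly) agree would then be flagged as potential duplicates.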

    I have found code written by Excel MVP Rick Rothstein, presented on John Walkenbach's site, that determines whether a particular word is contained in a text string. Could this be modified to bring the word-level idea to life?

    [VBA code omitted: visible to logged-in forum members only.]
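
    Since the snippet is behind the forum login, here is a sketch of a whole-word containment test in the same spirit (an approximation from memory, not necessarily the original):

    ' True if Word appears in Text as a whole word, not as part of a longer word.
    Function ExactWordInString(ByVal Text As String, ByVal Word As String) As Boolean
        ExactWordInString = " " & UCase$(Text) & " " Like "*[!A-Z]" & UCase$(Word) & "[!A-Z]*"
    End Function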
    Any thoughts or ideas are welcome.

    abousetta

  3. #3
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    Using the worksheet TRIM function on the list would eliminate the problem of leading / trailing / multiple interior spaces.
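
    For example, a sketch that assumes the list is in column A of the active sheet:

    ' Worksheet TRIM (unlike VBA's Trim) also collapses runs of interior spaces.
    Sub TrimColumnA()
        Dim c As Range
        For Each c In Range("A1", Cells(Rows.Count, "A").End(xlUp))
            c.Value = Application.Trim(c.Value)
        Next c
    End Sub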

    How long is the list?

    Can you post a large example?
    Entia non sunt multiplicanda sine necessitate ("entities must not be multiplied beyond necessity")

  4. #4
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    The list can be pretty long (from a couple of rows of data to several thousand rows). I have attached an example of data that I got for one journal from PubMed. I would only be comparing the data within each column, not across columns.

    Thanks again.

    abousetta
    Attached Files

  5. #5
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    So give an example of a near-duplicate. In order to be considered, would columns B, C, and D ALL have to be 'close' to one another?

  6. #6
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    I would run the duplicate search on A, B, or C separately: first on column A, checking all its duplicates, then on column B, and so on.

    I have attached a mock sample workbook that shows small differences. I have three sets of examples and have highlighted the differences between the two rows in each set.

    Thanks again.

    abousetta
    Attached Files

  7. #7
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    OK, I'll look at this in the next few days.

  8. #8
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    As always a million thanks shg.

    abousetta

  9. #9
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    Any thoughts on how to proceed? If my thinking is illogical or less intuitive than another approach, I am happy to re-evaluate the proposed ideas.

    Thanks again.

    abousetta

  10. #10
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    Run sub x. It takes a while once it gets to the long entries in col D.
    Attached Files
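
    For readers who can't open the attachment, here is a sketch of the kind of edit-distance (Levenshtein) function a comparison macro like this is typically built on. The name EditDist comes up later in the thread; the body below is my reconstruction, not shg's actual code.

    ' Classic two-row Levenshtein distance: the minimum number of single-
    ' character insertions, deletions and substitutions turning s1 into s2.
    Function EditDist(ByVal s1 As String, ByVal s2 As String) As Long
        Dim prev() As Long, curr() As Long
        Dim i As Long, j As Long, cost As Long
        ReDim prev(0 To Len(s2)): ReDim curr(0 To Len(s2))
        For j = 0 To Len(s2): prev(j) = j: Next j
        For i = 1 To Len(s1)
            curr(0) = i
            For j = 1 To Len(s2)
                cost = IIf(Mid$(s1, i, 1) = Mid$(s2, j, 1), 0, 1)
                curr(j) = WorksheetFunction.Min(prev(j) + 1, curr(j - 1) + 1, prev(j - 1) + cost)
            Next j
            For j = 0 To Len(s2): prev(j) = curr(j): Next j
        Next i
        EditDist = prev(Len(s2))
    End Function

    A driver such as sub x would then compare every pair of cells in a column and report the pairs whose distance falls within MAXDIST.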

  11. #11
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Thanks shg... I will give it a test run and let you know how it works.

    Have a great weekend!!!

    abousetta

  12. #12
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    Thanks for working on this. I noticed one problem with the macro: it seems to skip comparisons of cells within the same column. I tested it with a new dummy set of letters, numbers, and punctuation, and noticed that punctuation (e.g. commas, periods, etc.) completely throws it off; I don't know how to work around this.

    The example below works perfectly:

    Number Col_1
    1 a
    2 ab
    3 abc
    4 abcd
    5 abcde

    Then I started to add punctuation, spaces, etc., and things fell apart. The macro seems to compare each cell with the next cell only and to ignore all other cells in the same column.

    I have an idea to bounce off you that might be a workaround. Please let me know what you think...

    We could create a copy of the worksheet on a temporary sheet and remove all spaces and punctuation, so that each cell holds only a single continuous string of letters and numbers. Then have the macro compare the cells in the same column and present the results in a new worksheet (which it does now), and finally delete the temporary worksheet. What do you think? Or is there a way to leave everything as is, but direct the macro to ignore spaces and punctuation when comparing the strings?
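
    The "ignore spaces and punctuation" route would not even need a temporary sheet; a small normalising function (a sketch, the name is my own) could be applied to both strings just before they are compared:

    ' Keep only letters and digits, lower-cased, so spacing and punctuation
    ' differences disappear before two strings are compared.
    Function NormalizeText(ByVal s As String) As String
        Dim i As Long, c As String, out As String
        For i = 1 To Len(s)
            c = LCase$(Mid$(s, i, 1))
            If c Like "[a-z0-9]" Then out = out & c
        Next i
        NormalizeText = out
    End Function

    The macro would then call EditDist(NormalizeText(a), NormalizeText(b)) instead of EditDist(a, b).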

    Thanks again shg!!!

    abousetta

    CORRECTION: There is no error. Check my later post to see what I was doing wrong.
    Attached Files
    Last edited by abousetta; 08-02-2010 at 02:41 AM. Reason: There was in fact no bug in the macro.

  13. #13
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    I don't know why you say it's wrong. It reports the edit distances accurately.

    Quote: "The macro seems to compare the cell with the next cell only and ignores all other cells in the same column."
    Change the constant MAXDIST to something larger than its current value, which is 4.
    Last edited by shg; 08-01-2010 at 02:23 PM.

  14. #14
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    I apologize for wrongly identifying an error when in fact there was none. You are correct: the problem was not the punctuation, as I had thought, but simply the MAXDIST value, as you pointed out. When I increased it and re-ran the macro, it worked correctly. You were thinking one step ahead of me when you wrote the macro: once two strings are more than a few edits apart, the probability that they are duplicates is minute, so there is no reason for the macro to keep comparing strings that are obviously not the same.

    Once again a thousand apologies and a million thanks for a job well done.

    abousetta

  15. #15
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    Just a question regarding efficiency. Would there be a way to make the function EditDist stop once MAXDIST has been reached? I tested the macro with large strings, and it takes roughly the same amount of time to compare two strings whether they differ by one deletion or by 100 deletions. In the former case the macro prints the result; in the latter it just discards the pair and moves on to the next comparison. Is there a way to stop a comparison from running once it reaches MAXDIST, to speed up the process?

    Thanks for your insight.

    abousetta

  16. #16
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    Replace the code with this:
    [VBA code omitted: visible to logged-in forum members only.]

  17. #17
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Hi shg,

    Thanks for all your help. I have run it and compared the results, but I must be doing something wrong. The modified macro works just like the original when MAXDIST is a very large number (e.g. 10000000), so that all changes are captured. When I drop it down to a realistic number, it does not return anything.

    Example: in the attachment I have two nearly identical strings to compare, with only one difference (the first letter of the second string is deleted). If I use the large MAXDIST, it returns the value 1, which is correct. When I change MAXDIST to, say, 6, it returns a blank, which as I understand it means the number of changes is above 6; but that isn't actually the case.

    Could you please help me understand what I am doing wrong?

    Thanks again.

    abousetta
    Attached Files

  18. #18
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    I think it is the case that I don't understand the EditDist algorithm well enough to short-circuit it as I tried to do. I don't have another suggestion ATM, abousetta, sorry.

  19. #19
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2003, 2010
    Posts
    40,678

    Re: Identifying potential duplicates

    Actually, try this:
    [VBA code omitted: visible to logged-in forum members only.]
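
    That replacement is also behind the forum login, so here is a sketch of a short-circuiting edit distance that is safe to cut off early (the name EditDistCapped and the -1 sentinel are my own choices): after each row of the dynamic-programming table is filled in, it checks the row minimum, which can never decrease from one row to the next.

    ' Levenshtein distance that gives up early: returns -1 as soon as the
    ' distance provably exceeds maxDist, otherwise the exact distance.
    Function EditDistCapped(ByVal s1 As String, ByVal s2 As String, _
                            ByVal maxDist As Long) As Long
        Dim prev() As Long, curr() As Long
        Dim i As Long, j As Long, cost As Long, rowMin As Long
        ' The length difference alone is a lower bound on the distance.
        If Abs(Len(s1) - Len(s2)) > maxDist Then EditDistCapped = -1: Exit Function
        ReDim prev(0 To Len(s2)): ReDim curr(0 To Len(s2))
        For j = 0 To Len(s2): prev(j) = j: Next j
        For i = 1 To Len(s1)
            curr(0) = i: rowMin = i
            For j = 1 To Len(s2)
                cost = IIf(Mid$(s1, i, 1) = Mid$(s2, j, 1), 0, 1)
                curr(j) = WorksheetFunction.Min(prev(j) + 1, curr(j - 1) + 1, prev(j - 1) + cost)
                If curr(j) < rowMin Then rowMin = curr(j)
            Next j
            ' Row minima never decrease, so once the whole row is past the
            ' cutoff the final distance must be too.
            If rowMin > maxDist Then EditDistCapped = -1: Exit Function
            For j = 0 To Len(s2): prev(j) = curr(j): Next j
        Next i
        EditDistCapped = IIf(prev(Len(s2)) <= maxDist, prev(Len(s2)), -1)
    End Function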

  20. #20
    Forum Guru
    Join Date
    03-12-2010
    Location
    Canada
    MS-Off Ver
    2010 and 2013
    Posts
    4,418

    Re: Identifying potential duplicates

    Thanks shg!!!! Excellent... It seems to work now. I will keep testing different scenarios but am confident that everything works as planned.

    Thanks again.

    abousetta
