Hi all,
I have a column with 10,000+ lines, and lots of them are duplicated.
I have a formula that tells me how many times they are duplicated, but it only looks for 100% matching cells.
Now, lots of cells are partially duplicated, for example:
A1 - Team Abcde
A2 - Team Abcde Fghi
B1 - Team Abc
I would like to have a formula giving me a percentage of how two cells are similar to each others by checking the words.
In this case, B1 is 50% similar to A1 because only the word 'Team' is duplicated.
B1 is only 33,34% similar to A2.
I hope any of you can give me a hand on this, thanks :-)
Last edited by paokun; 04-26-2010 at 11:04 AM.
disregard this post...I had a formula here but it didn't work.
Last edited by Rebuild8; 04-19-2010 at 10:39 AM.
I was able to do this with a Levenshtein formula. I borrowed the VBA algorithm from here:
http://en.wikibooks.org/wiki/Algorit...htein_distance
Insert a module into your workbook by hitting Alt-F11, right-click vbaproject, insert > module. Paste this code into the window:
Now you can use this formula to compare A1 and B1:Function levenshtein(a As String, b As String) As Integer Dim i As Integer Dim j As Integer Dim cost As Integer Dim d() As Integer Dim min1 As Integer Dim min2 As Integer Dim min3 As Integer If Len(a) = 0 Then levenshtein = Len(b) Exit Function End If If Len(b) = 0 Then levenshtein = Len(a) Exit Function End If ReDim d(Len(a), Len(b)) For i = 0 To Len(a) d(i, 0) = i Next For j = 0 To Len(b) d(0, j) = j Next For i = 1 To Len(a) For j = 1 To Len(b) If Mid(a, i, 1) = Mid(b, j, 1) Then cost = 0 Else cost = 1 End If min1 = (d(i - 1, j) + 1) min2 = (d(i, j - 1) + 1) min3 = (d(i - 1, j - 1) + cost) d(i, j) = Application.WorksheetFunction.Min(min1, min2, min3) Next Next levenshtein = d(Len(a), Len(b)) End Function
This might compare a bit more accurately than you were asking for, as it is on a character-by-character basis rather than word-by-word (among other things). This gives matches of 80% and 53.33%, but I think it should suit your purposes.=(MAX(LEN(A1), LEN(B1)) - levenshtein(A1,B1))/MAX(LEN(A1), LEN(B1))
Hi, thanks a lot for your reply!
I've tried the formula and it works, thing is that when I have, for example:
A1 - A look at the sea.
B1 - kkkkkkkk kkkkkk kkk
It gives me a 10.53% match, but I'd like to have 0%.
Also I would prefer to check by words. For example:
A1 - A look at the sea.
B1 - Beach look at the a.
Without comparing the words order, out of 5 words, only 4 match, so I'd like to have 80% as a result.
Pretty complicated I guess, but since you are geniuses I'm sure it's worth asking.
Bump.. anyone?
I think you need to clarify what happens regards
a) punctuation, eg A and a. are not a match
b) case sensitivity (eg A and a are not a match per se)
and what happens in the case of say a common word with differing frequencies, eg:
A1: my the weather is lovely
B1: my oh my what wonderful weather
In this instance what is the % output ?
Given B seemingly takes precedence in calculating % of the 6 words used in B1 3 are found, however, "my" appears with differing frequency - once in A1 and twice in B1.
My Recommended Reading:
Volatility
Sumproduct & Arrays
Pivot Intro
Email from XL - VBA & Outlook VBA
Function Dictionary & Function Translations
Dynamic Named Ranges
Hi,
Thanks a lot for your reply and questions!
a) punctuation, eg A and a. are not a match
No need to consider punctuation.
b) case sensitivity (eg A and a are not a match per se)
'Aurora' and 'aurora' should be considered different words.
Regarding the frequency words appear, for example:
A1: my the weather is lovely
B1: my oh my what wonderful weather
Each repeated word can be considered only once. So, B1 has 5 words, two of which appear in A1 too, so the output should be 40%.
Still, complicated, but I appreciate any help! Thanks a lot.
The case sensitivity makes this a little more complex I think...
the above, stored in a module, could be called from a cell along the lines of:Function CompareString(rngS1 As Range, rngS2 As Range, Optional boolCase As Boolean = True) As Double Dim vW1, vW2, oDic As Object, lngW As Long, lngU As Long, lngM As Long, strTemp As String vW1 = Split(rngS1, " ") vW2 = Split(rngS2, " ") Set oDic = CreateObject("Scripting.Dictionary") For lngW = LBound(vW2) To UBound(vW2) Step 1 strTemp = Replace(vW2(lngW), ".", "") With oDic If Not .exists(strTemp) Then lngU = lngU + 1 .Add strTemp, lngU If boolCase Then lngM = lngM + rngS1.Parent.Evaluate("SUMPRODUCT(--ISNUMBER(FIND("" " & strTemp & " "","" ""&SUBSTITUTE(" & rngS1.Address & ",""."","""")&"" "")))") Else lngM = lngM - IsNumeric(Application.Match(strTemp,vW1,0)) End If End If End With Next lngW Set oDic = Nothing CompareString = lngM / lngU End Function
this would generate the results outlined for your examples with the exception of:=COMPARESTRING(A1,B1)
which given stipulated case sensitivity should actually return 60% given a <> A.Originally Posted by paokun
The UDF has optional third parameter (Boolean) for Case Sensitivity - default being True - if explicitly set to False in the call you would get the 80%, eg:
I hope that helps.=COMPARESTRING(A1,B1,FALSE)
Note: in the UDF only a basic punctuation assumption is made (eg period delimiter) - pending complexity of your real strings you will most likely need to elaborate on this part.
My Recommended Reading:
Volatility
Sumproduct & Arrays
Pivot Intro
Email from XL - VBA & Outlook VBA
Function Dictionary & Function Translations
Dynamic Named Ranges
Wow, this is fantastic stuff!
Thanks a lot for your quick reply, I really appreciate it.
It works like a charm, and being able to use the 'false' tag is also extremely helpful.
I've got one last question. Would it be possible to compare B1 with a range of cells in A, and show as a result the highest matching percentage?
Edit: .. and show as a result the highest matching percentage and the best matching cell?
For example:
C1 - 86%
D1 - my the weather is lovely (<- text content of the most similar cell)
Last edited by paokun; 04-21-2010 at 08:38 AM.
Edit: Unnecessary.
Last edited by SpeedingLunatic; 04-21-2010 at 11:27 AM.
I'm not quite sure what you're asking for... if you want the UDF to accept multiple strings for comparative purposes then perhaps the below adaptation might help ?
the above is now called along the lines of:Function CompareString(rngS1 As Range, rngS2 As Range, strType As String, Optional boolCase As Boolean = True) As Variant Dim vW1, vW2 Dim oDic As Object Dim lngW As Long, lngU As Long, lngM As Long, lngTemp As Long, rngCell As Range Dim strTemp As String, strB As String vW2 = Split(rngS2.Text, " ") Set oDic = CreateObject("Scripting.Dictionary") For lngW = LBound(vW2) To UBound(vW2) Step 1 strTemp = Replace(vW2(lngW), ".", "") With oDic If Not .exists(strTemp) Then lngU = lngU + 1 .Add strTemp, lngU End If End With Next lngW Set oDic = Nothing For Each rngCell In rngS1.Cells If rngCell.Value <> "" Then If rngCell.Value = strTemp Then lngM = lngU strB = strTemp Else vW1 = Split(rngCell.Text, " ") lngTemp = 0 For lngW = LBound(vW2) To UBound(vW2) Step 1 strTemp = vW2(lngW) If boolCase Then lngTemp = lngTemp + rngS1.Parent.Evaluate("SUMPRODUCT(--ISNUMBER(FIND("" " & strTemp & " "","" ""&SUBSTITUTE(" & rngCell.Address & ",""."","""")&"" "")))") Else lngTemp = lngTemp - IsNumeric(Application.Match(strTemp, vW1, 0)) End If Next lngW If lngTemp > lngM Then lngM = lngTemp strB = rngCell.Text End If End If End If Next rngCell Select Case UCase(strType) Case "P" CompareString = lngM / lngU Case "S" CompareString = strB End Select End Function
to return the greatest percentageC1: =COMPARESTRING($A$1:$A$4,$B$1,"P")
to return the associated stringD1: =COMPARESTRING($A$1:$A$4,$B$1,"S")
In the case of matching "max" % the first string is returned and as such the above code would stop should an exact match be found - ie it won't iterate through additional cells unnecessarily.
As before you have the optional Boolean as final parameter to denote case sensitivity (default is TRUE - ie Case Sensitive matching)
(edit: and apologies for slight delay in responding - I tend to pop in and out of the forum in the afternoons)
Last edited by DonkeyOte; 04-21-2010 at 10:20 AM.
My Recommended Reading:
Volatility
Sumproduct & Arrays
Pivot Intro
Email from XL - VBA & Outlook VBA
Function Dictionary & Function Translations
Dynamic Named Ranges
Hi, apologies from my side for the delay, I've been out on a business trip and didn't get a minute to access the forum.
I've just tried your solution and it works like a charm, you're a genius! I'm impressed, just copying and pasting your code makes the trick.
Thanks a lot, I really appreciate your big help! :-)
Edit!
If I compare exactly the same sentence:
A1 - Mark scored an absolutely great point.
B1 - Mark scored an absolutely great point.
it gives me 83.33%. Removing the dot in the end would give me a 100% match.
Would it be possible to have 100% even with the dots?
Also, I've got some cells with line-breaks:
A1 - Mark scored an absolutely
great point.
B1 - Mark scored an absolutely great point.
Although the content is a 100%, because of the line-break the percentage goes down to 50%.
Also, the result text I get in D1 is:
D1 - Mark scored an absolutelygreat point.
It would be best to have it on two lines like A1.
Sorry to keep pestering you, but we're almost there!
Going forward it would be a lot easier if you posted sample files so we can ensure we're both dealing with the same variations.
Based on your post perhaps:
Note however that regards:Function CompareString(rngS1 As Range, rngS2 As Range, strType As String, Optional boolCase As Boolean = True) As Variant Dim vW1, vW2 Dim oDic As Object Dim lngW As Long, lngU As Long, lngM As Long, lngTemp As Long, rngCell As Range Dim strTemp As String, strC As String, strB As String vW2 = Split(Application.WorksheetFunction.Trim(Replace(Replace(rngS2.Text, ".", ""), Chr(10), " ")), " ") Set oDic = CreateObject("Scripting.Dictionary") For lngW = LBound(vW2) To UBound(vW2) Step 1 strTemp = vW2(lngW) With oDic If Not .exists(strTemp) Then lngU = lngU + 1 .Add strTemp, lngU End If End With Next lngW Set oDic = Nothing For Each rngCell In rngS1.Cells strC = Application.WorksheetFunction.Trim(Replace(Replace(rngCell.Text, ".", ""), Chr(10), " ")) If strC <> "" Then If strC = rngS2.Text Then lngM = lngU strB = rngS2.Text Else vW1 = Split(strC, " ") lngTemp = 0 For lngW = LBound(vW2) To UBound(vW2) Step 1 strTemp = vW2(lngW) If boolCase Then lngTemp = lngTemp + rngS1.Parent.Evaluate("SUMPRODUCT(--ISNUMBER(FIND("" " & strTemp & " "","" " & strC & " "")))") Else lngTemp = lngTemp - IsNumeric(Application.Match(strTemp, vW1, 0)) End If Next lngW If lngTemp > lngM Then lngM = lngTemp strB = rngCell.Text End If End If End If Next rngCell Select Case UCase(strType) Case "P" CompareString = lngM / lngU Case "S" CompareString = strB End Select End Function
this can only be achieved by altering the word-wrap setting on the cell itself (via Formatting) - once done the result will be per A1 (ie the line break is present it's simply not visible given the cell formatting) ... the UDF can not alter the cell's format.Originally Posted by paokun
My Recommended Reading:
Volatility
Sumproduct & Arrays
Pivot Intro
Email from XL - VBA & Outlook VBA
Function Dictionary & Function Translations
Dynamic Named Ranges
Thank you very much again, that was quick!
Now it's perfect, it does exactly what I need.
Thank you again very much for your big support :-)
There are currently 1 users browsing this thread. (0 members and 1 guests)
Bookmarks