I have a single column where each cell contains multiple words in any order.
I need to find which group of N words in a cell exist in the greatest number of rows (or cells for this 1-column case).
Example: (using letters for simplicity)
In the data below, each row of letters is one cell.
I want to determine which 3 letters in a cell exist in the greatest number of cells?
I believe B, E, and F (rows 1, 3 and last row).
Edit: (also B, F, H in rows 1, 4 and last row - discovered later)
F H E B A
D B A G
E B A F C
B H C F
C G D
D F A H
G E F D
C B H D A
B H F E D
Like this example, each cell contains only a few words (maybe 10 max), but there are many different words and hundreds of rows, however all the words could be extracted and individually listed if necessary, e.g. used in a lookup table, separate array, etc.
Any insights appreciated!
Last edited by Buzzed Aldrin; 02-13-2012 at 06:26 AM. Reason: Clarify
Pl see attached file.
Hi Aldrin,
Try looking at Advanced Filter feature of Excel with wildcard usage. Thanks.
Regards,
DILIPandey
<click on below 'star' if this helps>
DILIPandey
+919810929744
dilipandey@gmail.com
@kvsrini... That works when the words are known, but in my application the words to specify in the formula are unknown. I just used B,E,F to illustrate, because I already "knew" that those letters occurred most often (3 rows).
Even with the tiny example data set, it's not a trivial thing to manually consider all the 3-item combinations for each row, which is why I said I "believe" B,E,F occurs most oftenMy problem is the "reverse" situation - I need a way to find out the most common group of items.
So the example can be worded better: Which 3-letter group exists in the most rows? Answer: B,E,F
@dilipandy Never considered Advanced Filter, thanks for mentioning, will check it out.
Hi
is the range of possible words fixed? otherwise you quicky hit a lot of permutations - a list of 100 unique words contains 161,700 possible combinations of 3 different words.
also, are there just words, or other information too (numbers, punctuation etc)
The range of words is not fixed/limited, but there's only a few words per cell (say 3-10). Yes, the permutative load becomes quickly apparent just mentally going through the dinky example dataset.is the range of possible words fixed? otherwise you quicky hit a lot of permutations - a list of 100 unique words contains 161,700 possible combinations of 3 different words.
No, strictly text words, no punctuation or junk characters.also, are there just words, or other information too (numbers, punctuation etc)
Hi
I have probably made this far more complex and ugly than it needs to be, but it seems to work. The trick seems to be to eliminate words that appear infrequently to reduce the word list to a manageable level. If all words occur with about the same frequency this will not work.
You need to have your source words in a worksheet named "source" and a separate blank sheet in the workbook called "words" to list and manipulate data.
the macros should run in the order listed (or the macro a_all_macros will run them for you):
a_words_list
copies each word in the cells in sheet "source" to a separate row in sheet "words"
b_sort_words
sorts them alphabetically
c_count_frequencies
counts the frequencies with which each word occurs
d_delete_duplicates
deletes all words that occur only once (these can't be the most frequent) and all duplicates, leaving just one instance of each word in the list
e_sort_by_frequency
sorts the remaining words by descending order of frequency
f_find_most_frequent_delete_others
for the top 3 most frequent words, counts the number of occasions on which all three words occur in one cell in sheet "source". Deletes all words that occur less frequently than this (they cannot be in the most frequent 3)
hopefully, this will reduce the list of words to a manageable length (20 or less words). Otherwise, it will probably not work.
g_check_nos
If so, the final macro determines all permutations of 3 words in the list and checks their frequency in sheet “source”, returning the most frequent
Sub a_all_macros() a_words_list b_sort_words c_count_frequencies d_delete_duplicates e_sort_by_frequency f_find_most_frequent_delete_others g_check_nos End Sub Sub a_words_list() Dim arnos As Variant, n As Long, sn As Long, MyWord As Variant Sheets("words").Range("A1:b1").EntireColumn.Delete sn = 1 For Each cell In Sheets("source").UsedRange arrNos = Split(cell.Value, " ") For n = LBound(arrNos) To UBound(arrNos) MyWord = LCase(Trim(arrNos(n))) If Len(MyWord) > 0 Then Sheets("words").Columns(1).Rows(sn).Value = MyWord sn = 1 + sn End If Next n Next cell Sheets("words").Select 'sort_and_delete_duplicates End Sub Sub b_sort_words() Dim LastRow As Long, SortRange, wn As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row SortRange = Range("A1", Columns(1).Rows(LastRow).Address).Address Range(SortRange).Sort Key1:=Range("A1"), Order1:=xlAscending, Header:=xlNo, OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom, DataOption1:=xlSortNormal End Sub Sub c_count_frequencies() Dim LastRow As Long, SortRange, wn As Long, az As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row SortRange = Range("A1", Columns(1).Rows(LastRow).Address).Address For az = 1 To LastRow 'Sheets("words").Columns(2).Rows(az).Formula = "=countif(" & SortRange & ",A" & az & ")" Sheets("words").Columns(2).Rows(az).Value = Application.WorksheetFunction.CountIf(Range("A1", Columns(1).Rows(LastRow).Address), Sheets("words").Columns(1).Rows(az).Value) Next az End Sub Sub d_delete_duplicates() Dim LastRow As Long, qwn As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row Application.EnableCancelKey = xlDisabled 'delete singletons For qwn = LastRow To 1 Step -1 If Sheets("words").Columns(2).Rows(qwn).Value = 1 Then Sheets("words").Rows(qwn).EntireRow.Delete Next qwn 'delete duplicates LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row For qwn = LastRow To 2 Step -1 If Sheets("words").Columns(1).Rows(qwn).Value = Sheets("words").Columns(1).Rows(qwn - 1).Value Then Sheets("words").Columns(1).Rows(qwn).EntireRow.Delete Next qwn Application.EnableCancelKey = xlenabled End Sub Sub e_sort_by_frequency() Dim LastRow As Long, SortRange, wn As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row SortRange = Range("A1", Columns(2).Rows(LastRow).Address).Address Range(SortRange).Sort Key1:=Range("b1"), Order1:=xlDescending, Header:=xlNo, OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom, DataOption1:=xlSortNormal End Sub Sub f_find_most_frequent_delete_others() Dim LastRow As Long, stp As Boolean, arnos As Variant, n As Long, MyWord As Variant, sn As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row MaxNo = 0 txt = "" For Z = 1 To 10 Word1 = Sheets("words").Columns(1).Cells(1).Value word2 = Sheets("words").Columns(1).Cells(Z + 1).Value word3 = Sheets("words").Columns(1).Cells(Z + 2).Value fn = 0 For Each cell In Sheets("source").UsedRange tc1 = 0 tc2 = 0 tc3 = 0 arrNos = Split(cell.Value, " ") For n = LBound(arrNos) To UBound(arrNos) MyWord = LCase(Trim(arrNos(n))) If Len(MyWord) > 0 Then If MyWord = Word1 Then tc1 = tc1 + 1 If MyWord = word2 Then tc2 = tc2 + 1 If MyWord = word3 Then tc3 = tc3 + 1 End If Next n If tc1 > 0 And tc2 > 0 And tc3 > 0 Then fn = fn + 1 Next cell 'Delete words with lower frequency Application.EnableCancelKey = xlDisabled 'delete singletons For qwn = LastRow To 1 Step -1 If Sheets("words").Columns(2).Rows(qwn).Value < fn Then Sheets("words").Rows(qwn).EntireRow.Delete Next qwn Application.EnableCancelKey = xlenabled If fn > MaxNo Then MaxNo = fn txt = Word1 & " " & word2 & " " & word3 End If Next Z End Sub Sub g_check_nos() Dim LastRow As Long, aa As Long, bb As Long, cc As Long, fn As Long, MaxNo As Long, txt As String, res As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row If LastRow > 20 Then res = MsgBox("the list contains more than 20 words and will take a long time to process, are you sure you wish to proceed?", 4) If res = 7 Then End MaxNo = 0 For aa = 1 To LastRow Word1 = Sheets("words").Columns(1).Rows(aa).Value For bb = aa + 1 To LastRow word2 = Sheets("words").Columns(1).Rows(bb).Value For cc = bb + 1 To LastRow word3 = Sheets("words").Columns(1).Rows(cc).Value fn = 0 For Each cell In Sheets("source").UsedRange tc1 = 0 tc2 = 0 tc3 = 0 arrNos = Split(cell.Value, " ") For n = LBound(arrNos) To UBound(arrNos) MyWord = LCase(Trim(arrNos(n))) If Len(MyWord) > 0 Then If MyWord = Word1 Then tc1 = tc1 + 1 If MyWord = word2 Then tc2 = tc2 + 1 If MyWord = word3 Then tc3 = tc3 + 1 End If Next n If tc1 > 0 And tc2 > 0 And tc3 > 0 Then fn = fn + 1 If fn > MaxNo Then MaxNo = fn txt = Word1 & ", " & word2 & ", " & word3 End If Next cell Next cc Next bb Next aa MsgBox "these three words occur in " & MaxNo & " cells: " & txt End Sub
Last edited by NickyC; 02-06-2012 at 10:43 PM.
Sorry for not getting back on this.
I've had to reformat my HD and am now picking up the pieces
Will get back asap.
@NickyC - Holy cow, hope you didn't feel compelled to find a solution after getting sucked into a challenge. Big thanks for the effort even if it doesn't work!
When I run the macro, I'm getting "Run-time error '9': Subscript out of range"
Problem with code or the way I'm running it?
I have essentially zero VB experience so I can't (intelligently) mod your code.
Hi
can you check that the workbook has sheets named "source" and "words"
I have tweaked the code, which may enable it to work. if not, can you check to see at what point it breaks down (the row where it gets stuck in the VBA should be highlighted in yellow).
Sub a_all_macros() a_words_list b_sort_words c_count_frequencies d_delete_duplicates e_sort_by_frequency f_find_most_frequent_delete_others g_check_nos End Sub Sub a_words_list() Dim arrnos As Variant, n As Long, sn As Long, MyWord As Variant, Xcell Sheets("words").Range("A1:b1").EntireColumn.Delete sn = 1 For Each Xcell In Sheets("source").UsedRange.Cells arrnos = Split(Xcell.Value, " ") For n = LBound(arrnos) To UBound(arrnos) MyWord = LCase(Trim(arrnos(n))) If Len(MyWord) > 0 Then Sheets("words").Columns(1).Rows(sn).Value = MyWord sn = 1 + sn End If Next n Next Xcell Sheets("words").Select 'sort_and_delete_duplicates End Sub Sub b_sort_words() Dim LastRow As Long, SortRange, wn As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row SortRange = Range("A1", Columns(1).Rows(LastRow).Address).Address Range(SortRange).Sort Key1:=Range("A1"), Order1:=xlAscending, Header:=xlNo, OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom, DataOption1:=xlSortNormal End Sub Sub c_count_frequencies() Dim LastRow As Long, SortRange, wn As Long, az As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row SortRange = Range("A1", Columns(1).Rows(LastRow).Address).Address For az = 1 To LastRow 'Sheets("words").Columns(2).Rows(az).Formula = "=countif(" & SortRange & ",A" & az & ")" Sheets("words").Columns(2).Rows(az).Value = Application.WorksheetFunction.CountIf(Range("A1", Columns(1).Rows(LastRow).Address), Sheets("words").Columns(1).Rows(az).Value) Next az End Sub Sub d_delete_duplicates() Dim LastRow As Long, qwn As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row Application.EnableCancelKey = xlDisabled 'delete singletons For qwn = LastRow To 1 Step -1 If Sheets("words").Columns(2).Rows(qwn).Value = 1 Then Sheets("words").Rows(qwn).EntireRow.Delete Next qwn 'delete duplicates LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row For qwn = LastRow To 2 Step -1 If Sheets("words").Columns(1).Rows(qwn).Value = Sheets("words").Columns(1).Rows(qwn - 1).Value Then Sheets("words").Columns(1).Rows(qwn).EntireRow.Delete Next qwn Application.EnableCancelKey = xlInterrupt End Sub Sub e_sort_by_frequency() Dim LastRow As Long, SortRange, wn As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row SortRange = Range("A1", Columns(2).Rows(LastRow).Address).Address Range(SortRange).Sort Key1:=Range("b1"), Order1:=xlDescending, Header:=xlNo, OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom, DataOption1:=xlSortNormal End Sub Sub f_find_most_frequent_delete_others() Dim LastRow As Long, stp As Boolean, arrnos As Variant, n As Long, MyWord As Variant, sn As Long, MaxNo As Long, txt As String, z As Long, word1 As String, word2 As String, word3 As String, zcell Dim tc1 As Long, tc2 As Long, tc3 As Long, fn As Long, qwn As Long LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row MaxNo = 0 txt = "" For z = 1 To 10 word1 = Sheets("words").Columns(1).Cells(1).Value word2 = Sheets("words").Columns(1).Cells(z + 1).Value word3 = Sheets("words").Columns(1).Cells(z + 2).Value fn = 0 For Each zcell In Sheets("source").UsedRange.Cells tc1 = 0 tc2 = 0 tc3 = 0 arrnos = Split(zcell.Value, " ") For n = LBound(arrnos) To UBound(arrnos) MyWord = LCase(Trim(arrnos(n))) If Len(MyWord) > 0 Then If MyWord = word1 Then tc1 = tc1 + 1 If MyWord = word2 Then tc2 = tc2 + 1 If MyWord = word3 Then tc3 = tc3 + 1 End If Next n If tc1 > 0 And tc2 > 0 And tc3 > 0 Then fn = fn + 1 Next zcell 'Delete words with lower frequency Application.EnableCancelKey = xlDisabled 'delete singletons For qwn = LastRow To 1 Step -1 If Sheets("words").Columns(2).Rows(qwn).Value < fn Then Sheets("words").Rows(qwn).EntireRow.Delete Next qwn Application.EnableCancelKey = xlInterrupt If fn > MaxNo Then MaxNo = fn txt = word1 & " " & word2 & " " & word3 End If Next z End Sub Sub g_check_nos() Dim LastRow As Long, aa As Long, bb As Long, cc As Long, fn As Long, MaxNo As Long, txt As String, res As Long, zcell, n As Long Dim word1 As String, word2 As String, word3 As String, tc1 As Long, tc2 As Long, tc3 As Long, arrnos As Variant, MyWord As String LastRow = Sheets("words").Cells(Rows.Count, 1).End(xlUp).Row If LastRow > 20 Then res = MsgBox("the list contains more than 20 words and will take a long time to process, are you sure you wish to proceed?", 4) If res = 7 Then End MaxNo = 0 For aa = 1 To LastRow word1 = Sheets("words").Columns(1).Rows(aa).Value For bb = aa + 1 To LastRow word2 = Sheets("words").Columns(1).Rows(bb).Value For cc = bb + 1 To LastRow word3 = Sheets("words").Columns(1).Rows(cc).Value fn = 0 For Each zcell In Sheets("source").UsedRange.Cells tc1 = 0 tc2 = 0 tc3 = 0 arrnos = Split(zcell.Value, " ") For n = LBound(arrnos) To UBound(arrnos) MyWord = LCase(Trim(arrnos(n))) If Len(MyWord) > 0 Then If MyWord = word1 Then tc1 = tc1 + 1 If MyWord = word2 Then tc2 = tc2 + 1 If MyWord = word3 Then tc3 = tc3 + 1 End If Next n If tc1 > 0 And tc2 > 0 And tc3 > 0 Then fn = fn + 1 If fn > MaxNo Then MaxNo = fn txt = word1 & ", " & word2 & ", " & word3 End If Next zcell Next cc Next bb Next aa MsgBox "these three words occur in " & MaxNo & " cells: " & txt End Sub
Hi
Slightly different approach. This will allow you to nominate the number of words in your groupings. I've only programmed for 3 and 4 word combinations just so you get the idea.
I've also hard coded the range to process - again just to give you an idea. This can be automated, but the output will have to be put into a different place.
Using the above for a 3 word group, I noticed that F, B and H also have 3 rows...Sub aaa() Dim dic As Object Set dic = CreateObject("Scripting.dictionary") 'nominate the number of words in the group wordcnt = 4 Set srchrng = Range("A1:A9") Range("E:I").ClearContents For Each ce In srchrng arr = Split(ce, " ") For i = LBound(arr) To UBound(arr) If Not dic.exists(arr(i)) Then dic.Add Item:=1, Key:=arr(i) Else dic(arr(i)) = dic(arr(i)) + 1 End If Next i Next ce For Each ce In dic outrow = Cells(Rows.Count, "E").End(xlUp).Offset(1, 0).Row Cells(outrow, "E").Value = ce Cells(outrow, "F").Value = dic(ce) Next ce Range("E:F").Sort Key1:=Range("F1"), Order1:=xlDescending, Header:=xlNo 'build work groups Select Case wordcnt Case 3 x = Cells(Rows.Count, "E").End(xlUp).Row For i = 1 To x - 2 For j = i + 1 To x - 1 For k = j + 2 To x Cells(Rows.Count, "H").End(xlUp).Offset(1, 0).Value = Cells(i, "E") & " " & Cells(j, "E") & " " & Cells(k, "E") Next k Next j Next i Case 4 x = Cells(Rows.Count, "E").End(xlUp).Row For i = 1 To x - 3 For j = i + 1 To x - 2 For k = j + 1 To x - 1 For l = k + 1 To x Cells(Rows.Count, "H").End(xlUp).Offset(1, 0).Value = Cells(i, "E") & " " & Cells(j, "E") & " " & Cells(k, "E") & " " & Cells(l, "E") Next l Next k Next j Next i End Select 'process word groups For Each ce In Range("H2:H" & Cells(Rows.Count, "H").End(xlUp).Row) appearcnt = 0 For Each ce2 In srchrng foundit = True wordarr = Split(ce, " ") For i = LBound(wordarr) To UBound(wordarr) If InStr(1, ce2, wordarr(i)) = 0 Then foundit = False Next i If foundit Then appearcnt = appearcnt + 1 Next ce2 ce.Offset(0, 1).Value = appearcnt Next ce End Sub
rylo
Last edited by rylo; 02-12-2012 at 11:43 PM. Reason: realised it had to cover words not single characters...
@rylo and NickyC => both versions, outstanding!
Yep - F,B,H is in there 3 times also, shows how tedious and error-prone this analysis would be if attempted manually on even a small data set.
2- and 3-word macros are probably all I need; I can follow your codes and mod for other N-values if nec.
Thank you very much - This would have taken me a long time, if even feasible, using formulas and a rat's nest of helper columns.
@rylo - In your code, for Case 3 in the Select Case section, "For k = j + 2 To x" should be "For k = j + 1 To x" correct?
Buzzed
Correct. My bad. When I first did this, I realised the mistake and corrected for Case 4, but didn't see the problem for case 3.
rylo
There are currently 1 users browsing this thread. (0 members and 1 guests)
Bookmarks