+ Reply to Thread
Page 1 of 2 12 LastLast
Results 1 to 15 of 20
  1. #1
    Valued Forum Contributor
    Join Date
    11-12-2007
    Location
    Germany
    MS-Off Ver
    2007
    Posts
    389

    Remove duplicates from col.A and merge col.B

    Hello,

    i have a wordlist containing about 0.5 Million words in col.A and Millions of meanings in Col.B. Now in Col. A there are many duplicates. I would like to make one cell of all duplicates in col.A and put their meanings together in B with seperator ";". Its very important that there are no duplicates in Col. B.

    For example:
    Col. A Col.B
    w1 m1;m2;m3
    w1 m1;m2;m5
    w2 m23;m24
    w2 m50

    Now after running the macro the list should look like:

    Col.A Col.B
    w1 m1;m2;m3;m5
    w2 m23;m24;m50

    The words in col. A are alphabetically ordered and all the duplicates are followed by each other.

    Thanks for your help in advance.

  2. #2
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,498
    Hello wali,

    I am working on a solution. Given the amount of data, I am testing different methods to obtain the fastest method. Hopefully, I will have a solution for you in few hours.

    Sincerely,
    Leith Ross

  3. #3
    Valued Forum Contributor
    Join Date
    11-12-2007
    Location
    Germany
    MS-Off Ver
    2007
    Posts
    389
    Hi Leith Ross,
    thank you very much. I will be looking forward for your solution. Its quite a big list. I hope that it will work.

  4. #4
    Forum Guru mikerickson's Avatar
    Join Date
    03-30-2007
    Location
    Davis CA
    MS-Off Ver
    Excel 2011
    Posts
    2,929
    This Sub/UDF combination should work.
    Run ConsolidateData after changing the sheet name to match your situation.
    Code:
    Sub ConsolidateData()
    With ThisWorkbook.Sheets("Sheet1")
        Range("A:B").Insert shift:=xlToRight
        With .Range("C:C")
            With Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
                .Offset(1, -1).Resize(.Rows.Count - 1, 1).FormulaR1C1 = "=RemoveDups(IF(R[-1]C[1]=RC[1],R[-1]C&"";"","""") &RC[2],"";"")"
                .Offset(0, -1).Range("A1").FormulaR1C1 = "=RC[2]"
                .Offset(0, 1).Value = .Offset(0, -1).Value
                With .Offset(0, -2)
                    .FormulaR1C1 = "=1/((RC[2]=R[1]C[2])-1)"
                    On Error Resume Next
                        .SpecialCells(xlCellTypeFormulas, xlErrors).EntireRow.Delete
                    On Error GoTo 0
                End With
            End With
        End With
        Range("A:B").EntireColumn.Delete shift:=xlToLeft
    End With
    End Sub
    
    Function RemoveDups(delimitedString As String, Optional ByVal Delimiter As String) As String
    Dim inRRay As Variant, oneString As Variant
    If Delimiter = vbNullString Then Delimiter = " "
    inRRay = Split(delimitedString, Delimiter)
    For Each oneString In inRRay
        oneString = Trim(oneString)
        If oneString <> vbNullString And Not (CBool(InStr(RemoveDups & Delimiter, Delimiter & oneString & Delimiter))) Then
            RemoveDups = RemoveDups & Delimiter & oneString
        End If
    Next oneString
    RemoveDups = Mid(RemoveDups, Len(Delimiter) + 1)
    End Function
    _
    ...How to Cross-post politely...
    ..Wrap code by selecting the code and clicking the # or read this. Thank you.

  5. #5
    Valued Forum Contributor
    Join Date
    11-12-2007
    Location
    Germany
    MS-Off Ver
    2007
    Posts
    389
    Hello Mikerickson,
    thank you very much for your code. But unfortunately it removes all my data. The sheet1 is blank after running the code.
    There is only one sheet named "sheet1" and i run your code and each time it delets all data. I dont know what i am doing worng. Or has it something to do with my excel language version. I am using German Excel 2007. And normaly the english formulas dont work for german version.
    Last edited by wali; 08-10-2008 at 06:38 PM.

  6. #6
    Forum Guru mikerickson's Avatar
    Join Date
    03-30-2007
    Location
    Davis CA
    MS-Off Ver
    Excel 2011
    Posts
    2,929
    Can you post a copy of your workbook? (it shouldn't be doing that)

    If its not possible to attach your workbook, try editing out these two line.
    Code:
                With .Offset(0, -2)
                    .FormulaR1C1 = "=1/((RC[2]=R[1]C[2])-1)"
                    On Error Resume Next
                        '.SpecialCells(xlCellTypeFormulas, xlErrors).EntireRow.Delete
                    On Error GoTo 0
                End With
            End With
        End With
        'Range("A:B").EntireColumn.Delete shift:=xlToLeft
    End With
    End Sub
    and reporting back the results of running that code. (That won't fix things but it will help diagnose.)
    _
    ...How to Cross-post politely...
    ..Wrap code by selecting the code and clicking the # or read this. Thank you.

  7. #7
    Valued Forum Contributor
    Join Date
    11-12-2007
    Location
    Germany
    MS-Off Ver
    2007
    Posts
    389
    Hi,
    here is a sample of my data.
    Now i installed english version of Excel but it didnt help. I also removed those two lines of code but they still give wrong result.
    As i tried it with smaller number of words, it didnt delet them but the second column doesnt give the wanted results. Please refer to the screenshots.
    Attached Images Attached Images
    Attached Files Attached Files
    Last edited by wali; 08-11-2008 at 03:11 AM.

  8. #8
    Forum Guru mikerickson's Avatar
    Join Date
    03-30-2007
    Location
    Davis CA
    MS-Off Ver
    Excel 2011
    Posts
    2,929
    Could you post a copy of the data before running the code? Attaching an Excel workbook would be more informative than a screen image.

    I'm thinking of two possible issues.
    a) the list in A is not sorted
    b) the delimiter might not be being read as ";"

    This version sorts the data first, bringing all of column A's duplicate entries together, and introduces a Delimiter variable to the routine, that can be changed to meet your situation. (The UDF is still needed)
    Code:
    Sub ConsolidateData()
    Dim Delimiter As String
    Delimiter = ";"
    With ThisWorkbook.Sheets("Sheet1")
        Range("A:B").Insert shift:=xlToRight
        With .Range("C:C")
            With Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
                .Sort Key1:=.Cells(1, 1), Order1:=xlAscending, Header:=xlGuess, _
                    OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom
                .Offset(1, -1).Resize(.Rows.Count - 1, 1).FormulaR1C1 = "=RemoveDups(IF(R[-1]C[1]=RC[1],R[-1]C&"";"","""") &RC[2]," & Chr(34) & Delimiter & Chr(34) & ")"
                .Offset(0, -1).Range("A1").FormulaR1C1 = "=RC[2]"
                .Offset(0, 1).Value = .Offset(0, -1).Value
                With .Offset(0, -2)
                    .FormulaR1C1 = "=1/((RC[2]=R[1]C[2])-1)"
                    On Error Resume Next
                        .SpecialCells(xlCellTypeFormulas, xlErrors).EntireRow.Delete
                    On Error GoTo 0
                End With
            End With
        End With
        Range("A:B").EntireColumn.Delete shift:=xlToLeft
    End With
    End Sub
    _
    ...How to Cross-post politely...
    ..Wrap code by selecting the code and clicking the # or read this. Thank you.

  9. #9
    Valued Forum Contributor
    Join Date
    11-12-2007
    Location
    Germany
    MS-Off Ver
    2007
    Posts
    389
    Hello,
    i have uploaded the excel file containing the data. In that file macro is not yet run. So its the original file. Please refer to my older message "c.zip" thats the excel file containing the data.

    Before running the macro i have sorted the data. So that cant be the reason.

    Now i tried your new code and that doesnt work either.

    What is UDF? Am i doing something wrong in running the code? Is there something else to be done eccept running the code?

    I pasted the code of yours in the code area and run it. It runs but gives the wrong results. Then i installed the english version of office and it didnt work either.

    Thanks
    Last edited by wali; 08-11-2008 at 05:02 PM.

  10. #10
    Forum Guru mikerickson's Avatar
    Join Date
    03-30-2007
    Location
    Davis CA
    MS-Off Ver
    Excel 2011
    Posts
    2,929
    I can't get any of the Arabic to concatenate. (which is crucial to all of the routine.)
    All the arabic characters return as CHAR(95) in my version.
    Let me see if I can find someone who has experience dealing with Arabic text.
    _
    ...How to Cross-post politely...
    ..Wrap code by selecting the code and clicking the # or read this. Thank you.

  11. #11
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,498
    Hello wali,

    Sorry for the delay, I have been working off site this week. The macro has run well and quickly for me (don't let the size put you off). It condenses the words and their definitions on the same page in columns "A" and "B". You can set the starting row, starting column, and the worksheet name to what you are using. I used "A" as the starting column, 1 as the starting row, and "Sheet2" as the worksheet. These variables are marked in red. I ran quite a variations of code and timed them, and this was the best. I don't have Excel 2007, but it should work. Make a copy of your worksheet before you run the macro. Let me know how it works.
    Code:
    Sub CondenseWithNoDups()
    
      Dim C As Long
      Dim Cn As Long
      Dim Defs As Object
      Dim I As Variant
      Dim LastRow As Long
      Dim N As Long
      Dim R As Long
      Dim Rng As Range
      Dim StartCol As Variant
      Dim StartRow As Long
      Dim Str As String
      Dim Wks As Worksheet
      Dim X As Variant
      
       'Setup variables
        StartCol = "A"
        StartRow = 1
        Set Wks = Worksheets("Sheet2")
        
       'Define worksheet area variables
        With Wks
          C = Wks.Cells(1, StartCol).Column
          Cn = C + 1
          LastRow = Wks.Cells(Rows.Count, C).End(xlUp).Row
          LastRow = IIf(LastRow < StartRow, StartRow, LastRow)
          Set Rng = .Range(.Cells(StartRow, Cn), .Cells(LastRow, Cn))
        End With
        
       'Create a Dictionary with case insensitive comparisons
        Set Defs = CreateObject("Scripting.Dictionary")
        Defs.CompareMode = vbTextCompare
        
        With Wks
          For R = StartRow To LastRow
            Str = .Cells(R, Cn)
            GoSub SplitData
            If .Cells(R, C) <> .Cells(R + 1, C) Then
             'Display the accumulated data less the final semi-colon
              GoSub ListDefinitions
              .Cells(R, Cn) = Left(Str, Len(Str) - 1)
              Defs.RemoveAll
              Str = ""
            Else
             'Clear the cell's data
              .Cells(R, Cn) = ""
            End If
          Next R
        End With
         
       'Delete the blank rows
        On Error Resume Next
          Set Rng = Rng.SpecialCells(xlCellTypeBlanks)
          If Err.Number = 0 Then Rng.EntireRow.Delete Shift:=xlShiftUp
        On Error GoTo 0
        
       'Free memory
        Set Defs = Nothing
           
      Exit Sub
          
    SplitData:
         'Split data where there is a semi-colon and remove any duplicates
          I = InStr(1, Str, ";")
          L = Len(Str) - I
            While I > 0
              Str = Left(Str, I)
              On Error Resume Next
                If Err.Number = 457 Then
                  Err.Clear
                End If
              Defs.Add Str, Str
              Str = Mid(Str, I + 1, L)
              I = InStr(I, Str, ";")
              L = L - I
            Wend
          Defs.Add Str, Str
          On Error GoTo 0
        Return
        
    ListDefinitions:
        Str = ""
        X = Defs.Keys
          For Each I In X
            Str = Str & I & ";"
          Next I
        Return
        
    End Sub
    Sincerely,
    Leith Ross

  12. #12
    Valued Forum Contributor
    Join Date
    11-12-2007
    Location
    Germany
    MS-Off Ver
    2007
    Posts
    389
    Hi Leith Ross,
    first of all thank you very much that you took the time to solve my problem. I am really very thankful for that.

    Unfortunately the code is not giving the wanted results. I am sending screenshots of how the results look like now and how they should be. I am also sending an example excel file.

    Thank you very much
    Attached Images Attached Images
    Attached Files Attached Files

  13. #13
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,498
    Hello wali,

    Here is the amended code. I caught the error you posted when I rechecked the code. This should be error free.
    Code:
    Sub CondenseWithNoDups()
    
      Dim C As Long
      Dim Cn As Long
      Dim Defs As Object
      Dim I As Long
      Dim LastRow As Long
      Dim N As Long
      Dim R As Long
      Dim Rng As Range
      Dim StartCol As Variant
      Dim StartRow As Long
      Dim Str As String
      Dim Wks As Worksheet
      Dim X As Variant, Y As Variant
      
       'Setup variables
        StartCol = "A"
        StartRow = 1
        Set Wks = Worksheets("Sheet2")
        
       'Define worksheet area variables
        With Wks
          C = Wks.Cells(1, StartCol).Column
          Cn = C + 1
          LastRow = Wks.Cells(Rows.Count, C).End(xlUp).Row
          LastRow = IIf(LastRow < StartRow, StartRow, LastRow)
          Set Rng = .Range(.Cells(StartRow, Cn), .Cells(LastRow, Cn))
        End With
        
       'Create a Dictionary with case insensitive comparisons
        Set Defs = CreateObject("Scripting.Dictionary")
        Defs.CompareMode = vbTextCompare
        
        With Wks
          For R = StartRow To LastRow
            Str = .Cells(R, Cn)
            GoSub SplitData
            If .Cells(R, C) <> .Cells(R + 1, C) Then
             'Display the accumulated data less the final semi-colon
              GoSub ListDefinitions
              .Cells(R, Cn) = Left(Str, Len(Str) - 1)
              Defs.RemoveAll
              Str = ""
            Else
             'Clear the cell's data
              .Cells(R, Cn) = ""
            End If
          Next R
        End With
         
       'Delete the blank rows
        On Error Resume Next
          Set Rng = Rng.SpecialCells(xlCellTypeBlanks)
          If Err.Number = 0 Then Rng.EntireRow.Delete Shift:=xlShiftUp
        On Error GoTo 0
        
       'Free memory
        Set Defs = Nothing
           
      Exit Sub
          
    SplitData:
         'Split data where there is a semi-colon and remove any duplicates
          I = InStr(1, Str, ";")
          L = Len(Str)
            While I > 0
              X = Left(Str, I - 1)
              On Error Resume Next
                Defs.Add X, X
              L = L - I
              Str = Right(Str, L)
              I = InStr(1, Str, ";")
            Wend
          On Error Resume Next
            Defs.Add Str, Str
            Err.Clear
          On Error GoTo 0
        Return
        
    ListDefinitions:
        Str = ""
        X = Defs.Keys
          For Each Y In X
            Str = Str & Y & ";"
          Next Y
        Return
        
    End Sub
    Sincerely,
    Leith Ross

  14. #14
    Valued Forum Contributor
    Join Date
    11-12-2007
    Location
    Germany
    MS-Off Ver
    2007
    Posts
    389
    Hi,
    its working correctly now. It cant deal with big list but that is not a problem. As i ran the code for 0.5 Millions entires in col. A, it deleted all the data. Then i reduced it to 200 000 words and it didnt function. I kept the number reducing till it worked. Now it works for about 20 000 entries very well. Now i have splitted the whole wordlist into 20t blocks and am processing it step by step. I will let you know if there are some irregularaties.

    Can you please do me another favour? Can you modify the code in such a way that the following columns like col. C, D, E, F and G are processed the same way as column B. I have to process also wordlists which contain word-->grammer-->phonetic-->meaning--->usage. So there are columns involved upto column G. Is it possible to merge the data of other columns the same way as in column B, if a duplicate in column A exist?

    thanks

  15. #15
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,498
    Hello wali,

    Can you post a sample worksheet I could use to develop the code with?

    Sincerely,
    Leith Ross

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.2.0