+ Reply to Thread
Results 1 to 2 of 2

Removing Near-Duplicate Rows, Leaving Those w/Most Data in Specific Columns

  1. #1
    foofoo
    Guest

    Removing Near-Duplicate Rows, Leaving Those w/Most Data in Specific Columns

    I have a spreadsheet with about 20,000 rows of data and I am trying to
    eliminate unnecessary rows. Columns B&C contain numbers, and columns D
    through AR contain text. Many rows contain identical information in
    columns B, C, D, E, & F, but the information in the remaining columns
    may differ. I would like to eliminate rows with duplicate data in
    columns B through F, retaining the rows with the greatest number of
    entries in columns G through AR.

    An example of what I would like to get to is shown below.

    BEFORE
    B C D E F G H I J K L
    Row 1 12 34 DD EE FF GG HH II JJ KK LL
    Row 2 12 34 DD EE FF GG HH II JJ
    Row 3 12 34 DD EE FF GG HH II

    AFTER
    B C D E F G H I J K L
    Row 1 12 34 DD EE FF GG HH II JJ KK LL

    I am not familiar with Visual Basic, so I would prefer to handle this
    with a formula, if possible.

    Can anyone help?




    Thanks!




    Sandi


  2. #2
    Dave Peterson
    Guest

    Re: Removing Near-Duplicate Rows, Leaving Those w/Most Data in SpecificColumns

    How about a little work:

    First, do this against a copy--just in case!

    Insert two new columns A:C in that worksheet.

    In A1, put this formula:
    =row()

    In B1:
    =d1&CHAR(1)&e1&CHAR(1)&f1&CHAR(1)&g1&CHAR(10)&h1
    (it concatenates the stuff in C:G (formerly A:F)

    Then in C1, put this:
    =COUNTA(J1:AU1)
    (formerly G:AR)

    And drag down 20000 rows.
    select A:C
    edit|copy
    edit|paste special|Values
    (things will work faster if you have values, not formulas)

    Add a new row 1 (for headers).

    Now select the whole range and sort by column A in ascending order and a
    secondary key of column B (descending order).

    The top one of each group will be the one that had the most entries.

    Now select column B.
    And do Data|Filter|Advanced filter
    filter in place and Unique records only.

    Now the rows you can see are the ones you want.
    Select those visible rows
    edit|copy
    paste to a new worksheet.

    Sort this data by column A to put it back in the original order.
    delete columns A:C





    foofoo wrote:
    >
    > I have a spreadsheet with about 20,000 rows of data and I am trying to
    > eliminate unnecessary rows. Columns B&C contain numbers, and columns D
    > through AR contain text. Many rows contain identical information in
    > columns B, C, D, E, & F, but the information in the remaining columns
    > may differ. I would like to eliminate rows with duplicate data in
    > columns B through F, retaining the rows with the greatest number of
    > entries in columns G through AR.
    >
    > An example of what I would like to get to is shown below.
    >
    > BEFORE
    > B C D E F G H I J K L
    > Row 1 12 34 DD EE FF GG HH II JJ KK LL
    > Row 2 12 34 DD EE FF GG HH II JJ
    > Row 3 12 34 DD EE FF GG HH II
    >
    > AFTER
    > B C D E F G H I J K L
    > Row 1 12 34 DD EE FF GG HH II JJ KK LL
    >
    > I am not familiar with Visual Basic, so I would prefer to handle this
    > with a formula, if possible.
    >
    > Can anyone help?
    >
    > Thanks!
    >
    > Sandi


    --

    Dave Peterson

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1