+ Reply to Thread
Results 1 to 6 of 6

A Challenge

  1. #1
    mjones
    Guest

    A Challenge

    Hi,

    My client is expecting me to perform miracles. I thought I'd throw
    this one out for those who like a challenge. Many large spreadsheets
    (so large they have to be split up to get them into Excel) have a name
    column. An example of the type of data might be:

    Mary Jane Fox
    Marg F. Smith
    Matt Del Sandro
    Frank George Di Marco
    Paul Joseph Rental/Storage Ltd.
    Kim O'Neil
    Don Jones-St. James
    Andrew K.G. Black
    Mr. and Mrs. Albert F. Casey
    Brown
    Sisters of the Holly Cross of 3rd Street
    M Thomas
    Christine J. Main & Ken Henry

    My client wants me to separate the last name into its own column so
    they can run duplicate tests with other files against it and other
    things. Remember, there could be 65,000 names.

    As you can imagine, this is very time consuming so anything will help.
    I have been using space delimited to separate the fields and then doing
    things like sorting or splitting off the first two characters and
    sorting what's left to separate out the single initials and then
    concatenating back again.

    I know this can't be an exact science, but I'm wondering if a macro can
    be written to perform some of the work and perhaps stop and ask about
    questionable names. For example,

    - single letters with or without periods would be first names
    - a database of words could show companies and take the whole word as
    last name like Ltd. or Co.
    - a database of words like Di, St., Del, O', or Le could be defined as
    part of the last name
    - one word names are last names
    - take the last of two names and forget the first one
    - stuff like that

    When you use text to columns space delimited it puts the first word in
    the first column and second in second column, etc. so depending upon
    how many words are in the name, you never know which column the last
    word will end up in. It would be nice to make it start in the last
    column. We could use Access, too, if that would make any difference.

    I don't expect to receive much help on this, but you never know. I
    find some people in this newsgroup have the most amazing ideas.

    Thanks,

    Michele


  2. #2
    William Benson
    Guest

    Re: A Challenge

    I would use Access. Access is fast, and the table results are easy to work
    with. Just my prefereence I guess.Come up with a final table layout that
    will cover all the fields you might ever run into.You'll have to write some
    "stage" logic. Pull off prefixes in one stage, populating the "Prefix" field
    and dumping the rest into a "Remainder" field. Pull off Suffix1, Suffix2,
    Suffix3, ... as many as you think you might need for Sr., M.D., PhD., etc.
    The logic to parse can be done through VBA/SQL.


    A sample including Dr. or M.D. from the Pisacano Leadership Foundation
    Drs. John and Katherine Miller
    Dr. and Mrs. Michael Needleman
    Dr. and Mrs. David Hutcheson-Tipton
    Lillian Gelberg, M.D. and Steven C. Spronz
    Patricia Glowa, M.D. and Donald Kollisch, M.D.


    based on processing logic, not you are not dealing with slow Excel files nor
    worrying how big the spreadsheets.

    I would run routines which "pre-process" the data run separate stages.

    For example, I would have in my ending table, fields like: First, Middle,
    Last, Prefix, Suffix1, Suffix2 (example of the different between suffix1 &
    2 would be Mr. James R. Smith Jr, PHD).

    Each intermediary result table would have fields you know are what they are,
    and a "AllTheRest" field. So, in Stage One, you might look to pull out all
    the prefixes, so intermediary table 1 will be 3 fields

    NameID Prefix AllTheRest.

    Write logic like If the full text includes both Mr. and Mrs., regardless of
    whether they are next to each other, the Prefix is converted to Mr. & Mrs.

    Then, look over ALL the prefixes you have been able to harvest, and
    specifically look
    Processing: I would have a table of known prefixes... and "pull out"
    prefixes, leave ."harvest" what I know, like all the Mr.

    and remove text (and fill fields)
    "mjones" <[email protected]> wrote in message
    news:[email protected]...
    > Hi,
    >
    > My client is expecting me to perform miracles. I thought I'd throw
    > this one out for those who like a challenge. Many large spreadsheets
    > (so large they have to be split up to get them into Excel) have a name
    > column. An example of the type of data might be:
    >
    > Mary Jane Fox
    > Marg F. Smith
    > Matt Del Sandro
    > Frank George Di Marco
    > Paul Joseph Rental/Storage Ltd.
    > Kim O'Neil
    > Don Jones-St. James
    > Andrew K.G. Black
    > Mr. and Mrs. Albert F. Casey
    > Brown
    > Sisters of the Holly Cross of 3rd Street
    > M Thomas
    > Christine J. Main & Ken Henry
    >
    > My client wants me to separate the last name into its own column so
    > they can run duplicate tests with other files against it and other
    > things. Remember, there could be 65,000 names.
    >
    > As you can imagine, this is very time consuming so anything will help.
    > I have been using space delimited to separate the fields and then doing
    > things like sorting or splitting off the first two characters and
    > sorting what's left to separate out the single initials and then
    > concatenating back again.
    >
    > I know this can't be an exact science, but I'm wondering if a macro can
    > be written to perform some of the work and perhaps stop and ask about
    > questionable names. For example,
    >
    > - single letters with or without periods would be first names
    > - a database of words could show companies and take the whole word as
    > last name like Ltd. or Co.
    > - a database of words like Di, St., Del, O', or Le could be defined as
    > part of the last name
    > - one word names are last names
    > - take the last of two names and forget the first one
    > - stuff like that
    >
    > When you use text to columns space delimited it puts the first word in
    > the first column and second in second column, etc. so depending upon
    > how many words are in the name, you never know which column the last
    > word will end up in. It would be nice to make it start in the last
    > column. We could use Access, too, if that would make any difference.
    >
    > I don't expect to receive much help on this, but you never know. I
    > find some people in this newsgroup have the most amazing ideas.
    >
    > Thanks,
    >
    > Michele
    >




  3. #3
    William Benson
    Guest

    Re: A Challenge

    I suggest you use Access, I would. I would plan a final table layout that
    has every field I might possibly need:

    1st Stage... Prefixes:
    Dr., Drs., Rev., Mr., Mr. and Mrs., Col., Lt., Sir, Hon., etc...
    Intermediate result Fields: Original Name, Prefix, Remainder1.

    2nd Stage... Suffix1 (Remember, there could be more and more suffixes).
    M.D., J.D., CPA, Ph. D. (or PhD.), Sr., Jr., Senior, Junior, etc...
    Intermediate result Fields: Original, Prefix, Suffix1, Remainder2.

    If there is no records with any values in Suffix1, move on -- otherwise
    process for:
    3rd Stage... Suffix2 (Remember, there could be more and more suffixes).
    M.D., J.D., CPA, Ph. D. (or PhD.), Sr., Jr., Senior, Junior, etc...
    Intermediate result Fields: Original, Prefix, Suffix1, Suffix2,
    Remainder3.

    If there is no records with any values in Suffix2, move on -- otherwise
    process for:
    4th Stage... Suffix3 (Remember, there could be more and more suffixes).
    M.D., J.D., CPA, Ph. D. (or PhD.), Sr., Jr., Senior, Junior, etc...
    Intermediate result Fields: Original, Prefix, Suffix1, Suffix2,
    Suffix3, Remainder4.

    If there is no records with any values in Suffix3, move on -- otherwise
    process for:
    Intermediate result Fields: Original, Prefix, Suffix1, Suffix2,
    Suffix3, Suffix4, Remainder5. and so on.

    Now comes the hard logic. Try to eliminate as many obvious, simple choices
    as possible. Say, Remainder5 has no apostrophes, no hyphens, no "St." as in
    "St. James", all the wierd things you can think of are not there, as well as
    it is clearly in the format: "XXX* Y. ZZZ*" ... move all such records
    to a "possibly complete" table, and inspect as best you can.

    Then look over the remaining data, and see where you need to go next.

    The point I want to hit home, having done this before, is don't go crazy
    with logic until you find you have to. Another successful strategy is to put
    all your known problems into one place, like you could put all the names
    with hyphens in them in their own table after you get the prefixes and the
    suffixes nailed down, and go back looking for additional logic you need
    later. The nice part of Access is the routines are quick, it is easy to
    filter on the fly (My favorite feature is "Filter Excluding Selection"), and
    saving data and backing up is a breeze PLUS you never have to worry about
    the number of records.

    Good Luck, I know you are really smart and will conquer 99.9% of this. I
    guess it's the remaining 1/10 of 1 percent that makes the miracle :-0



    "


    You have written very clearly, so it shows you are a very clear thinker and
    probably do not need me to offer guidance but I will do anyway because you
    asked. I would definitely use "stages". At each stage, I would pull out what
    I can conclude about the
    "mjones" <[email protected]> wrote in message
    news:[email protected]...
    > Hi,
    >
    > My client is expecting me to perform miracles. I thought I'd throw
    > this one out for those who like a challenge. Many large spreadsheets
    > (so large they have to be split up to get them into Excel) have a name
    > column. An example of the type of data might be:
    >
    > Mary Jane Fox
    > Marg F. Smith
    > Matt Del Sandro
    > Frank George Di Marco
    > Paul Joseph Rental/Storage Ltd.
    > Kim O'Neil
    > Don Jones-St. James
    > Andrew K.G. Black
    > Mr. and Mrs. Albert F. Casey
    > Brown
    > Sisters of the Holly Cross of 3rd Street
    > M Thomas
    > Christine J. Main & Ken Henry
    >
    > My client wants me to separate the last name into its own column so
    > they can run duplicate tests with other files against it and other
    > things. Remember, there could be 65,000 names.
    >
    > As you can imagine, this is very time consuming so anything will help.
    > I have been using space delimited to separate the fields and then doing
    > things like sorting or splitting off the first two characters and
    > sorting what's left to separate out the single initials and then
    > concatenating back again.
    >
    > I know this can't be an exact science, but I'm wondering if a macro can
    > be written to perform some of the work and perhaps stop and ask about
    > questionable names. For example,
    >
    > - single letters with or without periods would be first names
    > - a database of words could show companies and take the whole word as
    > last name like Ltd. or Co.
    > - a database of words like Di, St., Del, O', or Le could be defined as
    > part of the last name
    > - one word names are last names
    > - take the last of two names and forget the first one
    > - stuff like that
    >
    > When you use text to columns space delimited it puts the first word in
    > the first column and second in second column, etc. so depending upon
    > how many words are in the name, you never know which column the last
    > word will end up in. It would be nice to make it start in the last
    > column. We could use Access, too, if that would make any difference.
    >
    > I don't expect to receive much help on this, but you never know. I
    > find some people in this newsgroup have the most amazing ideas.
    >
    > Thanks,
    >
    > Michele
    >




  4. #4
    William Benson
    Guest

    Re: A Challenge

    This got sent by accident, and was not complete (and many of the thoughts
    were "re-thunk" ... sorry for the seeming double post and PLEASE -- NO
    FLAMES FROM ANYONE, ACCIDENTS HAPPEN :-)


    "William Benson" <wbenson1(SPAMSUCKS)@nycap.rr.com> wrote in message
    news:[email protected]...
    >I would use Access. Access is fast, and the table results are easy to work
    >with. Just my prefereence I guess.Come up with a final table layout that
    >will cover all the fields you might ever run into.You'll have to write some
    >"stage" logic. Pull off prefixes in one stage, populating the "Prefix"
    >field and dumping the rest into a "Remainder" field. Pull off Suffix1,
    >Suffix2, Suffix3, ... as many as you think you might need for Sr., M.D.,
    >PhD., etc. The logic to parse can be done through VBA/SQL.
    >
    >
    > A sample including Dr. or M.D. from the Pisacano Leadership Foundation
    > Drs. John and Katherine Miller
    > Dr. and Mrs. Michael Needleman
    > Dr. and Mrs. David Hutcheson-Tipton
    > Lillian Gelberg, M.D. and Steven C. Spronz
    > Patricia Glowa, M.D. and Donald Kollisch, M.D.
    >
    >
    > based on processing logic, not you are not dealing with slow Excel files
    > nor worrying how big the spreadsheets.
    >
    > I would run routines which "pre-process" the data run separate stages.
    >
    > For example, I would have in my ending table, fields like: First, Middle,
    > Last, Prefix, Suffix1, Suffix2 (example of the different between suffix1
    > & 2 would be Mr. James R. Smith Jr, PHD).
    >
    > Each intermediary result table would have fields you know are what they
    > are, and a "AllTheRest" field. So, in Stage One, you might look to pull
    > out all the prefixes, so intermediary table 1 will be 3 fields
    >
    > NameID Prefix AllTheRest.
    >
    > Write logic like If the full text includes both Mr. and Mrs., regardless
    > of whether they are next to each other, the Prefix is converted to Mr. &
    > Mrs.
    >
    > Then, look over ALL the prefixes you have been able to harvest, and
    > specifically look
    > Processing: I would have a table of known prefixes... and "pull out"
    > prefixes, leave ."harvest" what I know, like all the Mr.
    >
    > and remove text (and fill fields)
    > "mjones" <[email protected]> wrote in message
    > news:[email protected]...
    >> Hi,
    >>
    >> My client is expecting me to perform miracles. I thought I'd throw
    >> this one out for those who like a challenge. Many large spreadsheets
    >> (so large they have to be split up to get them into Excel) have a name
    >> column. An example of the type of data might be:
    >>
    >> Mary Jane Fox
    >> Marg F. Smith
    >> Matt Del Sandro
    >> Frank George Di Marco
    >> Paul Joseph Rental/Storage Ltd.
    >> Kim O'Neil
    >> Don Jones-St. James
    >> Andrew K.G. Black
    >> Mr. and Mrs. Albert F. Casey
    >> Brown
    >> Sisters of the Holly Cross of 3rd Street
    >> M Thomas
    >> Christine J. Main & Ken Henry
    >>
    >> My client wants me to separate the last name into its own column so
    >> they can run duplicate tests with other files against it and other
    >> things. Remember, there could be 65,000 names.
    >>
    >> As you can imagine, this is very time consuming so anything will help.
    >> I have been using space delimited to separate the fields and then doing
    >> things like sorting or splitting off the first two characters and
    >> sorting what's left to separate out the single initials and then
    >> concatenating back again.
    >>
    >> I know this can't be an exact science, but I'm wondering if a macro can
    >> be written to perform some of the work and perhaps stop and ask about
    >> questionable names. For example,
    >>
    >> - single letters with or without periods would be first names
    >> - a database of words could show companies and take the whole word as
    >> last name like Ltd. or Co.
    >> - a database of words like Di, St., Del, O', or Le could be defined as
    >> part of the last name
    >> - one word names are last names
    >> - take the last of two names and forget the first one
    >> - stuff like that
    >>
    >> When you use text to columns space delimited it puts the first word in
    >> the first column and second in second column, etc. so depending upon
    >> how many words are in the name, you never know which column the last
    >> word will end up in. It would be nice to make it start in the last
    >> column. We could use Access, too, if that would make any difference.
    >>
    >> I don't expect to receive much help on this, but you never know. I
    >> find some people in this newsgroup have the most amazing ideas.
    >>
    >> Thanks,
    >>
    >> Michele
    >>

    >
    >




  5. #5
    mjones
    Guest

    Re: A Challenge

    Thank you Will. I'm not very good at Access, but I suppose now is a
    good time to work on my skills. I appreciate your time and ideas.

    Michele


  6. #6
    Valued Forum Contributor
    Join Date
    06-30-2005
    Location
    Verwood, Dorset, England
    MS-Off Ver
    Excel 2000
    Posts
    479
    Quote Originally Posted by mjones
    Hi,

    My client is expecting me to perform miracles. I thought I'd throw
    this one out for those who like a challenge. Many large spreadsheets
    (so large they have to be split up to get them into Excel) have a name
    column. An example of the type of data might be:

    Mary Jane Fox
    Marg F. Smith
    Matt Del Sandro
    Frank George Di Marco
    Paul Joseph Rental/Storage Ltd.
    Kim O'Neil
    Don Jones-St. James
    Andrew K.G. Black
    Mr. and Mrs. Albert F. Casey
    Brown
    Sisters of the Holly Cross of 3rd Street
    M Thomas
    Christine J. Main & Ken Henry

    My client wants me to separate the last name into its own column so
    they can run duplicate tests with other files against it and other
    things. Remember, there could be 65,000 names.

    As you can imagine, this is very time consuming so anything will help.
    I have been using space delimited to separate the fields and then doing
    things like sorting or splitting off the first two characters and
    sorting what's left to separate out the single initials and then
    concatenating back again.

    I know this can't be an exact science, but I'm wondering if a macro can
    be written to perform some of the work and perhaps stop and ask about
    questionable names. For example,

    - single letters with or without periods would be first names
    - a database of words could show companies and take the whole word as
    last name like Ltd. or Co.
    - a database of words like Di, St., Del, O', or Le could be defined as
    part of the last name
    - one word names are last names
    - take the last of two names and forget the first one
    - stuff like that

    When you use text to columns space delimited it puts the first word in
    the first column and second in second column, etc. so depending upon
    how many words are in the name, you never know which column the last
    word will end up in. It would be nice to make it start in the last
    column. We could use Access, too, if that would make any difference.

    I don't expect to receive much help on this, but you never know. I
    find some people in this newsgroup have the most amazing ideas.

    Thanks,

    Michele
    Hi Michelle

    Try a search for splitting names, this returns a number of threads you might find useful

    I asked a similar, not so complex question not long ago and got some excellent help, so try changing the title of your query to something like "Help with splitting names"

    In this thread http://www.excelforum.com/showthread...plitting+names Bob Phillips in the 9th post says "I have a RegExp solution for any combination. It gets worse, because you can
    have name like William A. Carson Jr., or Ian St. John, or even Marquis de
    Sade." so I am sure he can help or provide more detail about his RegExp solution
    Paul

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1