+ Reply to Thread
Results 1 to 7 of 7

Evaluating similarity of text strings

  1. #1
    Alan
    Guest

    Evaluating similarity of text strings


    Hi All,

    I have a data list - roughly 2000 items in all, which are names
    (people, companies, organisations etc) and an amount they have paid.

    The data is collected from four separate sources and trimmed already
    to
    avoid extra spaces. The upshot is that there are instances where the
    same name is entered more than one way. For example:

    Alan B Chadd
    Bob Charles
    Chadd Alan B

    Now a human eye can easily spot that the first and third are
    (probably) the same person and it is worth investigating further.
    This is fine when they are close to each other, but not practical
    when they are two pages apart.

    Therefore, I would like to write a formula that could give a score
    that
    ranks the likely similar entries.

    I am thinking that something like this would be good:

    1) Take each letter of the target name, and count how many times it
    appears in every other of the 2000 entries ignoring capitals (it is
    possible
    someone will have typed in a name with or without any proper
    capitalisation).

    2) Add up the totals for each of the 2000 entries

    3) Show any items with a score over X (to be picked by trial) or just
    sort them by the score.


    Example From Above:

    The score for "Alan B Chadd" against "Bob Charles" would be:

    A = 0
    l = 1
    a = 0
    n = 0
    Space = 1
    B = 1
    Space = 1
    C = 1
    h = 1
    a = 0
    d = 0
    d = 0

    Total = 5

    The score for "Alan B Chadd" against "Chadd Alan B" would be:

    A = 3
    l = 1
    a = 3
    n = 1
    Space = 2
    B = 1
    Space = 2
    C = 1
    h = 1
    a = 3
    d = 2
    d = 2

    Total = 22

    Obviously it is not perfect, but it should be good enough to point the
    human in the right direction!


    Can anyone suggest a way to do this?

    Thanks in advance,

    Alan.





  2. #2
    Jason Morin
    Guest

    Re: Evaluating similarity of text strings

    Not perfect either, but...with names in column A, place
    this formula in row 1 of an open column and fill down:

    =SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1)))

    Then sort the data based on this column. Similar strings
    should be closely grouped.

    HTH
    Jason
    Atlanta, GA

    >-----Original Message-----
    >
    >Hi All,
    >
    >I have a data list - roughly 2000 items in all, which

    are names
    >(people, companies, organisations etc) and an amount

    they have paid.
    >
    >The data is collected from four separate sources and

    trimmed already
    >to
    >avoid extra spaces. The upshot is that there are

    instances where the
    >same name is entered more than one way. For example:
    >
    >Alan B Chadd
    >Bob Charles
    >Chadd Alan B
    >
    >Now a human eye can easily spot that the first and third

    are
    >(probably) the same person and it is worth investigating

    further.
    >This is fine when they are close to each other, but not

    practical
    >when they are two pages apart.
    >
    >Therefore, I would like to write a formula that could

    give a score
    >that
    >ranks the likely similar entries.
    >
    >I am thinking that something like this would be good:
    >
    >1) Take each letter of the target name, and count how

    many times it
    >appears in every other of the 2000 entries ignoring

    capitals (it is
    >possible
    >someone will have typed in a name with or without any

    proper
    >capitalisation).
    >
    >2) Add up the totals for each of the 2000 entries
    >
    >3) Show any items with a score over X (to be picked by

    trial) or just
    >sort them by the score.
    >
    >
    >Example From Above:
    >
    >The score for "Alan B Chadd" against "Bob Charles" would

    be:
    >
    >A = 0
    >l = 1
    >a = 0
    >n = 0
    >Space = 1
    >B = 1
    >Space = 1
    >C = 1
    >h = 1
    >a = 0
    >d = 0
    >d = 0
    >
    >Total = 5
    >
    >The score for "Alan B Chadd" against "Chadd Alan B"

    would be:
    >
    >A = 3
    >l = 1
    >a = 3
    >n = 1
    >Space = 2
    >B = 1
    >Space = 2
    >C = 1
    >h = 1
    >a = 3
    >d = 2
    >d = 2
    >
    >Total = 22
    >
    >Obviously it is not perfect, but it should be good

    enough to point the
    >human in the right direction!
    >
    >
    >Can anyone suggest a way to do this?
    >
    >Thanks in advance,
    >
    >Alan.
    >
    >
    >
    >
    >.
    >


  3. #3
    Alan
    Guest

    Re: Evaluating similarity of text strings

    "Jason Morin" <[email protected]> wrote
    in message news:[email protected]...
    >
    > Not perfect either, but...with names in column A, place
    > this formula in row 1 of an open column and fill down:
    >
    > =SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1)))
    >
    > Then sort the data based on this column. Similar strings
    > should be closely grouped.
    >
    > HTH
    > Jason
    > Atlanta, GA
    >


    Hi Jason,

    I like the simplicity of your approach - just adding the code values
    together.

    However, if I change my example slightly and apply the formula I get:

    Alan B Chadd................978
    Bob Charles...................1013
    Chadd A B.....................663


    I think that in a larger sample we would still find that the first and
    third lines are too far apart for a human to spot them together.

    The problem arises because it is very sensitive to the number of
    characters. Whether someone has, for example, entered a middle name
    or not, will affect the score too much. Similarly the scores for the
    following two versions of a company name are too far apart:

    ABC LTD...............458
    abc limited...............1070


    I have tried thinking about a way to modify your approach to make it
    work better under this circumstance. I could apply an UPPER function
    to each string first giving the improved result of:

    ABC LTD...............458
    ABC LIMITED........750


    It can be further improved by subtracting 31 from the code values (a
    space is char(32) so let's just avoid negative results):

    =SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1))-31)

    This gives a further improvement:

    ABC LTD........................241
    ABC LIMITED...............409


    But this is still likely to put them a long way apart in a sort and it
    *feels* to me that we are just compressing the distribution of results
    rather than improving the ability to identify similarities.

    Do you have any other ideas?

    Thanks for your help,

    Alan.







  4. #4
    Jason Morin
    Guest

    Re: Evaluating similarity of text strings

    Well, with the examples you gave, my example worked
    pretty well. It's obvious that you data is much more
    inconsistent in format than your original post lead me to
    believe.

    What you're looking for is an extremely complex
    algorithem to handle such wide disparities in the data.
    Even then, you won't get close on some of them. I think
    the solution lies upstream in the process, before
    receiving the data. If you have any influence on your
    sources, you would try to insert some type of consistency
    in the way they report their data, so that you don't end
    up in the mess you're in now.

    I'd suggest applying what I've given you, enhance it as
    much as you can, then suck it up and do a manual check on
    the rest.

    Jason

    >-----Original Message-----
    >"Jason Morin" <[email protected]> wrote
    >in message news:[email protected]...
    >>
    >> Not perfect either, but...with names in column A, place
    >> this formula in row 1 of an open column and fill down:
    >>
    >> =SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN

    (A1))),1)))
    >>
    >> Then sort the data based on this column. Similar

    strings
    >> should be closely grouped.
    >>
    >> HTH
    >> Jason
    >> Atlanta, GA
    >>

    >
    >Hi Jason,
    >
    >I like the simplicity of your approach - just adding the

    code values
    >together.
    >
    >However, if I change my example slightly and apply the

    formula I get:
    >
    >Alan B Chadd................978
    >Bob Charles...................1013
    >Chadd A B.....................663
    >
    >
    >I think that in a larger sample we would still find that

    the first and
    >third lines are too far apart for a human to spot them

    together.
    >
    >The problem arises because it is very sensitive to the

    number of
    >characters. Whether someone has, for example, entered a

    middle name
    >or not, will affect the score too much. Similarly the

    scores for the
    >following two versions of a company name are too far

    apart:
    >
    >ABC LTD...............458
    >abc limited...............1070
    >
    >
    >I have tried thinking about a way to modify your

    approach to make it
    >work better under this circumstance. I could apply an

    UPPER function
    >to each string first giving the improved result of:
    >
    >ABC LTD...............458
    >ABC LIMITED........750
    >
    >
    >It can be further improved by subtracting 31 from the

    code values (a
    >space is char(32) so let's just avoid negative results):
    >
    >=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1))-

    31)
    >
    >This gives a further improvement:
    >
    >ABC LTD........................241
    >ABC LIMITED...............409
    >
    >
    >But this is still likely to put them a long way apart in

    a sort and it
    >*feels* to me that we are just compressing the

    distribution of results
    >rather than improving the ability to identify

    similarities.
    >
    >Do you have any other ideas?
    >
    >Thanks for your help,
    >
    >Alan.
    >
    >
    >
    >
    >
    >
    >.
    >


  5. #5
    Bob Phillips
    Guest

    Re: Evaluating similarity of text strings


    "Jason Morin" <[email protected]> wrote in message
    news:[email protected]...

    >
    > I'd suggest applying what I've given you, enhance it as
    > much as you can, then suck it up and do a manual check on
    > the rest.


    or use it as supporting material as to how difficult the problem is, and how
    it is better to fix the source of the problem, not the manifestation.



  6. #6
    Alan
    Guest

    Re: Evaluating similarity of text strings

    "Jason Morin" <[email protected]>
    wrote in message news:[email protected]...
    >
    > Well, with the examples you gave, my example worked
    > pretty well. It's obvious that you data is much more
    > inconsistent in format than your original post lead me to
    > believe.
    >


    Agreed - I do appreciate your assistance.

    >
    > What you're looking for is an extremely complex
    > algorithem to handle such wide disparities in the data.
    > Even then, you won't get close on some of them. I think
    > the solution lies upstream in the process, before
    > receiving the data. If you have any influence on your
    > sources, you would try to insert some type of consistency
    > in the way they report their data, so that you don't end
    > up in the mess you're in now.
    >


    Unfortunately the data is already in existence for many periods and
    that is what we have to work with.

    The systems going forwards are fully integrated so the issue does
    exist in the same way.

    >
    > I'd suggest applying what I've given you, enhance it as
    > much as you can, then suck it up and do a manual check on
    > the rest.
    >
    > Jason
    >


    I wish we could afford the manual time, it would be easier from a
    management perspective at least - I will keep on looking for an
    algorithm, but thank you again for your assistance.

    Alan.




  7. #7
    Alan
    Guest

    Re: Evaluating similarity of text strings


    "Bob Phillips" <[email protected]> wrote in message
    news:%[email protected]...
    >
    > "Jason Morin" <[email protected]> wrote in message
    > news:[email protected]...
    >
    > >
    > > I'd suggest applying what I've given you, enhance it as
    > > much as you can, then suck it up and do a manual check on
    > > the rest.

    >
    > or use it as supporting material as to how difficult the problem is,
    > and how it is better to fix the source of the problem, not the
    > manifestation.
    >


    Hi Bob,

    Unfortunately the data is already in existence for many periods and
    that is what we have to work with for the statutory returns.

    The systems going forwards are fully integrated so the issue does
    exist in the same way - the source of the problem is already fixed,
    now I just have to find a way to report the history accurately enough
    for the authorities.

    I will keep looking for an algorithm - if you have any other ideas,
    please do post back.

    Thanks,

    Alan.






+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1