+ Reply to Thread
Results 1 to 4 of 4

String splitting for inconsistent strings

  1. #1
    Richard
    Guest

    String splitting for inconsistent strings

    I receive a series of daily share prices in an html email which I save
    as a file and then have a macro load this file into a data sheet.

    Amongst all the html stuff there are recognisable share price strings
    like this in Column A

    ARM Holdings 116.00p -0.03p (-2.11)
    Amino Technologies 98.50p 0.00p (-0.00)
    BAA 872.50p -0.01p (-0.06)
    McCarthey and Stone 766.00p 0.04p (0.52)


    I'd like to slice these up so that in columns B:C I end up with the
    following

    B1: Arm Holdings, C1: 116.00
    B2: Amino Technologies, C2: 98.50
    B3: BAA, C3: 872.50
    B4: McCarthy and Stone, C4: 766.00

    Unfortunately some names are two or more words long, and others a
    single word, which makes identifying the split between the name and the
    price difficult.

    Until recently, Digital Look.com who send the emails had a slightly
    different layout where there was a unique identifier company ID code
    surrounded with ( ) which made it reasonably easy to identify the split
    between the name and the price. However their layout has now changed to
    the above, and I'm strruggling to find an elegant solution to slicing
    the string.

    I'd be grateful for any help or suggestions

    Usual TIA,

    Richard Buttrey


  2. #2
    Ron Rosenfeld
    Guest

    Re: String splitting for inconsistent strings

    On 20 Jun 2006 09:12:09 -0700, "Richard" <[email protected]> wrote:

    >I receive a series of daily share prices in an html email which I save
    >as a file and then have a macro load this file into a data sheet.
    >
    >Amongst all the html stuff there are recognisable share price strings
    >like this in Column A
    >
    >ARM Holdings 116.00p -0.03p (-2.11)
    >Amino Technologies 98.50p 0.00p (-0.00)
    >BAA 872.50p -0.01p (-0.06)
    >McCarthey and Stone 766.00p 0.04p (0.52)
    >
    >
    >I'd like to slice these up so that in columns B:C I end up with the
    >following
    >
    >B1: Arm Holdings, C1: 116.00
    >B2: Amino Technologies, C2: 98.50
    >B3: BAA, C3: 872.50
    >B4: McCarthy and Stone, C4: 766.00
    >
    >Unfortunately some names are two or more words long, and others a
    >single word, which makes identifying the split between the name and the
    >price difficult.
    >
    >Until recently, Digital Look.com who send the emails had a slightly
    >different layout where there was a unique identifier company ID code
    >surrounded with ( ) which made it reasonably easy to identify the split
    >between the name and the price. However their layout has now changed to
    >the above, and I'm strruggling to find an elegant solution to slicing
    >the string.
    >
    >I'd be grateful for any help or suggestions
    >
    >Usual TIA,
    >
    >Richard Buttrey


    It seems as if this can be done with "Regular Expressions". I will make use of
    the fact that the value which you want to pull out is the first floating point
    number that ends with a "p " (thats "p" followed by <space>). If there are
    any issues where the stock name includes a string of the form <number>p<space>
    you might want to change the formula to take that into account.

    Download and install Longre's free morefunc.xll add-in from
    http://xcell05.free.fr

    To extract the company name:

    =REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s))")

    To extract the price:

    =REGEX.MID(A1,"\d+(\.\d*)?(?=p\s)")

    To be a little more bullet-proof one can require that the price number be
    identified as the first positive floating point number followed by the sequence
    of p<space> then optionally a "-" and another digit. So the formula would
    then be:

    Company Name:
    =REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s-?\d))")

    Price:
    =REGEX.MID(A1,"\d+(\.\d*)?(?=p\s-?\d)")


    --ron

  3. #3
    Richard Buttrey
    Guest

    Re: String splitting for inconsistent strings

    On Tue, 20 Jun 2006 13:29:40 -0400, Ron Rosenfeld
    <[email protected]> wrote:

    >
    >It seems as if this can be done with "Regular Expressions". I will make use of
    >the fact that the value which you want to pull out is the first floating point
    >number that ends with a "p " (thats "p" followed by <space>). If there are
    >any issues where the stock name includes a string of the form <number>p<space>
    >you might want to change the formula to take that into account.
    >
    >Download and install Longre's free morefunc.xll add-in from
    >http://xcell05.free.fr
    >
    >To extract the company name:
    >
    >=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s))")
    >
    >To extract the price:
    >
    >=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s)")
    >
    >To be a little more bullet-proof one can require that the price number be
    >identified as the first positive floating point number followed by the sequence
    >of p<space> then optionally a "-" and another digit. So the formula would
    >then be:
    >
    >Company Name:
    >=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s-?\d))")
    >
    >Price:
    >=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s-?\d)")
    >
    >
    >--ron


    Ron,

    Many thanks for the pointer to this Add In which I've just downloaded.
    The REGEX function looks very interesting - I just need to get my mind
    around the syntax and terminology.

    One supplementary if I may. I've just used your example on my data and
    it works a treat.
    However there's one other small variation I've just noticed, and
    that's where the integer part of the share price is four characters
    with a comma separating the thousands.

    e.g. the following string

    GlaxoSmithKline 1,480.00p 0.16p (1.09)

    returns nothing for the name in B1, and in C1 480.00, not 1480.00

    Can you suggest a modification that will overcome this please - it'll
    probably take me a couple of weeks to understand the syntax!

    Kind regards,

    Richard Buttrey



    __
    Richard Buttrey
    Grappenhall, Cheshire, UK
    __________________________

  4. #4
    Ron Rosenfeld
    Guest

    Re: String splitting for inconsistent strings

    On Thu, 22 Jun 2006 00:04:37 +0100, Richard Buttrey
    <[email protected]> wrote:

    >On Tue, 20 Jun 2006 13:29:40 -0400, Ron Rosenfeld
    ><[email protected]> wrote:
    >
    >>
    >>It seems as if this can be done with "Regular Expressions". I will make use of
    >>the fact that the value which you want to pull out is the first floating point
    >>number that ends with a "p " (thats "p" followed by <space>). If there are
    >>any issues where the stock name includes a string of the form <number>p<space>
    >>you might want to change the formula to take that into account.
    >>
    >>Download and install Longre's free morefunc.xll add-in from
    >>http://xcell05.free.fr
    >>
    >>To extract the company name:
    >>
    >>=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s))")
    >>
    >>To extract the price:
    >>
    >>=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s)")
    >>
    >>To be a little more bullet-proof one can require that the price number be
    >>identified as the first positive floating point number followed by the sequence
    >>of p<space> then optionally a "-" and another digit. So the formula would
    >>then be:
    >>
    >>Company Name:
    >>=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s-?\d))")
    >>
    >>Price:
    >>=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s-?\d)")
    >>
    >>
    >>--ron

    >
    >Ron,
    >
    >Many thanks for the pointer to this Add In which I've just downloaded.
    >The REGEX function looks very interesting - I just need to get my mind
    >around the syntax and terminology.
    >
    >One supplementary if I may. I've just used your example on my data and
    >it works a treat.
    >However there's one other small variation I've just noticed, and
    >that's where the integer part of the share price is four characters
    >with a comma separating the thousands.
    >
    >e.g. the following string
    >
    >GlaxoSmithKline 1,480.00p 0.16p (1.09)
    >
    >returns nothing for the name in B1, and in C1 480.00, not 1480.00
    >
    >Can you suggest a modification that will overcome this please - it'll
    >probably take me a couple of weeks to understand the syntax!
    >
    >Kind regards,
    >
    >Richard Buttrey
    >
    >
    >
    >__
    >Richard Buttrey
    >Grappenhall, Cheshire, UK
    >__________________________


    What you need to do, I believe, is optionally look for zero or more sequences,
    within the number, of a comma followed by three digits. The syntax for that
    would be: (,\d{3})*



    I think this modification will do that:

    \d+(,\d{3})*(\.\d*)(?=p\s)

    So the modified "more bulletproof" formulas would be:

    Company Name:

    =REGEX.MID(A1,"(\w+\s?)+(?=\s\d+(,\d{3})*(\.\d*)?(?=p\s-?\d))")

    Price:

    =REGEX.MID(A1,"\d+(,\d{3})*(\.\d*)?(?=p\s-?\d)")

    If you look at the Company Name formula, you will note I made a small change as
    the previous formula would include a trailing <space> after each name. This
    one does not.

    The simpler formulas would be

    Name:

    =REGEX.MID(A1,"(\w+\s?)+(?=\s\d+(,\d{3})*(\.\d*)?(?=p\s))")

    Price:

    =REGEX.MID(A1,"\d+(,\d{3})*(\.\d*)(?=p\s)")

    =========================================

    http://www.regular-expressions.info/reference.html

    is a place on the net to start learning a bit about syntax.

    I would also recommend the book "Mastering Regular Expressions" by Jeffrey
    Friedl. I got mine for just a few dollars, used, at amazon.com

    Finally, be aware that a limitation of Longre's functions is that they cannot
    process strings longer than 255 characters. I believe this limitation is due
    to the data types used in the XLL, and also applies to the other functions in
    the add-in.

    If you need to process longer strings, it can be done in VBA by setting a
    reference to Microsoft VBScript Regular Expressions 5.5 and writing some code.
    I've got a little of that done, but nothing as flexible as what Longre
    provides. And there's also some syntax that won't work in the VBScript
    implementation, so I rarely bother with it.

    Have fun
    --ron

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1