+ Reply to Thread
Results 1 to 19 of 19

remove common English words

  1. #1
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    remove common English words

    What is good programming code for removing common English words (say a dozen, such as
    Please Login or Register  to view this content.
    , etc.) from cells in Excel? The questions I'm running into are whether to try to do it at the formula level, probably using Substitute, or at the VB level, such as with the code below.

    When using the code below, the words in the strings to search are separated with commas (for example, keyword lists) and the words in my list of words to find have that same character on each side of the word (such as
    Please Login or Register  to view this content.
    ). So, this method works okay, but will not find the common words at the beginning or the end of the strings, because if I add something like
    Please Login or Register  to view this content.
    or
    Please Login or Register  to view this content.
    in order to find the common words at beginning or end of cell, then I wind up stripping characters out of other words such as ,theme.

    Please Login or Register  to view this content.

    So, I'm wondering what others have done to accomplish something like this.
    Last edited by finlander; 06-08-2011 at 11:37 AM.

  2. #2
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: remove common English words

    Hello finlander,

    It would help to see some actual data. Do have a workbook you post?
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

  3. #3
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    Re: remove common English words

    Hi Leith,

    Yes, sure, here is a workbook attached with VB. This is usually a csv book because the rows are the result of some concatenating that I do in the xls workbook (I then paste the values to the csv book). My formulas in the xls workbook are what I use to make keyword lists out of the title of the product.

    That part is easy -- I'm just using a Substitute formula to get the comma-separated keyword list, but in order for the list to be useful when I do mass importing using the csv, I've got to figure out a way to effectively strip out the common English words. Otherwise, I'll just have to start manually typing the words for the tag cloud and meta_keywords, which I was trying to avoid. Hence, the whole purpose of trying to find a programmatic solution.

    (The VB code in the csv book is acting on the 'tags' field for the time being. The tags field and the meta_keywords field can and/or will be very similar, so I was just testing on the tags field for now with my VB cell references.)
    Attached Files Attached Files

  4. #4
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: remove common English words

    Hello finlander,

    Why are there commas in column "M"?

  5. #5
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    Re: remove common English words

    Hi Leith,

    Those are to replace
    Please Login or Register  to view this content.
    with
    Please Login or Register  to view this content.
    That works okay for stuff in the middle, because the common English word is stripped out and replaced with a comma, so that's fine, but if I try to expand the search (cells L9 and L10) to include beginning and end of the cell, then a new problem begins.

    For example, if the search is expanded to search for

    a,

    at the beginning of a cell, then the text
    Please Login or Register  to view this content.
    anywhere would lose its last letter.

    That's the part I'm struggling with.
    Last edited by finlander; 06-08-2011 at 12:43 AM.

  6. #6
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: remove common English words

    Hello finlander,

    Okay, that clears up that piece. So with this line...
    with,General,the,Cable,a,Corporation,his
    after the macro is run should look like this?
    with,General,the,Cable,Corporation,his

  7. #7
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    Re: remove common English words

    Yes,

    actually, in this test case, it would look like this 'cause 'the' gets stripped:

    with,General,Cable,Corporation,his

  8. #8
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: remove common English words

    Hello finlander,

    I'm back. Had to cut up a ham and put it away. I understand now what you want. I'll put together a parsing routine for you. Are you familiar at all with Regular Expressions?

  9. #9
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    Re: remove common English words

    Hello, thanks for your help, yes, I've worked quite a bit with regular expressions, but I have not worked that much with VB, so it would have been a challenge to combine the two. I had thought that the solution might have involved regex to simply identify beginning or end of line, so that whatever list of common English words is used, it would would work without slaughtering other keeper words. Beyond that, I had no idea how to implement it, or if it was the right idea at all. Many thanks for your willingness to look more closely as this challenge.

    update: to be precise, this text

    the,then,General,with,Cable,a,lathe,Corporation,for
    would ideally end up looking like this

    General,Cable,lathe,Corporation
    It was expanding my 'middle stuff' solution to handle 'beginning and end' without fouling up other words, that was baffling me.
    Last edited by finlander; 06-08-2011 at 01:49 AM. Reason: clarification

  10. #10
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: remove common English words

    Hello finlander,

    Using Regular Expressions to parse and replace the words is the most flexible and expedient solution. Glad to hear you have some familiarity with them. At least the code won't look totally alien to you.

  11. #11
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    Re: remove common English words

    A quick note before you get too far .. do you think it's best to implement regex in the VB, or would it be worth it to try to use regex in the formula? I can't imagine a way of identifying a dozen or two dozen English words using regex in the formula, but fwiw, here is my Substitution formula that creates the comma-separated list.

    =SUBSTITUTE(AdData!B2;" ";",")
    pretty simple, just replaces spaces with commas, leaving in all common English words if they are present. It was after getting this far that I started to think about programmatically stripping out the ordinary words from the result. You will notice that I'm using semi-colons, 'cause of some other import software that I'm using.

    Be back tomorrow. 'Night.

  12. #12
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: remove common English words

    Hello finlander,

    The macro below has been added to the attached workbook. A button on the worksheet runs it. The Find keyword range starts at "L5" and continues down to the last entry in column "L". Likewise, The Replace range starts at cell "M2". If you have any other punctuation in the words, like hyphens, apostrophes, etc. I will need to make a minor change to the macro. Otherwise it should work with no problems.

    Find and Replace Macro
    Please Login or Register  to view this content.
    Attached Files Attached Files

  13. #13
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    Re: remove common English words

    Wow, Leith, you are awesome. Thank You. Your solution is outstanding.

    This forum is great, too. Among the best I've seen.

    I'll give the code some test drives with real bulk data, and let you know how it goes in a PM. Thread will be marked as Solved and all scales credit that I can give!

    Have a super Wednesday!

  14. #14
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    Re: remove common English words

    'morning Leith,

    The VB works very nicely, thank you again.

    There is one small change we can make to make it bullet-proof. I do need to replace ,[theword], with a comma, so I simply put a comma (in the macro) as the replace character, so that completely does what's needed for most situations. However, by doing this, if the common English word appears at the beginning or end, though, then after running the macro, a comma will be at the beginning or end.

    Okay, no biggie, could the macro simply include the instruction to also find-a-comma-at-beginning-or-end-and-remove-it? Also, you may have noticed in the current code, if a word such as 'of' appears next to a word such as 'the' like this -- ,of,the, -- then only one gets stripped and the other one permanently remains because it ends up as part of the adjacent word, so that's why replacing words with a comma will both (1) insert the comma as generally needed and (2) also allow the macro to find the second adjacent "[word]" with a second (or third or fourth) pass of the macro.

    So, if there were simply a way of stripping a single comma from beginning or end as the final operation, perhaps as a second macro (?), then it looks as though all situations would be solved.

    What say you?

  15. #15
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: remove common English words

    Hello finlander,

    I'll make the needed changes to address the comma position problem. Afterwards, we can deal with any other issues that arise.

  16. #16
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    Re: remove common English words

    hi again, Leith,

    sure, sound great.

    On a side note, I've been studying the pattern that you came up with. It's very interesting, because the AND (&) operator is used to look for the item at any position (begin,middle,end), but does not require that the item be at begin,middle,end. I'm still studying it to see how it works. Do they call that a "look ahead" type of operator? For example, if the item (the word, the, for example) is at the beginning but not in the middle and not at the very end, then the final two-thirds of the pattern is just ignored, i.e, it's not required that Keyword actually exist in order to use the '&' operator between each segment of the pattern?

    My question probably doesn't make sense .. sorry .. was just trying to figure out how an AND operator can simply be overlooked when not needed. I'll study it some more.

    update: never mind .. it's the OR (|) operator that is busting it up into optional pieces. I see it now.
    Last edited by finlander; 06-08-2011 at 01:42 PM.

  17. #17
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: remove common English words

    Hello finlander,

    Here is the updated macro. This will ensure no commas are at the start or end of the string.
    Please Login or Register  to view this content.

  18. #18
    Registered User
    Join Date
    06-07-2011
    Location
    Boise, Idaho
    MS-Off Ver
    Excel 2003
    Posts
    20

    Re: remove common English words

    yep, it works perfectly, absolutely perfect. Well done, Leith!

  19. #19
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: remove common English words

    Hello finlander,

    To answer your previous question about the ampersand character, it is used to concatenate the variable string "keyword" into the Regular Expression pattern.

    Look aheads in VB Regular Expressions appear in this format: (?:exp), (?=exp), or (?!exp), where exp is the expression before looked for. Each language version of Regular Expressions has it differences, so these may not be familiar to you.

    The pattern looks for the keyword at the beginning of the string (marked by the caret "^"), the middle of the string, and the end of the string (marked by the dollar sign "$"). The pipe character ("|") is used to express an OR condition.

    Normally a pattern is fixed string. In this case, we are searching for a variable number of possible matches. By making keyword a separate string, it can be assigned whatever value we want to search for. Setting the Regular Expression property Global to True will search the string for all possible matches. The default setting is False. When false, only a single match is attempted.

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1