+ Reply to Thread
Results 1 to 10 of 10

How to dynamically-parse text from webpage to variables?

  1. #1
    Registered User
    Join Date
    05-17-2010
    Location
    Mankato, MN
    MS-Off Ver
    Excel 2003
    Posts
    11

    Question How to dynamically-parse text from webpage to variables?

    Greetings!

    Here's what I'm trying to accomplish: I have a program I wrote in Excel that takes real estate property information and auto-generates a graph of the data. There are typically 3-4 each of "Active", "Sold" and "Unsold" listings (max of 12 properties in this situation).

    One of my local Multiple Listing Services has the ability to export data as Excel spreadsheet data, so it takes nothing more than copy->paste from the MLS to my program, and the analyzing is complete (since it knows how to handle the data...it's already in the appropriate columns/rows).

    However, my bigger challenge is this: how to extract data to an array/columns+rows based on data that isn't initially-presented in spreadsheet format. (see attached PDF for sample data.)

    I was thinking that some type of Regular expression function to basically search for certain text strings, then extract out the variable name + value, and keep the data separated appropriately (arranging data so that it can be mapped to a spreadsheet table and analyzed). I'm finding that portion, well, impossible.

    One idea I thought of was setting up a dynamic regular expressions generator for each variable column (there are about 25 columns....asking price, listing date, sold price, etc) but I'm not sure if I can figure out a way to make it both user-friendly, *and* compatible. (I'm a n00b w/ RegEx). At that point, have the user print the listing data to PDF, and have a script or function to quickly scrape that data into a spreadsheet format by row/listing. (again: this is the part I can't figure out, or know where to start.)

    I've attached a text file that was generated by accessing the MLS database through Excel 2007, and copying the displayed single listing text as text values (without formatting). I also attached a PDF printout of a couple listings, which is a sample of what the user would see from the root database if they were not able to access the data in a spreadsheet view....this is what would need to be parsed/extracted.

    The other attachment "sample input data.xls" is what I basically need to get the extracted data (a la the PDF) into.

    Any ideas on how I might be able to accomplish this task of scraping/organizing bits of data into a spreadsheet from a non-spreadsheet setup? any help would be much appreciated! I've been wracking my brain on this for almost a week now...

    :::a

  2. #2
    Forum Contributor
    Join Date
    07-13-2007
    Posts
    151

    Re: How to dynamically-parse text from webpage to variables?

    I'd use an XMLHTTP request, or Internet Explorer automation to retrieve the HTML and then parse out the pieces I needed.

    You should take a look at the HTML behind the page and determine if the data is organized in such a manner that the elements you're interested in retrieving are easily identifiable.

    Go to the web page, right click on it, and select View Source to get a look behind the scenes. That will tell you a lot. If the data is in a simple HTML layout, it'll be easy. If it's Flash-driven images, you might be out of luck.

  3. #3
    Forum Contributor
    Join Date
    07-01-2008
    Location
    Cincinnati, OH
    Posts
    150

    Re: How to dynamically-parse text from webpage to variables?

    It might be easier (and more dependable) to parse the webpage itself as opposed to querying it into a spreadsheet and then digging the data out of cells. Post the HTML code of the webpage or provide me with access and I'll work with you a bit.

  4. #4
    Registered User
    Join Date
    05-17-2010
    Location
    Mankato, MN
    MS-Off Ver
    Excel 2003
    Posts
    11

    Re: How to dynamically-parse text from webpage to variables?

    Quote Originally Posted by Tom Schreiner View Post
    It might be easier (and more dependable) to parse the webpage itself as opposed to querying it into a spreadsheet and then digging the data out of cells. Post the HTML code of the webpage or provide me with access and I'll work with you a bit.
    You make a valid point....the less conversion to and fro is probably the way to go.

    In response to the question above, the data is presented in html.

    I've attached the HTML source file of a single listing view to this post.

    Here's the rub: the end-users of this program will be using various systems to access data, and each of those are programmed (and therefore, displayed) differently. It sounds like some method of parsing the raw html is the most-realistic way. Of course, the goal is to do is seamlessly, quickly and accurately.

    Thanks for the assistance with this! I sincerely appreciate it!


    :::a
    Attached Files Attached Files

  5. #5
    Registered User
    Join Date
    05-17-2010
    Location
    Mankato, MN
    MS-Off Ver
    Excel 2003
    Posts
    11

    Re: How to dynamically-parse text from webpage to variables?

    Quote Originally Posted by ShredDude View Post
    I'd use an XMLHTTP request, or Internet Explorer automation to retrieve the HTML and then parse out the pieces I needed.

    You should take a look at the HTML behind the page and determine if the data is organized in such a manner that the elements you're interested in retrieving are easily identifiable.

    Go to the web page, right click on it, and select View Source to get a look behind the scenes. That will tell you a lot. If the data is in a simple HTML layout, it'll be easy. If it's Flash-driven images, you might be out of luck.
    I'm not sure if an XMLHTTP request would be possible....while the data is presented in HTML, the database containing the root information requires the user to log in first in order to even search the data (and requires add'l layers of security that can't be automated). This program I'm making is designed for when they're already logged-in, and saved the list of 12 properties they'll be using for the report...from there, i want them to be able to select/copy the data to the clipboard, go into my program, click the "Process" button, and all the data is brought-in, formatted and analyzed. If it takes more than a few steps, it defeats the purpose of the program, which is why this is so frustrating....

    Thanks for the reply!

    :::a

  6. #6
    Forum Contributor
    Join Date
    07-01-2008
    Location
    Cincinnati, OH
    Posts
    150

    Re: How to dynamically-parse text from webpage to variables?

    Here's the rub: the end-users of this program will be using various systems to access data, and each of those are programmed (and therefore, displayed) differently. It sounds like some method of parsing the raw html is the most-realistic way. Of course, the goal is to do is seamlessly, quickly and accurately.

    (and requires add'l layers of security that can't be automated)

    There is no can't as far as I have seen. There may be some script to validate events such as mouse clicks or keyboard events but beyond cross scripting limitations over frames and the file upload control, you should be able to automate the process...

  7. #7
    Forum Contributor
    Join Date
    07-13-2007
    Posts
    151

    Re: How to dynamically-parse text from webpage to variables?

    I agree with Tom, you should still be able to automate it. You can probably capture the session credentials after a log in and use those to facilitate further XMLHTTP Posts against the server. That VIEWSTATE variable in the snippet you provided is probably returned with the first login and along with any cookie returned after login, could be used to facilitate further Posts.

    Unfortunately the HTML isn't well formatted, IMO. You'll have to spend some extra time building your parsing routine to find the element you want. It may be consistent enough t build a function that you pass the element's name to to do the extract. Looks like they use a </SPAN> after the heading and then a (lowercase) </span> after the corresponding value. That might get you there.

    Good luck with it!

  8. #8
    Forum Contributor
    Join Date
    07-13-2007
    Posts
    151

    Re: How to dynamically-parse text from webpage to variables?

    Quote Originally Posted by amateurmetheus View Post
    I'm not sure if an XMLHTTP request would be possible....while the data is presented in HTML, the database containing the root information requires the user to log in first in order to even search the data (and requires add'l layers of security that can't be automated). This program I'm making is designed for when they're already logged-in, and saved the list of 12 properties they'll be using for the report...from there, i want them to be able to select/copy the data to the clipboard, go into my program, click the "Process" button, and all the data is brought-in, formatted and analyzed. If it takes more than a few steps, it defeats the purpose of the program, which is why this is so frustrating....

    Thanks for the reply!

    :::a
    Just re-read your post...if the user will do all the navigating to generate this report of 12 properties in advance, then maybe another approach would work.

    You might try using a Shell Structure to capture the HTML from the window that contains this pregenerated report of 12 properties. For example, if that report always has a Caption that you could search for, you could loop through all the open browser windows, find the one with that caption, and then grab that window's document's innerthml. From there you could begin parsing what you need.

    Or, have your program begin the browser session through which they'll do their own navigating to generate this12 property report, and then have a button to click to grab what's currently in the browser. You could build a Userform with a webbrowser control to facilitate that interaction, and have a button on the userform to then grab it when its ready. That way you'd know which document to go after.

    With that setup, you wouldn't being copying anything to the clipboard, but grabbing the HTML directly from the browser, parsing what you need in your code, and then dropping the results into your spreadsheet where needed.

  9. #9
    Forum Contributor
    Join Date
    07-01-2008
    Location
    Cincinnati, OH
    Posts
    150

    Re: How to dynamically-parse text from webpage to variables?

    I have not let this post go but have not had the time to go through the HTML source you posted earlier. Please keep me up to date on your progress to avoid redundant work.

  10. #10
    Registered User
    Join Date
    05-17-2010
    Location
    Mankato, MN
    MS-Off Ver
    Excel 2003
    Posts
    11

    Re: How to dynamically-parse text from webpage to variables?

    Quote Originally Posted by Tom Schreiner View Post
    I have not let this post go but have not had the time to go through the HTML source you posted earlier. Please keep me up to date on your progress to avoid redundant work.
    Thanks for the update, Tom.

    I've found a temporary workaround that captures less information (with the option to update manually items that can't be grabbed in that manner). It'd still be ideal if I could have an interface/function that extracts the data from the raw html.

    Thanks so much for the help--I really do appreciate it!!

    :::a

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1