
Extract Data from Web Page Source Code

  1. #1
    Registered User
    Join Date
    10-04-2011
    Location
    Swindon, England
    MS-Off Ver
    Excel 2007
    Posts
    2

    Extract Data from Web Page Source Code

    I am trying to find a piece of code that will retrieve data from the source of an HTML page and place it in a column in Excel.
    I have a spreadsheet that contains the URL of the web page, and I have the code to view the source. What I can't do is extract the data within a specific span tag.
    That is, I want to go to the web address in column A (www.awebsite.com), extract the data between the <span class = "image"> and </span> tags, and paste this value into column B. The reason is that I have a large number of products to maintain and need to ensure they always have an image, as the image paths change regularly. Can anyone help?

  2. #2
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello R1chard,

    Welcome to the Forum!

    Yes, the data between the tags can be extracted. My question is: why is the page source in the workbook? Surely you don't need to save the entire page source, do you?
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

  3. #3
    Registered User
    Join Date
    10-04-2011
    Location
    Swindon, England
    MS-Off Ver
    Excel 2007
    Posts
    2

    Re: Extract Data from Web Page Source Code

    I may not have explained myself clearly before. The page source isn't in the workbook; I have a URL to a web page in the workbook. I then have some code that opens the source code of this page (in IE). I want to extract the HTML and text between the span tags and place only this selected text back into Excel next to the URL. The part I am having trouble with is extracting only the data between the span tags. Any help would be gratefully received.

    This is what i have so far:
    [Code not shown.]

  4. #4
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello R1chard,

    Sorry about the delay. I have been experiencing system problems for the past 2 days and everything has been unstable. None of my diagnostics have been able to pinpoint the problem. I finally got some code together for you despite these problems.

    The attached workbook contains 5 macros in 3 separate modules. This code works much faster than using Internet Explorer because it requests the page source directly from the server. Another advantage is that it returns the server's status, so if there is a problem it can be identified. The data between the start and end tags is copied down the worksheet from a cell you specify. One of the macros converts HTML character entities, e.g. &nbsp;, into actual characters.

    The macro ScrapeData is set up to read a list of URLs. Each URL's parsed tag pair data is added to a single column on "Sheet2" below the header row. The data from the next URL is placed in the next column to the right. Previous data is cleared before the new data is copied. Have a look and let me know if you need any changes made.
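
    The workbook's macros are not reproduced in this thread, so the following is only a rough sketch of the same idea in Python, not the VBA described above. It assumes a direct request that returns the status alongside the source, a plain start/end string search, and entity decoding. The names fetch_source and text_between are made up for illustration.

```python
import html
import re
from urllib.request import urlopen


def fetch_source(url, timeout=10):
    """Request a page directly and return (status code, decoded source).

    urlopen raises an HTTPError for 4xx/5xx responses, so a failing
    server is reported instead of silently returning an empty page.
    """
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.status, resp.read().decode(charset, errors="replace")


def text_between(source, start_tag, end_tag):
    """Return the text between every start_tag/end_tag pair, entities decoded."""
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    found = re.findall(pattern, source, flags=re.DOTALL | re.IGNORECASE)
    return [html.unescape(chunk).strip() for chunk in found]


# A hard-coded page stands in for a real download here.
page = '<p><span class="image">a.png</span> <span class="image">Fish &amp; chips</span></p>'
print(text_between(page, '<span class="image">', "</span>"))
# prints ['a.png', 'Fish & chips']
```

    A plain string search like this only works when the start tag is written exactly the same way in the page source, which is one reason it struggles with attributes and nested tags.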

  5. #5
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Hello, Leith
    Thank you so much.
    With your code, I was able to retrieve the title of the pages.
    I have 260 pages to retrieve (a lot of) information from.
    One of the things I am not able to retrieve is the language, because there is no end tag.
    For example, in <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pt-pt" lang="pt-pt" > I tried "xml:lang" as the start tag and ">" as the end tag, but the macro raises Run-time error '1004'.
    Can you help?
    Thank you
    Rui

  6. #6
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello Rui,

    The code is written to extract the text between the tags. What you are looking for is a tag attribute. This macro will put all the language attributes of a web page into the collection called "Matches". This is a zero-based collection, meaning the first element is at index zero.
    [Code not shown.]
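
    Since the macro itself is not visible here, this is a minimal Python sketch of the same idea: collect every lang and xml:lang attribute value into a zero-based list. The pattern and the name language_codes are assumptions for illustration, not the pattern used in the macro.

```python
import re

# Matches lang="..." and xml:lang="...", quoted or unquoted.
# The \b keeps attributes such as hreflang from matching.
LANG_ATTR = re.compile(r"""\b(?:xml:)?lang\s*=\s*["']?([A-Za-z0-9-]+)""", re.IGNORECASE)


def language_codes(source):
    """Return all language codes found in the source, in document order."""
    return LANG_ATTR.findall(source)


tag = '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pt-pt" lang="pt-pt" >'
codes = language_codes(tag)
print(codes)     # prints ['pt-pt', 'pt-pt']
print(codes[0])  # first element is at index zero: prints pt-pt
```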

  7. #7
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Thank you!
    One of the greatest things about your code is the possibility of scanning a lot of pages and taking the necessary information.
    I have to retrieve information (<noscript>, <noembed> and, of course, the primary language of the page and the changes that occur) from 260 school pages.
    The problem is that I have a tool to automate some of this retrieval, but it gets fooled a lot. For example, if <span lang=PT occurs in two consecutive paragraphs, the tool counts this as good behavior in terms of accessibility. In fact, it is something like an error of the program used to create the HTML code.
    The tool gives me the numerical results, not the actual attributes, ALT values, etc.
    As far as I can see from what you say, this code works on one page at a time.
    Would it be possible to take the code that reads the URLs and analyses them, and use it to obtain all the language codes in all my pages?

  8. #8
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    I don't see why not. Can you provide me with the code you are using now?

  9. #9
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    And more help, please.
    After creating a new workbook, I tried to run the code in a module.
    All it does is open Internet Explorer and the page "http://www.cpu-world.com/".
    I get no results in Excel. :-(
    One other thing: if it is not possible to use a list of URLs, would it be possible to pass the URL to the macro as an argument instead of having it embedded in the code?
    Sorry for any errors (in the English and Excel languages, as I am not well versed in either).

  10. #10
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    I was not clear about my request. Sorry.
    What I wanted to say was: would it be possible to take your code that reads the URLs and use it with the ParseLanguage code?
    About the tool that I use: it is not mine and I don't have access to it.
    You can see it in action here (http://www.acesso.umic.pt/webax/examinator.php) if you want to, but I think it is useful for some tasks, not for this one (and it is in Portuguese). I think it is easier to take the information directly from the source code of the pages.

  11. #11
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    I can combine them but I need a little more information from you.

    Where would the language information go on the page?
    Would you need all of the language attributes or just one?

  12. #12
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    I rewrote the macro to operate as a function that takes the URL as its argument. It will then return the object. If the object is Nothing then no language attributes were found. There is a second macro which will get the languages on the web page you posted.
    [Code not shown.]

  13. #13
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    I need to retrieve the primary language of the page (normally after the Doctype definition) and the changes in the language definition on the page (normally <span lang="PT" ........> </span> or <p lang="PT".... </p>).
    In the first case, I would only need the language code ("PT", "EN-US", ...).
    In the second case, I have to determine whether the language redefinition corresponds to a real change in the language of the document. The best solution would be to collect the language code and the text that follows for every occurrence and send them to columns in a worksheet, as you did in ScrapeData. After this, I could do a manual check to see if the text is in Portuguese, English, etc. This is important to assistive technologies like screen readers.
    My problem is: I can do the manual check on each individual page, but it is a time-consuming process because I need to check a lot of things (events like onmouseover, onfocus, the content of noscript, ...).
    Your original code for the attributes helps me a lot because I can change the attribute and retrieve the information.

    As for the latest code, I tried it, but I am afraid I don't know how to pass the URL arguments to your code (or where they should be located in the spreadsheet).

  14. #14
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    I forgot to mention what I did:
    In a cell I wrote =Getlanguages(A2), where A2 has a URL
    When I wrote =Getlanguages(http://aebpc.pt/escola/), I get an error.

  15. #15
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    The macro returns a collection object of the matches. There would need to be additional code to copy those matches to another group of cells. I can expand the macro function to take a cell as another argument. It would then enter the languages found starting with that cell and go down. This could be called from the worksheet like a formula. Would that be a better option?

  16. #16
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Thank you for your prompt reply.
    I could not get the latest code to work. I don't know what I am doing wrong.
    So, frustrated, I tried your first code and changed the attributes to read everything between <span lang=> and </span>, and it worked!!! I have already checked some pages manually and it returned the occurrences correctly!

    I think the only remaining problem for the language is the primary language.

    1. Do you think it is possible to have the primary language of a series of URLs returned to another sheet (or to a column next to the URLs)?

    2. I will try with the other information I need to retrieve and tell you what happened.

  17. #17
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    This may work better. It can be used on the worksheet like a formula. You can pass the URL by cell reference. The macro will return the language attributes as a comma-separated list in the cell.
    [Code not shown.]

    Calling the Function on the Worksheet
    [Code not shown.]
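
    The function itself is not visible in this thread. As a hedged illustration only, here is a Python sketch that returns both the primary language (the lang on the html element) and the comma-separated list of every language attribute on the page. LangCollector and languages_csv are hypothetical names, and the sample page is invented.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect lang / xml:lang values; remember the one on <html> as primary."""

    def __init__(self):
        super().__init__()
        self.primary = None
        self.codes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        for name, value in attrs:
            if name in ("lang", "xml:lang") and value:
                self.codes.append(value)
                if tag == "html" and self.primary is None:
                    self.primary = value


def languages_csv(source):
    """Return (primary language or None, comma-separated list of all codes)."""
    parser = LangCollector()
    parser.feed(source)
    parser.close()
    return parser.primary, ", ".join(parser.codes)


sample = (
    '<html xml:lang="pt-pt" lang="pt-pt"><body>'
    '<p lang="en">Hello</p><span lang=PT>Bom dia</span>'
    "</body></html>"
)
primary, all_codes = languages_csv(sample)
print(primary)    # prints pt-pt
print(all_codes)  # prints pt-pt, pt-pt, en, PT
```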

  18. #18
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    It worked perfectly!

    I will try to collect the other information and will tell you how it worked.
    Thank you so much!

  19. #19
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    Glad to hear the good news. I will be looking forward to your next results.

  20. #20
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Hello Leith
    Do you think it would be possible to use the original code to obtain the information from saved pages (or saved source code), instead of the live online page?
    I may have to "freeze" the results and I think this would be a good way to do it.

  21. #21
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Hello, Leith:

    I managed to retrieve some of the information I needed, like the content of <noscript>, etc.
    Unfortunately, I have difficulties retrieving
    - The content of the ALT attribute
    - The occurrence of <b>, <i>, <u>, <em>, <strong>
    - The occurrence of headers <h1> to <h6>
    - The occurrences of onmouseover, onfocus, onblur, onclick, onkeypress, onmouseout
    :-(
    Also, as the analysis takes time, it would be very useful to do the analysis on a sample from day X, something like "the sample as it was on 25 October, the analysis done from 25 October to 10 November". But for this, the code would have to read from the saved pages.
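
    One way the items in this list could be collected from saved source files is sketched below in Python. This is an illustration under assumptions, not code from this thread: the pages are assumed to be saved as plain HTML files, and PageAudit, audit_source and audit_saved_page are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path

INLINE_TAGS = {"b", "i", "u", "em", "strong"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
EVENTS = {"onmouseover", "onfocus", "onblur", "onclick", "onkeypress", "onmouseout"}


class PageAudit(HTMLParser):
    """Collect ALT values and count selected tags and event attributes."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []
        self.tag_counts = Counter()
        self.event_counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS or tag in HEADINGS:
            self.tag_counts[tag] += 1
        for name, value in attrs:
            if name == "alt":
                self.alt_texts.append(value or "")  # keep empty ALT values too
            elif name in EVENTS:
                self.event_counts[name] += 1


def audit_source(source):
    """Audit HTML source that is already in memory."""
    audit = PageAudit()
    audit.feed(source)
    audit.close()
    return audit


def audit_saved_page(path, encoding="utf-8"):
    """Audit a page saved to disk, so results stay fixed to the day it was saved."""
    return audit_source(Path(path).read_text(encoding=encoding, errors="replace"))


sample = (
    '<h1>Title</h1><img src="a.png" alt="School logo"><img src="b.png" alt="">'
    '<p onmouseover="x()"><b>bold</b> and <em>emph</em></p>'
)
result = audit_source(sample)
print(result.alt_texts)           # prints ['School logo', '']
print(dict(result.tag_counts))    # prints {'h1': 1, 'b': 1, 'em': 1}
print(dict(result.event_counts))  # prints {'onmouseover': 1}
```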

  22. #22
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Hello Leith:
    Through searching, I realised that there are programs that do something similar to your code, but they are normally used to retrieve data. I downloaded three: Data Extractor, DEiXTo (a Greek program) and one aptly named Extract Data & Text From Multiple Web Sites Software. And even a Firefox extra (OutWit Hub Light).
    Unfortunately, I cannot enter the search parameters correctly, so I'm stuck.
    I would ask for help, but I think the number of things I am asking for demands a lot of work.
    So I understand perfectly if you cannot answer my request.
    Anyway, I must thank you for your enormous help with some of my needs and for making me start to think about the possibilities (and even learn a little; that I cannot learn more is my fault, not that of the excellent teacher). Thank you

  23. #23
    Forum Moderator Leith Ross
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & 2010
    Posts
    23,258

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    The macros I have written are for specific searches and were not intended to perform a broad range of search options. However, I have been working on such a program but it is far from complete.

  24. #24
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Yes, I understand that these macros are specific.
    Thus, I tried to change the search parameters in the first code and I succeeded in some cases. Then I tried to understand RegExp.Pattern = "([xml\:l|\sl]ang="".{2,9}(?:""))\s" and went to the Microsoft help to try to understand this RegExp, but to no avail.
    Does your program search for some of the strings I need? In some cases, I can see patterns like: if it finds ALT, it returns the text surrounded by quotes that follows, and I could change these to suit my needs. Also, in some cases I just need to search for strings of text: if it finds "onmouseover", it puts this in a column.
    If it has these capabilities and is usable by someone with basic skills, I would appreciate it.
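
    For anyone puzzling over the same pattern, here is a short Python sketch of how it reads. In the VBA string each doubled quote stands for one literal quote. The square brackets form a character class, so [xml\:l|\sl] matches a single character (x, m, l, a colon, a bar or whitespace) in front of ang=" and is not a choice between "xml:l" and " l". Then .{2,9} takes 2 to 9 characters up to the closing quote, and the final \s requires whitespace after it. Python's re module is assumed here to treat these constructs the same way as the VBScript engine.

```python
import re

# The pattern as the regex engine sees it once the VBA quote doubling is undone.
ORIGINAL = r'([xml\:l|\sl]ang=".{2,9}(?:"))\s'

tag = '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pt-pt" lang="pt-pt" >'
print(re.findall(ORIGINAL, tag))
# prints ['lang="pt-pt"', 'lang="pt-pt"']

# The trailing \s means an attribute followed directly by ">" is not matched.
print(re.findall(ORIGINAL, '<p lang="en">Hi</p>'))
# prints []
```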

  25. #25
    Registered User
    Join Date
    11-30-2011
    Location
    Raleigh, NC
    MS-Off Ver
    Excel 2010
    Posts
    1

    Re: Extract Data from Web Page Source Code

    This thread has been very helpful! I was able to use Leith's code to extract data from a web page successfully when I use StartTag = <a single tag>. However, the ScrapeData subroutine stopped working when I tried to search for nested tags.

    Example:
    <div class="column secondary review-body">
    <p class="review-text">
    <span>
    Need to retrieve this data.
    </span>

    Any thoughts on how I can get the code to work with nested tags?

    Thanks!

    LJ
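
    A start/end string search has trouble with nested tags because the start "tag" spans several elements. One alternative, sketched here in Python as an illustration rather than a change to ScrapeData, is to walk the tags with a parser and keep the text of any span found inside a p with class review-text. NestedText and span_texts are hypothetical names.

```python
from html.parser import HTMLParser


class NestedText(HTMLParser):
    """Collect text of <span> elements located inside <p class="review-text">."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.buffer = None   # list of text pieces while inside a wanted span
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "review-text" in classes:
            self.in_review = True
        elif tag == "span" and self.in_review and self.buffer is None:
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self.buffer is not None:
            # Collapse line breaks and repeated spaces into single spaces.
            self.results.append(" ".join("".join(self.buffer).split()))
            self.buffer = None
        elif tag == "p":
            self.in_review = False

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)


def span_texts(source):
    parser = NestedText()
    parser.feed(source)
    parser.close()
    return parser.results


sample = """<div class="column secondary review-body">
<p class="review-text">
<span>
Need to retrieve this data.
</span>
"""
print(span_texts(sample))  # prints ['Need to retrieve this data.']
```

    This simple version does not handle a span nested inside another span; the first closing tag ends the capture.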

  26. #26
    Registered User
    Join Date
    03-30-2014
    Location
    Brussels
    MS-Off Ver
    Excel 2010
    Posts
    1

    Re: Extract Data from Web Page Source Code

    Hello Leith,
    I tried your code and it works perfectly. My problem is that I also work on another computer which uses a proxy to access the internet. This causes the code not to work. Would it be possible to have the same code access the internet through Internet Explorer instead of accessing the web directly from Excel?

    Thanks,
    Laurent
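
    An alternative to driving Internet Explorer is to keep the direct request and tell it about the proxy. The Python sketch below only illustrates that idea and is not the thread's VBA; the proxy address in the usage comment is a placeholder, and make_opener and fetch_via_proxy are hypothetical names.

```python
from urllib.request import ProxyHandler, build_opener


def make_opener(proxy_url=None):
    """Build a URL opener that sends requests through the given proxy.

    With no proxy_url, ProxyHandler() falls back to the proxy settings
    it can detect from the system or environment.
    """
    if proxy_url:
        handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    else:
        handler = ProxyHandler()
    return build_opener(handler)


def fetch_via_proxy(url, proxy_url=None, timeout=10):
    """Return (status code, raw bytes) for url, fetched through the proxy."""
    with make_opener(proxy_url).open(url, timeout=timeout) as resp:
        return resp.status, resp.read()


# Example call (placeholder proxy address, not run here):
# status, body = fetch_via_proxy("http://aebpc.pt/escola/", "http://proxy.example:8080")
```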
