+ Reply to Thread
Page 1 of 2 12 LastLast
Results 1 to 15 of 25

Thread: Extract Data from Web Page Source Code

  1. #1
    Registered User
    Join Date
    10-04-2011
    Location
    Swindon, England
    MS-Off Ver
    Excel 2007
    Posts
    2

    Post Extract Data from Web Page Source Code

    I am trying to find a piece of code that will allow data to be retrieved from the source of a HTML page and be placed in a column in excel.
    I have a spreadsheet that cointains the url of the web page, and have the code to view the source, however, what i can't do is extract data within a specific span tag.
    i.e. i want to be able to go to a web address in colum a (www.awebsite.com) and extract data between <span class = "image"> and</span> tags and paste this value into the colum B. The reason is i have a large amount of products to maintain and need to ensure they always have an image as the image paths change regularly. Can anyone help?

  2. #2
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello R1chard,

    Welcome to the Forum!

    Yes the data can be extracted from the between the tags. My question is why is the page source in the workbook? Surely, you don't need to save the entire page source do you?
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

  3. #3
    Registered User
    Join Date
    10-04-2011
    Location
    Swindon, England
    MS-Off Ver
    Excel 2007
    Posts
    2

    Re: Extract Data from Web Page Source Code

    I may not have explained myself clearly before, the page source isn't in the workbook, i have a URL to a web page in the work book. I then have some code that opens the source code of this page (in IE), i then want to be able to extract the HTML and text between the span tags, and place only this selected text back in to excel next to the url. The part i am having trouble with is extracting only the data between the span tags. Any help would be gratefully received.

    This is what i have so far:
    Sub CheckImg()
    
      Dim IEapp As Object
      Dim ieDoc As Object
      Dim Site As String
      
        Site = Range("A1")
       
        
        Set IEapp = CreateObject("InternetExplorer.Application")
        IEapp.Navigate Site
        
          While IEapp.Busy
            DoEvents
          Wend
          
          Set ieDoc = IEapp.Document
          
      '*** the part i am missing is the bit that goes here that sereched the HTML source and
      '    extracts data from between the span tags and places it in cell B1
          
    Finish:
        IEapp.Quit
        Set IEapp = Nothing
        Set ieDoc = Nothing
        
        
    End Sub

  4. #4
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello R1chard,

    Sorry about the delay. I have been experiencing system problems for the past 2 days and everything has been unstable. None of my diagnostics has been able to pin point the problem. I finally got some code together for you despite these problems.

    The attached workbook contains 5 macros in 3 separate modules. This code works much faster than using Internet Explorer because it accesses the server directly to retrieve the page source. Another advantage is it returns the status of the server. So, if there is problem it can be identified. The data between the start and end tag is copied down the worksheet from a cell you specify. One of the macros is used to convert HTML amp codes i.e. &nbsp; into actual characters.

    The macro ScrapeData is setup to read a list of URLs. Each URL's parsed tag pair data is added to a single column on "Sheet2" below the header row. The data from the next URL is placed in the next column to the right. Previous data is cleared before the new data is copied. Have a look a let me know if you need any changes made.
    Attached Files Attached Files
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

  5. #5
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Hello, Leith
    Thank you so much.
    With your code, I was able to retrieve the title of the pages.
    I have 260 pages to retrieve (a lot of) information from.
    One of the things I am not able to retrieve is the language because there is no end tag
    For example, in <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pt-pt" lang="pt-pt" > I tried something like "xml:lang" as start tag and ">" as end tag but the macro sends a Run-Time error '1004'
    Can you help?
    Thank you
    Rui

  6. #6
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello Rui,

    The code is written to extract the text between the tags. What you are looking for is a tag attribute. This macro will put all the language attributes of a web page into the collection called "Matches". This is zero based collection meaning the first element is at index zero of the collection.
    ' Written: October 20, 2011
    ' Author:  Leith Ross
    ' Summary: Finds all the langauge attributes on a web page
    
    Sub ParseLanguage()
    
        Dim ieApp As Object
        Dim ieDoc As Object
        Dim Lang As String
        Dim Matches As Object
        Dim RegExp As Object
        Dim Text As String
        Dim URL As String
        
            URL = "http://www.cpu-world.com/"
            
            Set ieApp = CreateObject("InternetExplorer.Application")
            
                ieApp.Navigate URL
                ieApp.Visible = True
                
                While ieApp.Busy And ieApp.ReadyState <> 4: DoEvents: Wend
                
                While ieDoc Is Nothing: Set ieDoc = ieApp.Document: DoEvents: Wend
                
                  ' Returns the source text
                    Text = ieDoc.body.outerHTML
                    
                  ' Parse out the language attributes for either XML or HTML
                    Set RegExp = CreateObject("VBScript.RegExp")
                    RegExp.Global = True
                    RegExp.Pattern = "([xml\:l|\sl]ang="".{2,9}(?:""))\s"
             
                      ' Display the number of matching attributes and assign first match to the variable Lang
                        If RegExp.Test(Text) Then
                         ' Return a collection of all matches
                           Set Matches = RegExp.Execute(Text)
                           MsgBox Matches.Count & " Language attribute(s) found."
                           Lang = Matches(0)    ' First Match
                        End If
                
    End Sub
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

  7. #7
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Thank you!
    One of the greatest things of your code is the possibility of scanning a lot of pages and take the necessary information.
    I have to retrieve information (<noscript>, <noembed>, and, of course, the primary language of the page and the changes that occur) in 260 pages of schools.
    The problem is that I have a tool to automate some of this retrieval but it gets fooled a lot,. For example, if <span lang=PT occurs in two consecutive paragraphs, the tool counts this behavior as good in terms of accessibility. In fact, is something like an error of the program used to create the HTML code.
    The tool gives me the numerical results, not the actual attributes, ALT values, etc.
    As I can see from what you say, this code works for a page at a time.
    Would it be possible to take the code that takes the URL's and analyses them and use it to obtain all the language codes in all my pages?

  8. #8
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    I don't see why not. Can you provide me with the code you are using now?
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

  9. #9
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    And more help, please.
    After creating a new wokbook, I tried to run the code in a module.
    All it does is opening Internet Explorer and the page "http://www.cpu-world.com/"
    I get no results in Excel. :-(
    One other thing. If not possible to use a list of URL's, would it be possible to use the macro with the URL as an argument instead of being embeded in the code?
    Sorry for some errors that may occur (english and Excel languages - I'm not versed in one nor the other)

  10. #10
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    I was not clear about my request. Sorry.
    What I waned to say was: Would it be possible to take your code that reads the URLs and use it with ParseLanguage code?
    About the tool that I use, it is not mine and I don't have access to it.
    You can see it in action here (http://www.acesso.umic.pt/webax/examinator.php), if you want to but I think it is useful for some taks, not for this one (and it is in portuguese). I think it easier to take the info directily from the source code of the pages.

  11. #11
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    I can combine them but I need a little more information from you.

    Where would the language information go on the page?
    Would need all of the language attributes or just 1?
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

  12. #12
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    I rewrote the macro to operate as function that takes the URL as its argument. It will then return the object. If the object is Nothing then no language attributes were found. There is a second macro which will the languages on the web page you posted.
    ' Written: October 20, 2011
    ' Author:  Leith Ross
    ' Summary: Returns all the langauge attributes on a web page
    
    Function GetLanguages(ByVal URL As String) As Object
    
        Dim ieApp As Object
        Dim ieDoc As Object
        Dim Lang As String
        Dim Matches As Object
        Dim RegExp As Object
        Dim Request As Object
        Dim Text As String
        
    
            On Error Resume Next
               Set Request = CreateObject("WinHttp.WinHttpRequest.5.1")
               If Request Is Nothing Then
                  Set Request = CreateObject("WinHttp.WinHttpRequest.5")
               End If
            Err.Clear
            On Error GoTo 0
    
            Request.Open "GET", URL, False
            Request.Send
        
            Text = Request.responsetext
                    
                  ' Parse out the language attributes for either XML or HTML
                    Set RegExp = CreateObject("VBScript.RegExp")
                    RegExp.Global = True
                    RegExp.Pattern = "([xml\:l|\sl]ang="".{2,9}(?:""))\s"
             
                      ' Display the number of matching attributes and assign first match to the variable Lang
                        If RegExp.Test(Text) Then
                         ' Return a collection of all matches
                           Set Matches = RegExp.Execute(Text)
                        End If
                        
            Set GetLanguages = Matches
             
    End Function
    
    Sub LangTest()
    
        Dim Lang As Variant
        Dim Languages As Object
        Dim Msg As String
      
            Set Languages = GetLanguages("http://www.acesso.umic.pt/webax/examinator.php")
        
            If Not Languages Is Nothing Then
               For Each Lang In Languages
                   Msg = Msg & Lang & vbCrLf
               Next Lang
           
               MsgBox "The Lanaguage Attributes for this Page are:" & vbCrLf & vbCrLf & Msg
            End If
        
    End Sub
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

  13. #13
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    I need to retrieve the primary language of the page (normally after de Doctype definition) and the changes in the language definition on the page (normally <span lang="PT" ........> </span> or <p lang="PT".... </p>
    In the first case, I would only need the language code ("PT", "EN-US",...)
    In the second case, as I have to determine if the language redefinition corresponds to a real change in the language of the document, the best solution would be to collect language code and the text that follows in every occurrence and send it to columns in a worksheetv like you did in the Scrape data (after this, I could do a manual check to see if the text is in Portuguese, English, etc. - this is important to assistive technologies like screen readers).
    My problem is: I can do the manual check in each individual page but it is a time-consuming process, because I need to check a lot of things (events like onmouseover, onfocus, the content of noscript,
    Your original code for the attributes helps me a lot because I can change the attribute and retrieve the information.

    As for the latest code, I tried it but I am afraid I don't know how I can pass the URL arguments to your code (or where they should be located in the spreadsheet).

  14. #14
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    I forgot to mention what I did:
    In a cell I wrote =Getlanguages(A2), where A2 as a URL
    When I wrote =Getlanguages(http://aebpc.pt/escola/), I get an error.

  15. #15
    Forum Moderator Leith Ross's Avatar
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    The macro returns a collection object of the matches. There would need to be additional code to copy those matches to another group of cells. I can expand the macro function to take a cell as another argument. It would then enter the languages found starting with that cell and go down. This could be called from the worksheet like a formula. Would that be a better option?
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Tags for this Thread

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.2.0