
Thread: Extract Data from Web Page Source Code

  1. #16
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Thank you for your prompt reply.
    I could not get the latest code to work, and I don't know what I am doing wrong.
    So, frustrated, I tried your first code and changed the attributes to read everything between <span lang=> and </span>, and it worked! I have already checked some pages manually and it returned the occurrences correctly.

    I think the only remaining problem for the language is the primary language.

    1. Do you think it is possible to have the primary language of a series of URLs returned to another sheet (or to a column next to the URLs)?

    2. I will try with the other information I need to retrieve and tell you what happened.
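
    One possible way to approach question 1 is sketched below. It assumes that "primary language" means the lang (or xml:lang) attribute on the page's <html> tag, which the thread does not confirm. GetPrimaryLanguage is a hypothetical name, and the function expects the page source as text rather than a URL.

    ```vba
    ' Sketch only: GetPrimaryLanguage is a hypothetical helper.
    ' Assumption: the primary language is the lang or xml:lang value on the <html> tag.
    Function GetPrimaryLanguage(ByVal Text As String) As String

        Dim Matches As Object
        Dim RegExp As Object

        Set RegExp = CreateObject("VBScript.RegExp")
        RegExp.Global = False        ' only the first <html> tag matters
        RegExp.IgnoreCase = True

        ' <html, then any attributes, then lang="..." with the value captured
        RegExp.Pattern = "<html\b[^>]*?\blang\s*=\s*""([^""]+)"""

        Set Matches = RegExp.Execute(Text)
        If Matches.Count > 0 Then
            GetPrimaryLanguage = Matches(0).SubMatches(0)
        End If

    End Function
    ```

    If the <html> tag carries no language attribute, the function returns an empty string.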

  2. #17
    Leith Ross, Forum Moderator
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    This may work better. It can be used on the worksheet like a formula, and you can pass the URL by cell reference. The function returns the language attributes as a comma separated list in the cell.
    ' Written: October 20, 2011
    ' Author:  Leith Ross
    ' Summary: Returns all the language attributes on a web page
    
    Function GetLanguages(ByVal URL As String) As String
    
        Dim Languages As String
        Dim Match As Object
        Dim Matches As Object
        Dim RegExp As Object
        Dim Request As Object
        Dim Text As String
        
          ' Create the HTTP request object, falling back to the older version
            On Error Resume Next
               Set Request = CreateObject("WinHttp.WinHttpRequest.5.1")
               If Request Is Nothing Then
                  Set Request = CreateObject("WinHttp.WinHttpRequest.5")
               End If
            Err.Clear
            On Error GoTo 0
    
          ' Download the page source synchronously
            Request.Open "GET", URL, False
            Request.Send
        
            Text = Request.ResponseText
                    
          ' Parse out the language attributes for either XML or HTML
            Set RegExp = CreateObject("VBScript.RegExp")
            RegExp.Global = True
            RegExp.Pattern = "([xml\:l|\sl]ang="".{2,9}(?:""))\s"
             
          ' Build a comma separated list of all matching attributes
            Set Matches = RegExp.Execute(Text)
            For Each Match In Matches
                Languages = Languages & Match.Value & ","
            Next Match
                        
          ' Strip the trailing comma; return an empty string if nothing matched
            If Len(Languages) > 0 Then
               GetLanguages = Left(Languages, Len(Languages) - 1)
            End If
             
    End Function

    Calling the Function on the Worksheet
      ' Column "A" has the URLs, Column "B" this UDF
        =GetLanguages(A1)
        =GetLanguages(A2)
        =GetLanguages(A3)
    Sincerely,
    Leith Ross

    Remember To Do the Following....

    1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
    2. Thank those who have helped you by clicking the Star below the post.
    3. Please mark your post [SOLVED] if it has been answered satisfactorily.


    Old Scottish Proverb...
    Luathaid gu deanamh maille! (Rushing causes delays!)

  3. #18
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    It worked perfectly!

    I will try to collect the other information and will tell you how it worked.
    Thank you so much!

  4. #19
    Leith Ross, Forum Moderator
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    Glad to hear the good news. I will be looking forward to your next results.
    Sincerely,
    Leith Ross


  5. #20
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Hello Leith
    Do you think it would be possible to use the original code to obtain the information from saved pages (or saved source code), instead of the live online page?
    I may have to "freeze" the results and I think this would be a good way to do it.
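
    A minimal sketch of how saved pages could be read, assuming each page was saved to disk as a plain .htm or .txt file. ReadSavedPage is a hypothetical helper, not part of the code posted in this thread. It returns the file contents as a string, which could then take the place of Request.ResponseText in the posted function.

    ```vba
    ' Sketch only: ReadSavedPage is a hypothetical helper.
    ' Assumption: FilePath points to an existing saved page, e.g. C:\Pages\sample.htm
    Function ReadSavedPage(ByVal FilePath As String) As String

        Dim FileNum As Integer
        Dim Text As String

        FileNum = FreeFile
        Open FilePath For Binary Access Read As #FileNum

        ' Size the buffer to the file length, then read the whole file at once
        Text = Space$(LOF(FileNum))
        Get #FileNum, , Text

        Close #FileNum

        ReadSavedPage = Text

    End Function
    ```

    The bytes are read as they are stored, so the markup of a UTF-8 page stays searchable but accented characters in the text may look garbled.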

  6. #21
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Hello, Leith:

    I managed to retrieve some of the information I needed, such as the content of <noscript>, etc.
    Unfortunately, I have difficulties retrieving
    - The content of the ALT attribute
    - The occurrences of <b>, <i>, <u>, <em>, <strong>
    - The occurrences of headers <h1> to <h6>
    - The occurrences of onmouseover, onfocus, onblur, onclick, onkeypress, onmouseout
    :-(
    Also, as the analysis takes time, it would be very useful to run it on a sample frozen on a given day, for example: the sample as it was on 25 October, with the analysis done from 25 October to 10 November. For this, the code would have to read from the saved pages.
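
    The items in the list above could be approached with two small helpers, sketched here under the assumption that the page source is already held in a string. CountMatches and GetAltText are hypothetical names, and the patterns are simple illustrations that only handle double-quoted attribute values.

    ```vba
    ' Sketch only: both functions are hypothetical helpers.

    ' Counts how many times a regular expression pattern occurs in Text
    Function CountMatches(ByVal Text As String, ByVal Pattern As String) As Long

        Dim RegExp As Object

        Set RegExp = CreateObject("VBScript.RegExp")
        RegExp.Global = True
        RegExp.IgnoreCase = True
        RegExp.Pattern = Pattern

        CountMatches = RegExp.Execute(Text).Count

    End Function

    ' Returns the values of all ALT attributes as a comma separated list
    Function GetAltText(ByVal Text As String) As String

        Dim Match As Object
        Dim RegExp As Object
        Dim Result As String

        Set RegExp = CreateObject("VBScript.RegExp")
        RegExp.Global = True
        RegExp.IgnoreCase = True
        RegExp.Pattern = "\salt\s*=\s*""([^""]*)"""

        For Each Match In RegExp.Execute(Text)
            Result = Result & Match.SubMatches(0) & ","
        Next Match

        If Len(Result) > 0 Then GetAltText = Left$(Result, Len(Result) - 1)

    End Function
    ```

    Possible patterns to pass to CountMatches:
      ' Opening <b>, <i>, <u>, <em> and <strong> tags
        <(b|i|u|em|strong)\b
      ' Opening header tags <h1> to <h6>
        <h[1-6]\b
      ' Event handler attributes
        \bon(mouseover|focus|blur|click|keypress|mouseout)\s*=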

  7. #22
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Hello Leith:
    Through searching, I realised that there are programs that do something similar to your code, although they are normally used to retrieve data. I downloaded three: Data Extractor, DEiXTo (a Greek program) and one aptly named Extract Data & Text From Multiple Web Sites Software, plus a Firefox extension (OutWit Hub Light).
    Unfortunately, I cannot work out how to enter the search parameters correctly, so I'm stuck.
    I would ask for help, but I think the number of things I am asking for demands a lot of work.
    So I understand perfectly if you cannot answer my request.
    Anyway, I must thank you for your enormous help with some of my needs and for making me start to think about the possibilities (and even learn a little; if I cannot learn more, it is not for lack of an excellent teacher). Thank you.

  8. #23
    Leith Ross, Forum Moderator
    Join Date
    01-15-2005
    Location
    San Francisco, Ca
    MS-Off Ver
    2000, 2003, & read 2007
    Posts
    15,979

    Re: Extract Data from Web Page Source Code

    Hello Ru1,

    The macros I have written are for specific searches and were not intended to perform a broad range of search options. However, I have been working on such a program but it is far from complete.
    Sincerely,
    Leith Ross


  9. #24
    Registered User
    Join Date
    10-20-2011
    Location
    N/A
    MS-Off Ver
    Excel 2003
    Posts
    12

    Re: Extract Data from Web Page Source Code

    Yes, I understand that these macros are specific.
    Thus, I tried to change the search parameters in the first code and I succeeded in some cases. Then I tried to understand RegExp.Pattern = "([xml\:l|\sl]ang="".{2,9}(?:""))\s" and went to the Microsoft help to try to understand this RegExp pattern, but to no avail.
    Does your program search some of the strings I need? In some cases, I can see patterns like: if it finds ALT, it returns the text surrounded by quotes that follows and I could change these to suit my needs. Also, in some cases I just need to search for strings of text - if it finds "onmouseover", it puts this in a column.
    If it has these capabilities and is usable by someone with basic skills, I would appreciate it.
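
    For readers puzzling over the same pattern: the square brackets in [xml\:l|\sl] form a character class, so that part matches a single character (one of x, m, l, a colon, a vertical bar, or whitespace) rather than the words "xml:l" or " l". It is followed by the literal text ang=, a double quote, any 2 to 9 characters, a closing double quote, and one whitespace character. The sketch below spells out a similar search piece by piece. It is an illustration under stated assumptions, not the author's method, and ListLanguageCodes is a hypothetical name.

    ```vba
    ' Sketch only: ListLanguageCodes is a hypothetical helper.
    ' Returns the values of lang and xml:lang attributes as a comma separated list.
    Function ListLanguageCodes(ByVal Text As String) As String

        Dim Match As Object
        Dim RegExp As Object
        Dim Result As String

        Set RegExp = CreateObject("VBScript.RegExp")
        RegExp.Global = True
        RegExp.IgnoreCase = True

        ' \b             word boundary, so "hreflang" is skipped
        ' (?:xml:)?      optional "xml:" prefix
        ' lang\s*=\s*    the attribute name and the equals sign
        ' ""             one double quote (doubled inside a VBA string)
        ' ([^""]{2,9})   2 to 9 characters that are not a quote, captured
        ' ""             the closing double quote
        RegExp.Pattern = "\b(?:xml:)?lang\s*=\s*""([^""]{2,9})"""

        For Each Match In RegExp.Execute(Text)
            Result = Result & Match.SubMatches(0) & ","
        Next Match

        If Len(Result) > 0 Then ListLanguageCodes = Left$(Result, Len(Result) - 1)

    End Function
    ```

    Because the value is captured with [^""] instead of a dot, the match cannot run past the closing quote into the next attribute.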

  10. #25
    Registered User
    Join Date
    11-30-2011
    Location
    Raleigh, NC
    MS-Off Ver
    Excel 2010
    Posts
    1

    Re: Extract Data from Web Page Source Code

    This thread has been very helpful! I was able to use Leith's code to extract data from a web page successfully when I use StartTag = <a single tag>. However, the ScrapeData subroutine stopped working when I tried to search for nested tags.

    Example:
    <div class="column secondary review-body">
    <p class="review-text">
    <span>
    Need to retrieve this data.
    </span>

    Any thoughts on how I can get the code to work with nested tags?

    Thanks!

    LJ
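
    The ScrapeData subroutine is not shown on this page, so the following is only a sketch of one alternative for the nested example above: a single regular expression that spans the <p class="review-text"> and <span> tags and captures the text between <span> and </span>. GetReviewText is a hypothetical name, and the pattern assumes the markup looks exactly like the example.

    ```vba
    ' Sketch only: GetReviewText is a hypothetical helper.
    ' Assumption: the review text sits in a <span> directly inside <p class="review-text">.
    Function GetReviewText(ByVal Text As String) As String

        Dim Matches As Object
        Dim RegExp As Object

        Set RegExp = CreateObject("VBScript.RegExp")
        RegExp.Global = False        ' first review only
        RegExp.IgnoreCase = True

        ' [\s\S]*? matches any characters, including line breaks, as few as possible
        RegExp.Pattern = "<p class=""review-text"">\s*<span>\s*([\s\S]*?)\s*</span>"

        Set Matches = RegExp.Execute(Text)
        If Matches.Count > 0 Then
            GetReviewText = Matches(0).SubMatches(0)
        End If

    End Function
    ```

    Setting Global to True and looping over the matches would return every review on the page instead of the first one.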
