+ Reply to Thread
Results 1 to 15 of 15

Web scraping question

  1. #1
    Registered User
    Join Date
    07-29-2018
    Location
    In the Queen's Shed
    MS-Off Ver
    2013
    Posts
    23

    Web scraping question

    This is giving me a headache. Below is my code. This forum does not let me post URLs, which is ever so silly. Where it says "URL =" in the code below, enter any (URL in quotation marks) from a bibliography page listed on espacenet, e.g. for patent Nr. KR20190082024.

    The responsetext seems to return the web page header, but of course I want to grab the body. Using the InternetExplorer.Application object is no option; this is slow and prone to error.

    Please help.

    Please Login or Register  to view this content.
    Last edited by 6StringJazzer; 06-11-2020 at 03:13 PM.

  2. #2
    Administrator 6StringJazzer's Avatar
    Join Date
    01-27-2010
    Location
    Tysons Corner, VA, USA
    MS-Off Ver
    MS365 Family 64-bit
    Posts
    24,691

    Re: Web scraping question

    When I do this in my own code I get the body. Actually I just wrote some code to do this yesterday. PM me the URL and I will try it, and also add it to your post.

    This forum does not let me post URLs, which is ever so silly.
    After you have posted more times, it is allowed. New members cannot post URLs because 90% of the time it means it's spam.
    Jeff
    | | |·| |·| |·| |·| | |:| | |·| |·|
    Read the rules
    Use code tags to [code]enclose your code![/code]

  3. #3
    Registered User
    Join Date
    07-29-2018
    Location
    In the Queen's Shed
    MS-Off Ver
    2013
    Posts
    23

    Re: Web scraping question

    Quote Originally Posted by 6StringJazzer View Post
    PM me the URL and I will try it, and also add it to your post.
    Done & many thanks in advance for your help!

  4. #4
    Administrator 6StringJazzer's Avatar
    Join Date
    01-27-2010
    Location
    Tysons Corner, VA, USA
    MS-Off Ver
    MS365 Family 64-bit
    Posts
    24,691

    Re: Web scraping question

    Your call is retrieving the entire page. There are 52,488 characters in the page source, and you can only put 32,767 in a cell. There are also limits on line feeds, and how many characters can actually be displayed. You are not saving the header, you are saving the actual page--just not all of it.

    What do you ultimately need to do with the page source? If you really need it in cells, you can loop through and distribute it into multiple cells. I might be able to give you other ideas if you can give me the big picture of what you are trying to do.

  5. #5
    Registered User
    Join Date
    07-29-2018
    Location
    In the Queen's Shed
    MS-Off Ver
    2013
    Posts
    23

    Re: Web scraping question

    Quote Originally Posted by 6StringJazzer View Post
    I might be able to give you other ideas if you can give me the big picture of what you are trying to do.
    Please Login or Register  to view this content.
    I am trying to grab the abstract text, for which the html tag is "p". The code returns an error 91.

  6. #6
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Arrow Re: Web scraping question


    If the text you need is not in the responseText of the initial URL page code so you can't grab it with this method …

  7. #7
    Administrator 6StringJazzer's Avatar
    Join Date
    01-27-2010
    Location
    Tysons Corner, VA, USA
    MS-Off Ver
    MS365 Family 64-bit
    Posts
    24,691

    Re: Web scraping question

    The coding for the page you are retrieving is a bit complicated. I am not a web coding expert, but it looks like the page loads then there is JavaScript to retrieve the content on the client side after the page is loaded. You can see what I mean if you load the page in a browser then View Source. I am not sure how to retrieve that content by VBA.

  8. #8
    Registered User
    Join Date
    07-29-2018
    Location
    In the Queen's Shed
    MS-Off Ver
    2013
    Posts
    23

    Re: Web scraping question

    Quote Originally Posted by 6StringJazzer View Post
    I am not sure how to retrieve that content by VBA.
    I get it to work with the following code:
    Please Login or Register  to view this content.
    This way has some drawbacks. It needs switching on the reference libraries Microsoft HTML Office Library and Microsoft Internet Controls. Further, the wait line is needed, because the response is very slow, and the readySTATE line does not safeguard from errors. But it does get the text I am after, which is why I do not understand why it does not work with the late binding method I tried. Maybe the initial "GET" request is indeed to the wrong URL (?). Any thoughts?

  9. #9
    Valued Forum Contributor Naveed Raza's Avatar
    Join Date
    11-04-2012
    Location
    India, Hyderabad
    MS-Off Ver
    Excel, Access 2007/2010
    Posts
    1,338

    Re: Web scraping question

    Hi Pak,

    you can install Selenium Basic application from GITHub and you can write Selenium code in Excel. this will help you alot
    Thanks - Naveed
    -----------------------------
    If the suggestion helps you, then Click * to Add Reputation
    To Attach File: Go Advanced>>Manage Attachments>>Add Files (In Top Right Corner)>>SelectFiles>>.........Locate Your File(s)>>Upload Files>>Done (In Bottom Right)
    1. Use [code] code tags [\code]. It keeps posts clean, easy-to-read, and maintains VBA formatting.
    2. If you are happy with a solution to your problem, mark the thread as [SOLVED] using the tools at the top.

  10. #10
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Arrow Re: Web scraping question


    As Selenium is just for piloting a webbrowser so it may be easier but as slow as piloting directly IE
    and with the same synchronising issue …

  11. #11
    Administrator 6StringJazzer's Avatar
    Join Date
    01-27-2010
    Location
    Tysons Corner, VA, USA
    MS-Off Ver
    MS365 Family 64-bit
    Posts
    24,691

    Re: Web scraping question

    Quote Originally Posted by Pak Mariman View Post
    Maybe the initial "GET" request is indeed to the wrong URL (?). Any thoughts?
    Here is the difference between the two methods you used, and why the second one works.

    The first method performs the HTTP transaction. That is, VBA sends a page request directly to the target URL, and synchronously waits for a response. The server sends the page back, but it contains JavaScript, which VBA does not know how to execute. You get the page exactly as served. To the server, you look the same as a browser.

    The second method is a built-in but clunky way of handling web pages. Instead of sending a request directly to the server, you are asking Internet Explorer to do it for you, then when IE is all done, you can read ask IE to give you the page as it is displayed (which should be all HTML, but I haven't looked). You are letting IE execute the JavaScript for you. This is a bit slower because the VBA has to wait out whatever it is IE is doing. (Early vs. late binding is not the issue; it's a completely different method.)

    Since all you want is the abstract, I can help you isolate that if you are interested.

  12. #12
    Registered User
    Join Date
    07-29-2018
    Location
    In the Queen's Shed
    MS-Off Ver
    2013
    Posts
    23

    Re: Web scraping question

    Quote Originally Posted by 6StringJazzer View Post
    Since all you want is the abstract, I can help you isolate that if you are interested.
    I would be interested in some pointers if and how it is possible to do with the first method. I agree with you that the internet explorer-based method is clunky. I am a bit puzzled why Microsoft has not improved on that. I found that accessing e.g. Google Patents using the internet explorer method does not work at all.

  13. #13
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Arrow Re: Web scraping question


    The way you pilot IE is clunky but just well observing how this webpage works, well reading its code, …
    it does very not need to add a five seconds pause - less than a second is needed on my side - but just
    checking if the expected element exists, is ready or if a collection contains elements !
    Like you can see in many samples on web …

    As a training, just checking the element to grab you must first find an unique property better than its 'P' tag
    'cause even if it works now with this poor common tag
    imagine when the webpage will be updated with a new 'P' element added before the expected one …

  14. #14
    Administrator 6StringJazzer's Avatar
    Join Date
    01-27-2010
    Location
    Tysons Corner, VA, USA
    MS-Off Ver
    MS365 Family 64-bit
    Posts
    24,691

    Re: Web scraping question

    Quote Originally Posted by Pak Mariman View Post
    I would be interested in some pointers if and how it is possible to do with the first method. I agree with you that the internet explorer-based method is clunky. I am a bit puzzled why Microsoft has not improved on that. I found that accessing e.g. Google Patents using the internet explorer method does not work at all.
    It's just not possible with the first method unless someone has written a VBA-callable library to execute JavaScript.

    I have been bumping this around to other places and one suggestion was to use Google Sheets instead of Excel. This formula would retrieve the abstract:

    =importxml("https://worldwide.espacenet.com/data/publicationDetails/biblio?DB=EPODOC&II=0&ND=3&adjacent=true&locale=en_EP&FT=D&date=20190709&CC=KR&NR=20190082024A&KC=A&rnd=1592067094437","/html/body/div[1]/div[6]/div/div[2]/div[2]/p[1]")

  15. #15
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Arrow Re: Web scraping question


    VBA can execute JScript via some external ways like execScript & eval methods or the ScriptControl ActiveX …

    Is Google Sheets faster than Excel FILTERXML / WEBSERVICE formulas, PowerQuery
    or even just piloting IE via VBA (here ~ 2.5 seconds to grab the 'abstract' text) ?
    Last edited by Marc L; 06-14-2020 at 07:29 PM.

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Similar Threads

  1. Web scraping
    By vijay.jp in forum Excel Programming / VBA / Macros
    Replies: 5
    Last Post: 01-02-2015, 12:56 AM
  2. Web Scraping
    By Thomo88 in forum Excel Programming / VBA / Macros
    Replies: 1
    Last Post: 04-29-2014, 07:21 AM
  3. Replies: 1
    Last Post: 02-17-2014, 11:21 PM
  4. VBA Web Scraping question - unable to refresh/update map
    By shahzeb in forum Excel Programming / VBA / Macros
    Replies: 0
    Last Post: 01-24-2014, 12:40 PM
  5. Web Scraping Question
    By vexel77 in forum Excel Programming / VBA / Macros
    Replies: 7
    Last Post: 07-21-2013, 05:21 PM
  6. [SOLVED] Beginner question about scraping web pages with Excel VBA
    By sheilnaik in forum Excel Programming / VBA / Macros
    Replies: 1
    Last Post: 03-28-2013, 11:45 AM
  7. Web scraping
    By mickbarry in forum Excel Formulas & Functions
    Replies: 2
    Last Post: 02-01-2006, 06:25 AM

Tags for this Thread

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1