Google Tag Manager & Web Scraping

  1. #1
    Valued Forum Contributor
    Join Date
    08-29-2012
    Location
    In lockdown
    MS-Off Ver
    Excel 2010 (2003 to 2016 but 2010 for choice)
    Posts
    1,766

    Question Google Tag Manager & Web Scraping

    HTML newbie here. I have been using VBA to scrape the web (by automating Internet Explorer) for a while now. (Yes, yes, I know there are other VBA scraping methods which are way faster but these don't work on certain websites)

    There is a certain website which I will call xyz.com for the purpose of this thread (1. I want to keep it confidential and 2. The website's membership is not available to the public so I will have to do the testing anyway)

    After you log on to xyz.com, you can enter an ID into a search box to return a single record. Up to now I have been automating IE to loop through a list of IDs, running a search on each and finally returning a formatted Excel workbook with the collected results. This method works but it is very slow.

    I want to decrease the overall running time. So recently I had a closer look at the page source. The site appears to be using something I had never heard of before called 'Google Tag Manager'. From what I can make out, the ID you enter in the input box is sent when you click on the Next button. This triggers a POST(?) to GTM using JavaScript(?) (I'm an HTML newbie!). It then loads a new page with the search results.

    I expect I could double the overall speed if I could cut out the need to return to the search page between each search results page. But to do that, I would need an alternative way of requesting the search results from the site.

    Is there a way of sending my desired search values to the JavaScript?/GoogleTagManager?

  2. #2
    Forum Guru Kyle123
    Join Date
    03-10-2010
    Location
    Leeds
    MS-Off Ver
    365 Win 11
    Posts
    7,238

    Re: Google Tag Manager & Web Scraping

    Google Tag Manager is simply a way of loading website analytics and tracking scripts, so the owner of the site can see who visits which page and how they behave. Unless your search is something to do with Google Analytics, it is very unlikely that it has anything at all to do with the search; more likely it is simply sending data back to Google on your behaviour. I fear you may be on a wild goose chase as a result of having gotten the wrong end of the stick.

    As a very rough guide, to get the actual search data you need to work out whether it's a POST or a GET request. So, on the results page: firstly, does the whole page refresh when loading the results? If not, it's likely the page is using JavaScript to fetch the data. If it does, then check whether the URL at the top of the page includes the search term. If it does, you should be able to navigate directly to it. If it doesn't, then open the developer tools of your web browser, go to the network tab, turn on "preserve history/log" (varies depending on browser; it's a little circle in the top left on Chrome) and make the request again. Once you've done that, you'll end up with all the requests made when loading the page. You'll be looking for one that shows the type as POST, where the URL is the search page; you'll then be looking in the request body section.

    This should give you an idea of what is actually going on to get you the search results.
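
    For the simpler GET case described above (where the search term is visible in the URL), a minimal sketch might look like the following. The URL pattern and the "id" parameter name are made up for illustration, and it assumes the IDs are plain letters and digits that need no URL encoding.

    ```vb
    Option Explicit

    ' Hypothetical URL pattern: the real site and its parameter name are unknown.
    Private Const GET_SEARCH_URL As String = "https://xyz.com/search?id="

    ' Build the address for one record by appending the ID to the query string.
    Public Function BuildSearchUrl(ByVal recordId As String) As String
        BuildSearchUrl = GET_SEARCH_URL & recordId
    End Function

    ' Fetch the results page for one ID with a plain GET request (no browser).
    Public Function FetchByGet(ByVal recordId As String) As String
        Dim http As Object
        Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
        http.Open "GET", BuildSearchUrl(recordId), False   ' False = synchronous
        http.Send
        FetchByGet = http.ResponseText
    End Function
    ```

    If the URL does not contain the search term, this shortcut does not apply and the POST request has to be replicated instead.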

  3. #3
    Valued Forum Contributor

    Good to hear from you again Kyle, I will check this out carefully sometime in the next few days and report back.

    (Just quickly: I think the site is using JavaScript; after entering the form and clicking Next, it takes you to a new page; the url before/after does not include any of the search values (otherwise this would have been easy if it did!); Thanks for the tip on 'preserve history' - this sounds promising)

  4. #4
    Valued Forum Contributor

    I turned on the Developer Tools, made a search, found the POST line, checked the DETAILS tab, and I have found something interesting in the 'Request body' tab. This shows a long string (not a URL) which includes the search value I entered!

    Er, what can I now do with this string?

  5. #5
    Forum Guru Kyle123

    You need to replicate that request using WinHTTP or MSXML2.XMLHTTP (the former is the better option). The request body is sent by passing it as the parameter to the Send method. This will then return exactly the same data as the browser receives.

    As this is a secure site, you'll need to jump through a few hoops first, though: WinHTTP has to be authenticated so that it has a valid session cookie and can access the required page. Typically, you'd have a two-step approach: first grab the session cookie, then make the actual search request.

    As an example (obviously this won't work)
    Please Login or Register  to view this content.
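
    The two-step approach described above could be sketched roughly as follows. This is an illustrative sketch, not the poster's original code: both URLs, the form field names and the shape of the request body are placeholders, and it assumes a form-based login. It relies on the same WinHttpRequest object being reused for both requests, which (as far as I know) lets it carry the session cookie from the first request to the second.

    ```vb
    Option Explicit

    ' Placeholders only: the real site's URLs and field names are unknown.
    Private Const LOGIN_URL As String = "https://xyz.com/login"
    Private Const SEARCH_URL As String = "https://xyz.com/search"

    ' Rebuild the request body seen in the developer tools, swapping in the ID.
    ' "searchId" and "action" are invented names standing in for the real ones.
    Public Function BuildSearchBody(ByVal recordId As String) As String
        BuildSearchBody = "searchId=" & recordId & "&action=Next"
    End Function

    Public Function SearchOneId(ByVal userName As String, _
                                ByVal password As String, _
                                ByVal recordId As String) As String
        Dim http As Object
        Set http = CreateObject("WinHttp.WinHttpRequest.5.1")

        ' Step 1: log in so that this WinHTTP object receives a session cookie.
        http.Open "POST", LOGIN_URL, False
        http.SetRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        http.Send "username=" & userName & "&password=" & password

        ' Step 2: POST the search on the same object; the body goes to Send.
        http.Open "POST", SEARCH_URL, False
        http.SetRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        http.Send BuildSearchBody(recordId)

        If http.Status <> 200 Then
            Err.Raise vbObjectError + 1, , "Search failed: HTTP " & http.Status
        End If
        SearchOneId = http.ResponseText
    End Function
    ```

    In practice the body string passed to Send should be copied from the request body shown in the developer tools, with only the search value replaced.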

  6. #6
    Valued Forum Contributor

    I've been trying but I can't get this to work. I know HTML is new to me but even so!

    My first thought was that the site was being clever by checking if the user was using a browser or not. So I added request headers for User-Agent, Referer and Host. This didn't make any difference.

    My next thought was that maybe the site needed the search input page to be requested (GET or POST) before I attempt to POST the values for the search results page. So I added lines for this. However all I got was 401s / Access denied.

    I have pasted an extract of my code at the bottom of this post.

    I think we can rule out two lines of enquiry:
    1. The variables being sent. I checked these against the browser's Developer Log and they match exactly (as far as I can make out!)
    2. The initial login to the site. I am still using the browser automation for this part so I expect that would set any cookies etc. On a related note - the site assigns a GUID/Session key (visible in the URL) for each login. (The key is used in some of the variables sent)

    I am not sure what is preventing this from working.

    Next idea: is it possible that the site could be reading a cookie on each search attempt? If so, how would I use WinHTTP and the Developer Log to find the cookie and send it on each search?


    Please Login or Register  to view this content.
    Last edited by mc84excel; 08-15-2016 at 10:58 PM.

  7. #7
    Forum Guru Kyle123

    You need to include the login in the WinHTTP code. Cookies are not shared between Internet Explorer and WinHTTP, so WinHTTP will need its own.
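
    As a rough sketch of what "its own" means here: the WinHTTP object logs in itself and the site answers with a Set-Cookie header for that object, independent of anything Internet Explorer holds. The start URL is a placeholder, and the use of SetCredentials assumes the site uses an HTTP authentication prompt (the Windows Security popup) rather than a login form.

    ```vb
    Option Explicit

    ' Placeholder address: the real site's start page is unknown.
    Private Const START_URL As String = "https://xyz.com/"

    ' Return just the "name=value" part of a Set-Cookie header value.
    Public Function CookiePair(ByVal setCookieHeader As String) As String
        Dim semi As Long
        semi = InStr(setCookieHeader, ";")
        If semi > 0 Then
            CookiePair = Trim$(Left$(setCookieHeader, semi - 1))
        Else
            CookiePair = Trim$(setCookieHeader)
        End If
    End Function

    ' Log in with WinHTTP itself and hand back the authenticated object.
    Public Function LoginWithOwnSession(ByVal userName As String, _
                                        ByVal password As String) As Object
        Const CREDENTIALS_FOR_SERVER As Long = 0  ' HTTPREQUEST_SETCREDENTIALS_FOR_SERVER
        Dim http As Object
        Set http = CreateObject("WinHttp.WinHttpRequest.5.1")

        http.Open "GET", START_URL, False
        http.SetCredentials userName, password, CREDENTIALS_FOR_SERVER
        http.Send

        If http.Status = 401 Then
            Err.Raise vbObjectError + 2, , "Login rejected (HTTP 401)"
        End If
        Set LoginWithOwnSession = http
    End Function
    ```

    The returned object can then be reused for the search POSTs. It may be necessary to call SetCredentials again after each Open, and CookiePair is only there to inspect what the site sent back, for example CookiePair(http.GetResponseHeader("Set-Cookie")), which raises an error if the site sent no such header.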

  8. #8
    Valued Forum Contributor

    Quote Originally Posted by Kyle123 View Post
    You need to include the login. Cookies are not shared, winhttp will need its own
    Sorry Kyle. This is one of those cases where I understand each word by itself but not when they are all put together!

    Quote Originally Posted by Kyle123 View Post
    You need to include the login.
    You mean the login to the site? Where do I need to include this? In each POST? How do I include it? (The site login uses a Windows Security popup and the browser automation enters a username & password)

    Quote Originally Posted by Kyle123 View Post
    Cookies are not shared
    Not sure what you mean by Cookies are not shared? Wouldn't the site assign a cookie at login? Is it possible that the site could want to confirm the cookie's existence or timeout period before accepting the search? (I don't know. Just a guess from an HTML newbie!)

    Quote Originally Posted by Kyle123 View Post
    winhttp will need its own
    Not sure what you mean by winhttp will need its own. Its own what? Its own cookie? I thought cookies were created by websites. How could I create a cookie using WinHTTP?
