
Load webpage on TOPMAN website, extract prices from webpage into Spreadsheet

  1. #1
    Registered User
    Join Date
    01-19-2009
    Location
    UK
    MS-Off Ver
    2007
    Posts
    60

    Hi,


    Each product ID on the TOPMAN website refers to a specific product. When you search for one of these codes on the site, you are taken straight to the product page.

    I have a list of these IDs and would like to programmatically search for each ID on the site and then extract the prices from the resulting page into a spreadsheet.


    The search URL is:

    [URL not shown: hidden behind the forum login]

    The product code goes at the end, like so:

    [example not shown: hidden behind the forum login]

    Here is a sample list of product IDs:

    6121878
    6700583
    7164888
    7164943
    8263254
    6104361
    8317649
    8588149
    8585322
    6707754
    8743382
    8743113
    8737514
    8742972
    8743456
    8433289


    I think you can use WinHttpRequest for this, but I'm not sure and I haven't had the chance to try it yet.

    Any ideas?


    Thanks,
    .
    - AKK9 -
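
    A rough sketch of the loop described above, written in Python rather than VBA so it can be run on its own. The real search URL is hidden in this thread, so `SEARCH_BASE` is a placeholder, and the `fetch` parameter is a hypothetical hook that lets the loop be tried without touching the network:

```python
from urllib.request import urlopen

# Placeholder only: the real search URL is not shown in the thread.
SEARCH_BASE = "https://www.example.com/search?q="


def fetch_page(url):
    """Download a page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_pages(product_ids, fetch=fetch_page):
    """Return (product_id, url, html) rows, one per product ID."""
    rows = []
    for product_id in product_ids:
        url = SEARCH_BASE + product_id  # the product code goes at the end
        rows.append((product_id, url, fetch(url)))
    return rows
```

    Calling `collect_pages(["6121878", "6700583"], fetch=lambda url: "<html></html>")` exercises the loop offline; each returned row could then be written to one spreadsheet row once a price has been pulled out of the HTML.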

  2. #2
    Forum Guru Kyle123's Avatar
    Join Date
    03-10-2010
    Location
    Leeds
    MS-Off Ver
    2016 Win10
    Posts
    7,164

    Dunno if you're still after this, but:
    [code not shown: hidden behind the forum login]
    would suffice. Having done a huge amount of web scraping of the Arcadia Group websites, a couple of pointers:

    They all share exactly the same site design and infrastructure, so whatever works on one will work on them all.
    The URL is for the most part irrelevant; all that matters is the store ID and the catalogue ID, e.g.:

    http://www.topman.com/webapp/wcs/sto...atalogId=33057 - TopShop
    http://www.topman.com/webapp/wcs/sto...atalogId=33055 - Miss Selfridge

    Beware of the catalogue IDs: they do change, and when they change, they'll break your scraping. For huge amounts of scraping, doing it in YQL and parsing the result into Excel is much faster and easier than making a separate request for each row in Excel. For example, if you wanted all the clothing prices in a single category, this can be done in a single call rather than calling each page.
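
    Since the store ID and catalogue ID are the parts that matter (and the parts that change), one option is to read them out of a known-good URL and compare them against what the scraper expects. A minimal Python sketch, assuming the IDs travel as the `storeId` and `catalogId` query parameters as in the URLs in this thread; the function names are made up for illustration:

```python
from urllib.parse import parse_qs, urlsplit


def store_and_catalogue(url):
    """Return the (storeId, catalogId) pair from a URL's query string."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; take the first, or None.
    store_id = query.get("storeId", [None])[0]
    catalogue_id = query.get("catalogId", [None])[0]
    return store_id, catalogue_id


def ids_changed(url, expected_store, expected_catalogue):
    """True when the URL no longer carries the IDs the scraper was built for."""
    return store_and_catalogue(url) != (expected_store, expected_catalogue)
```

    Running `ids_changed` against a freshly copied category link before a big scrape would flag a changed catalogue ID early instead of halfway through the run.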

  3. #3
    Registered User
    Join Date
    01-19-2009
    Location
    UK
    MS-Off Ver
    2007
    Posts
    60

    Thanks! I got this sorted in the end. I used WinHttpRequest, the Document Object and Regex to pull all kinds of details from the product pages. If I wanted to get a whole category, I'd just go to the category page and use the DOM again.
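
    For comparison, the DOM-style step could look something like this in Python, using only the standard library's `html.parser`. The class name passed in (`product_price` in the usage note) is an assumption for illustration, not taken from the real pages:

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested element inside a match
        elif self.class_name in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def texts_with_class(html, class_name):
    """Return the text of each element whose class list contains class_name."""
    collector = ClassTextCollector(class_name)
    collector.feed(html)
    return collector.results
```

    For example, `texts_with_class(page_html, "product_price")` would return one string per matching element, ready to be written to a column.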

    I'm not familiar with YQL though. How does it work?

  4. #4
    Forum Guru Kyle123's Avatar
    Join Date
    03-10-2010
    Location
    Leeds
    MS-Off Ver
    2016 Win10
    Posts
    7,164

    Regex is awfully nasty for parsing HTML; you'd be better off with the HTML object.

    YQL (Yahoo Query Language) allows you to query websites with a SQL-like syntax, for example:
    YQL Code: 
    select * from html where url="http://www.missselfridge.com/webapp/wcs/stores/servlet/CatalogNavigationSearchResultCmd?catalogId=33055&storeId=12554&langId=-1&viewAllFlag=false&sort_field=Relevance&categoryId=208118&parent_categoryId=208117&beginIndex=1&pageSize=200" and xpath="//ul[@class='product']" 
    The URL it generates: http://query.yahooapis.com/v1/public...uct'%5D%22

    So you can see what you'd load with Excel.

    You use XPath to return the parts you're interested in and loop through those; in the above case, all the ul elements with a class of product.
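
    To make the XPath part concrete, here is the same expression applied locally in Python with the standard library's `xml.etree.ElementTree`. The markup in `SAMPLE` is invented for illustration, and ElementTree only accepts well-formed XML, so real pages would usually need a more forgiving HTML parser first:

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; not taken from the real site.
SAMPLE = """
<div>
  <ul class="product"><li>Shirt</li><li>28.00</li></ul>
  <ul class="nav"><li>Home</li></ul>
  <ul class="product"><li>Jeans</li><li>40.00</li></ul>
</div>
"""


def product_rows(xml_text):
    """Return one list of li texts for each ul whose class is 'product'."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # Same predicate as the YQL example: //ul[@class='product']
    for ul in root.findall(".//ul[@class='product']"):
        rows.append([li.text for li in ul.findall("li")])
    return rows
```

    Each inner list corresponds to one product block, which maps naturally onto one spreadsheet row.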

    It starts getting really nifty, though, for multiple websites with the same layout, like Arcadia's, since it allows you to use an IN clause:

    http://developer.yahoo.com/yql/conso...oduct%27%5D%22

    YQL Code: 
    select * from html where url in ("http://www.missselfridge.com/webapp/wcs/stores/servlet/CatalogNavigationSearchResultCmd?catalogId=33056&storeId=12555&langId=-1&viewAllFlag=false&sort_field=Relevance&categoryId=207200&parent_categoryId=207169&beginIndex=1&pageSize=200","http://www.missselfridge.com/webapp/wcs/stores/servlet/CatalogNavigationSearchResultCmd?catalogId=33055&storeId=12554&langId=-1&viewAllFlag=false&sort_field=Relevance&categoryId=208118&parent_categoryId=208117&beginIndex=1&pageSize=200","http://www.topshop.com/webapp/wcs/stores/servlet/CatalogNavigationSearchResultCmd?catalogId=33057&storeId=12556&langId=-1&viewAllFlag=false&sort_field=Relevance&categoryId=208528&parent_categoryId=203984&beginIndex=1&pageSize=200") and xpath="//ul[@class='product']" 
    So you can aggregate data from several sites using a common clause. This outputs:

    http://query.yahooapis.com/v1/public...agnostics=true
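
    Writing the IN clause by hand is error-prone (every URL needs its own quotes and a comma between entries), so it may be easier to generate the statement. A small Python sketch; the helper names are made up, and only the statement and its percent-encoded form are built here, since the full endpoint URL is truncated in this thread:

```python
from urllib.parse import quote


def yql_in_query(urls, xpath):
    """Build a YQL statement selecting `xpath` from several pages at once."""
    # Each URL is wrapped in double quotes and the list is comma-separated.
    url_list = ",".join(f'"{url}"' for url in urls)
    return f'select * from html where url in ({url_list}) and xpath="{xpath}"'


def encode_for_query_string(statement):
    """Percent-encode the statement so it can be sent as a query parameter."""
    return quote(statement, safe="")
```

    Passing the three category URLs above together with `//ul[@class='product']` to `yql_in_query` reproduces the statement shown, and `encode_for_query_string` gives the form that goes into the request URL.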

  5. #5
    Registered User
    Join Date
    01-19-2009
    Location
    UK
    MS-Off Ver
    2007
    Posts
    60

    No, the Regex wasn't for the HTML; I used the Document Object for that. I used Regex because on some sites there was some SKU-specific info (like colour, size and prices) in JavaScript variables, which I couldn't get with the Document Object.
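
    As an illustration of that kind of extraction, here is a Python sketch that pulls an object literal out of an inline script with a regex and then parses it as JSON. The variable name `skuData` and the sample fragment are invented, and it assumes the assigned value is valid JSON terminated by a semicolon:

```python
import json
import re

# Invented page fragment; the variable name and fields are hypothetical.
SCRIPT = """
<script>
  var skuData = {"sku": "6121878", "colour": "Navy", "size": "M", "price": 28.0};
</script>
"""


def js_object(page_text, var_name):
    """Return the object assigned to `var_name` as a dict, or None if absent."""
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;",
        re.DOTALL,  # let the object literal span several lines
    )
    match = pattern.search(page_text)
    return json.loads(match.group(1)) if match else None
```

    Here `js_object(SCRIPT, "skuData")` returns a dictionary with the colour, size and price, which the DOM alone would not expose because the data never appears as page elements.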

    YQL looks great, but isn't it the same as using .getElementsByName/Class/Tag and looping through them? Maybe I'm missing something?
