I am trying to find a piece of code that will allow data to be retrieved from the source of a HTML page and be placed in a column in excel.
I have a spreadsheet that cointains the url of the web page, and have the code to view the source, however, what i can't do is extract data within a specific span tag.
i.e. i want to be able to go to a web address in colum a (www.awebsite.com) and extract data between <span class = "image"> and</span> tags and paste this value into the colum B. The reason is i have a large amount of products to maintain and need to ensure they always have an image as the image paths change regularly. Can anyone help?
Hello R1chard,
Welcome to the Forum!
Yes the data can be extracted from the between the tags. My question is why is the page source in the workbook? Surely, you don't need to save the entire page source do you?
Sincerely,
Leith Ross
Remember To Do the Following....
1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.2. Thank those who have helped you by clicking the Starbelow the post.
3. Please mark your post [SOLVED] if it has been answered satisfactorily.
Old Scottish Proverb...
Luathaid gu deanamh maille! (Rushing causes delays!)
I may not have explained myself clearly before, the page source isn't in the workbook, i have a URL to a web page in the work book. I then have some code that opens the source code of this page (in IE), i then want to be able to extract the HTML and text between the span tags, and place only this selected text back in to excel next to the url. The part i am having trouble with is extracting only the data between the span tags. Any help would be gratefully received.
This is what i have so far:
Sub CheckImg() Dim IEapp As Object Dim ieDoc As Object Dim Site As String Site = Range("A1") Set IEapp = CreateObject("InternetExplorer.Application") IEapp.Navigate Site While IEapp.Busy DoEvents Wend Set ieDoc = IEapp.Document '*** the part i am missing is the bit that goes here that sereched the HTML source and ' extracts data from between the span tags and places it in cell B1 Finish: IEapp.Quit Set IEapp = Nothing Set ieDoc = Nothing End Sub
Hello R1chard,
Sorry about the delay. I have been experiencing system problems for the past 2 days and everything has been unstable. None of my diagnostics has been able to pin point the problem. I finally got some code together for you despite these problems.
The attached workbook contains 5 macros in 3 separate modules. This code works much faster than using Internet Explorer because it accesses the server directly to retrieve the page source. Another advantage is it returns the status of the server. So, if there is problem it can be identified. The data between the start and end tag is copied down the worksheet from a cell you specify. One of the macros is used to convert HTML amp codes i.e. into actual characters.
The macro ScrapeData is setup to read a list of URLs. Each URL's parsed tag pair data is added to a single column on "Sheet2" below the header row. The data from the next URL is placed in the next column to the right. Previous data is cleared before the new data is copied. Have a look a let me know if you need any changes made.
Sincerely,
Leith Ross
Remember To Do the Following....
1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.2. Thank those who have helped you by clicking the Starbelow the post.
3. Please mark your post [SOLVED] if it has been answered satisfactorily.
Old Scottish Proverb...
Luathaid gu deanamh maille! (Rushing causes delays!)
Hello, Leith
Thank you so much.
With your code, I was able to retrieve the title of the pages.
I have 260 pages to retrieve (a lot of) information from.
One of the things I am not able to retrieve is the language because there is no end tag
For example, in <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pt-pt" lang="pt-pt" > I tried something like "xml:lang" as start tag and ">" as end tag but the macro sends a Run-Time error '1004'
Can you help?
Thank you
Rui
Hello Rui,
The code is written to extract the text between the tags. What you are looking for is a tag attribute. This macro will put all the language attributes of a web page into the collection called "Matches". This is zero based collection meaning the first element is at index zero of the collection.
' Written: October 20, 2011 ' Author: Leith Ross ' Summary: Finds all the langauge attributes on a web page Sub ParseLanguage() Dim ieApp As Object Dim ieDoc As Object Dim Lang As String Dim Matches As Object Dim RegExp As Object Dim Text As String Dim URL As String URL = "http://www.cpu-world.com/" Set ieApp = CreateObject("InternetExplorer.Application") ieApp.Navigate URL ieApp.Visible = True While ieApp.Busy And ieApp.ReadyState <> 4: DoEvents: Wend While ieDoc Is Nothing: Set ieDoc = ieApp.Document: DoEvents: Wend ' Returns the source text Text = ieDoc.body.outerHTML ' Parse out the language attributes for either XML or HTML Set RegExp = CreateObject("VBScript.RegExp") RegExp.Global = True RegExp.Pattern = "([xml\:l|\sl]ang="".{2,9}(?:""))\s" ' Display the number of matching attributes and assign first match to the variable Lang If RegExp.Test(Text) Then ' Return a collection of all matches Set Matches = RegExp.Execute(Text) MsgBox Matches.Count & " Language attribute(s) found." Lang = Matches(0) ' First Match End If End Sub
Sincerely,
Leith Ross
Remember To Do the Following....
1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.2. Thank those who have helped you by clicking the Starbelow the post.
3. Please mark your post [SOLVED] if it has been answered satisfactorily.
Old Scottish Proverb...
Luathaid gu deanamh maille! (Rushing causes delays!)
Thank you!
One of the greatest things of your code is the possibility of scanning a lot of pages and take the necessary information.
I have to retrieve information (<noscript>, <noembed>, and, of course, the primary language of the page and the changes that occur) in 260 pages of schools.
The problem is that I have a tool to automate some of this retrieval but it gets fooled a lot,. For example, if <span lang=PT occurs in two consecutive paragraphs, the tool counts this behavior as good in terms of accessibility. In fact, is something like an error of the program used to create the HTML code.
The tool gives me the numerical results, not the actual attributes, ALT values, etc.
As I can see from what you say, this code works for a page at a time.
Would it be possible to take the code that takes the URL's and analyses them and use it to obtain all the language codes in all my pages?
Hello Ru1,
I don't see why not. Can you provide me with the code you are using now?
Sincerely,
Leith Ross
Remember To Do the Following....
1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.2. Thank those who have helped you by clicking the Starbelow the post.
3. Please mark your post [SOLVED] if it has been answered satisfactorily.
Old Scottish Proverb...
Luathaid gu deanamh maille! (Rushing causes delays!)
And more help, please.
After creating a new wokbook, I tried to run the code in a module.
All it does is opening Internet Explorer and the page "http://www.cpu-world.com/"
I get no results in Excel. :-(
One other thing. If not possible to use a list of URL's, would it be possible to use the macro with the URL as an argument instead of being embeded in the code?
Sorry for some errors that may occur (english and Excel languages - I'm not versed in one nor the other)
I was not clear about my request. Sorry.
What I waned to say was: Would it be possible to take your code that reads the URLs and use it with ParseLanguage code?
About the tool that I use, it is not mine and I don't have access to it.
You can see it in action here (http://www.acesso.umic.pt/webax/examinator.php), if you want to but I think it is useful for some taks, not for this one (and it is in portuguese). I think it easier to take the info directily from the source code of the pages.
Hello Ru1,
I can combine them but I need a little more information from you.
Where would the language information go on the page?
Would need all of the language attributes or just 1?
Sincerely,
Leith Ross
Remember To Do the Following....
1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.2. Thank those who have helped you by clicking the Starbelow the post.
3. Please mark your post [SOLVED] if it has been answered satisfactorily.
Old Scottish Proverb...
Luathaid gu deanamh maille! (Rushing causes delays!)
Hello Ru1,
I rewrote the macro to operate as function that takes the URL as its argument. It will then return the object. If the object is Nothing then no language attributes were found. There is a second macro which will the languages on the web page you posted.
' Written: October 20, 2011 ' Author: Leith Ross ' Summary: Returns all the langauge attributes on a web page Function GetLanguages(ByVal URL As String) As Object Dim ieApp As Object Dim ieDoc As Object Dim Lang As String Dim Matches As Object Dim RegExp As Object Dim Request As Object Dim Text As String On Error Resume Next Set Request = CreateObject("WinHttp.WinHttpRequest.5.1") If Request Is Nothing Then Set Request = CreateObject("WinHttp.WinHttpRequest.5") End If Err.Clear On Error GoTo 0 Request.Open "GET", URL, False Request.Send Text = Request.responsetext ' Parse out the language attributes for either XML or HTML Set RegExp = CreateObject("VBScript.RegExp") RegExp.Global = True RegExp.Pattern = "([xml\:l|\sl]ang="".{2,9}(?:""))\s" ' Display the number of matching attributes and assign first match to the variable Lang If RegExp.Test(Text) Then ' Return a collection of all matches Set Matches = RegExp.Execute(Text) End If Set GetLanguages = Matches End Function Sub LangTest() Dim Lang As Variant Dim Languages As Object Dim Msg As String Set Languages = GetLanguages("http://www.acesso.umic.pt/webax/examinator.php") If Not Languages Is Nothing Then For Each Lang In Languages Msg = Msg & Lang & vbCrLf Next Lang MsgBox "The Lanaguage Attributes for this Page are:" & vbCrLf & vbCrLf & Msg End If End Sub
Sincerely,
Leith Ross
Remember To Do the Following....
1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.2. Thank those who have helped you by clicking the Starbelow the post.
3. Please mark your post [SOLVED] if it has been answered satisfactorily.
Old Scottish Proverb...
Luathaid gu deanamh maille! (Rushing causes delays!)
I need to retrieve the primary language of the page (normally after de Doctype definition) and the changes in the language definition on the page (normally <span lang="PT" ........> </span> or <p lang="PT".... </p>
In the first case, I would only need the language code ("PT", "EN-US",...)
In the second case, as I have to determine if the language redefinition corresponds to a real change in the language of the document, the best solution would be to collect language code and the text that follows in every occurrence and send it to columns in a worksheetv like you did in the Scrape data (after this, I could do a manual check to see if the text is in Portuguese, English, etc. - this is important to assistive technologies like screen readers).
My problem is: I can do the manual check in each individual page but it is a time-consuming process, because I need to check a lot of things (events like onmouseover, onfocus, the content of noscript,
Your original code for the attributes helps me a lot because I can change the attribute and retrieve the information.
As for the latest code, I tried it but I am afraid I don't know how I can pass the URL arguments to your code (or where they should be located in the spreadsheet).
I forgot to mention what I did:
In a cell I wrote =Getlanguages(A2), where A2 as a URL
When I wrote =Getlanguages(http://aebpc.pt/escola/), I get an error.
Hello Ru1,
The macro returns a collection object of the matches. There would need to be additional code to copy those matches to another group of cells. I can expand the macro function to take a cell as another argument. It would then enter the languages found starting with that cell and go down. This could be called from the worksheet like a formula. Would that be a better option?
Sincerely,
Leith Ross
Remember To Do the Following....
1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.2. Thank those who have helped you by clicking the Starbelow the post.
3. Please mark your post [SOLVED] if it has been answered satisfactorily.
Old Scottish Proverb...
Luathaid gu deanamh maille! (Rushing causes delays!)
There are currently 1 users browsing this thread. (0 members and 1 guests)
Bookmarks