Thank you for your prompt reply.
I could not get the latest code to work. I don't know what I am doing wrong.
So, frustrated, I tried your first code and changed the attributes to read everything between <span lang=> and </span> and it worked!!! I manually checked some pages and it returned the occurrences correctly!
I think for the language the only problem now is the primary language.
1. Do you think it is possible to have the primary language of a series of URLs returned to another sheet (or to a column next to the URLs)?
2. I will try with the other information I need to retrieve and tell you what happened.
Hello Ru1,
This may work better. It can be used on the worksheet like a formula: you pass the URL by cell reference, and the macro returns the language attributes as a comma-separated list in the cell.
[CODE]
' Written: October 20, 2011
' Author:  Leith Ross
' Summary: Returns all the language attributes on a web page

Function GetLanguages(ByVal URL As String)

    Dim Languages As Variant
    Dim Match As Object
    Dim Matches As Object
    Dim RegExp As Object
    Dim Request As Object
    Dim Text As String

        ' Try the current WinHTTP version first, then fall back to the older one
        On Error Resume Next
            Set Request = CreateObject("WinHttp.WinHttpRequest.5.1")
            If Request Is Nothing Then
                Set Request = CreateObject("WinHttp.WinHttpRequest.5")
            End If
            Err.Clear
        On Error GoTo 0

        Request.Open "GET", URL, False
        Request.Send

        Text = Request.ResponseText

        ' Parse out the language attributes for either XML or HTML
        Set RegExp = CreateObject("VBScript.RegExp")
        RegExp.Global = True
        RegExp.Pattern = "([xml\:l|\sl]ang="".{2,9}(?:""))\s"

        If RegExp.Test(Text) Then
            ' Build a comma separated list from the collection of all matches
            Set Matches = RegExp.Execute(Text)
            For Each Match In Matches
                Languages = Languages & Match.Value & ","
            Next Match
        End If

        ' Drop the trailing comma; return an empty string when nothing matched
        If Len(Languages) > 0 Then
            GetLanguages = Left(Languages, Len(Languages) - 1)
        Else
            GetLanguages = ""
        End If

End Function
[/CODE]
Calling the Function on the Worksheet
[CODE]
' Column "A" has the URLs, Column "B" this UDF
=GetLanguages(A1)
=GetLanguages(A2)
=GetLanguages(A3)
[/CODE]
Sincerely,
Leith Ross
Remember To Do the Following....
1. Use code tags. Place [CODE] before the first line of code and [/CODE] after the last line of code.
2. Thank those who have helped you by clicking the star below the post.
3. Please mark your post [SOLVED] if it has been answered satisfactorily.
Old Scottish Proverb...
Luathaid gu deanamh maille! (Rushing causes delays!)
It worked perfectly!
I will try to collect the other information and will tell you how it worked.
Thank you so much!
Hello Ru1,
Glad to hear the good news. I will be looking forward to your next results.
Sincerely,
Leith Ross
Hello Leith
Do you think it would be possible to use the original code to obtain the information from saved pages (or saved source code), instead of the live online page?
I may have to "freeze" the results and I think this would be a good way to do it.
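One way to "freeze" the pages is to download each one once, store the raw bytes on disk, and later run the analysis against the files. The following is only a minimal sketch: SavePageSnapshot, ReadSavedPage, and their arguments are hypothetical names that do not appear in Leith's code, and the reader assumes the saved pages are UTF-8 encoded.

```vba
' Sketch only. SavePageSnapshot, ReadSavedPage, and their arguments are
' hypothetical names; they are not part of the code posted in this thread.

' Download URL once and store the raw bytes in FilePath.
Sub SavePageSnapshot(ByVal URL As String, ByVal FilePath As String)
    Dim Request As Object
    Dim Stream As Object

    Set Request = CreateObject("WinHttp.WinHttpRequest.5.1")
    Request.Open "GET", URL, False
    Request.Send

    Set Stream = CreateObject("ADODB.Stream")
    Stream.Type = 1                      ' 1 = binary
    Stream.Open
    Stream.Write Request.ResponseBody    ' raw bytes, no re-encoding
    Stream.SaveToFile FilePath, 2        ' 2 = overwrite an existing file
    Stream.Close
End Sub

' Read a saved page back as text. Assumes the file is UTF-8 encoded.
Function ReadSavedPage(ByVal FilePath As String) As String
    Dim Stream As Object

    Set Stream = CreateObject("ADODB.Stream")
    Stream.Type = 2                      ' 2 = text
    Stream.Charset = "utf-8"
    Stream.Open
    Stream.LoadFromFile FilePath
    ReadSavedPage = Stream.ReadText
    Stream.Close
End Function
```

With something like this in place, the function that fetches the live page could read the saved copy instead: the three Request lines would be replaced by Text = ReadSavedPage(FilePath), where FilePath comes from a worksheet cell. Pages that are not UTF-8 would need a different Charset value.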
Hello, Leith:
I managed to retrieve some of the information I needed, like the content of <noscript>, etc.
Unfortunately, I have difficulties retrieving:
- The content of the ALT attribute
- The occurrences of <b>, <i>, <u>, <em>, <strong>
- The occurrences of headers <h1> to <h6>
- The occurrences of onmouseover, onfocus, onblur, onclick, onkeypress, onmouseout
:-(
Also, as the analysis takes time, it would be very useful to do the analysis on a sample from day X, something like "the sample as it was on 25 October, with the analysis done from 25 October to 10 November". But for this, the code would have to read from the saved pages.
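The items in the list above (ALT text, formatting tags, headers, event handlers) can all be approached with the same VBScript.RegExp object used earlier in the thread. This is a rough sketch, not a full HTML parser: CountMatches, GetAltTexts, and DemoCounts are hypothetical names, the patterns are one possible choice, and only double-quoted ALT values are handled.

```vba
' Sketch only. All names below are hypothetical helpers.

' Count how many times Pattern matches in Text (case-insensitive).
Function CountMatches(ByVal Text As String, ByVal Pattern As String) As Long
    Dim Re As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Global = True
    Re.IgnoreCase = True
    Re.Pattern = Pattern
    CountMatches = Re.Execute(Text).Count
End Function

' Return every double-quoted ALT value, separated by " | ".
Function GetAltTexts(ByVal Text As String) As String
    Dim Re As Object
    Dim Match As Object
    Dim Result As String

    Set Re = CreateObject("VBScript.RegExp")
    Re.Global = True
    Re.IgnoreCase = True
    Re.Pattern = "\balt\s*=\s*""([^""]*)"""
    For Each Match In Re.Execute(Text)
        Result = Result & Match.SubMatches(0) & " | "
    Next Match
    ' Drop the trailing separator if anything was found
    If Len(Result) > 0 Then Result = Left$(Result, Len(Result) - 3)
    GetAltTexts = Result
End Function

Sub DemoCounts()
    Dim Html As String
    Html = "<h1>Title</h1><p><b>bold</b> and <i>italic</i></p>" & _
           "<img src=""a.png"" alt=""A cat"" onclick=""go()"">"

    ' Opening <b>, <i>, <u>, <em>, <strong> tags
    Debug.Print CountMatches(Html, "<(b|i|u|em|strong)\b[^>]*>")      ' 2
    ' Opening <h1> to <h6> tags
    Debug.Print CountMatches(Html, "<h[1-6]\b[^>]*>")                 ' 1
    ' Event handler attributes
    Debug.Print CountMatches(Html, _
        "\bon(mouseover|focus|blur|click|keypress|mouseout)\s*=")     ' 1
    Debug.Print GetAltTexts(Html)                                     ' A cat
End Sub
```

The \b after the tag name keeps <b from matching <body and <i from matching <img. The Text argument can come from a live request or from a saved file, so the same helpers would work on a frozen sample.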
Hello Leith:
Through searching, I realised that there are programs that do something similar to your code, but they are normally used to retrieve data. I downloaded three: Data Extractor, DEiXTo (a Greek program) and one aptly named Extract Data & Text From Multiple Web Sites Software. And even a Firefox extension (OutWit Hub Light).
Unfortunately, I cannot enter the parameters to search correctly so I'm stuck.
I would ask for help but I think the number of things I am asking for demands a lot of work.
So I understand perfectly if you cannot answer my request.
Anyway, I must thank you for your enormous help with some of my needs and for making me start to think about the possibilities (and even learn a little; if I cannot learn more, it is my limitation, not the fault of the excellent teacher). Thank you.
Hello Ru1,
The macros I have written are for specific searches and were not intended to perform a broad range of search options. However, I have been working on such a program but it is far from complete.
Sincerely,
Leith Ross
Yes, I understand that these macros are specific.
Thus, I tried to change the search parameters in the first code and I succeeded in some cases. Then I tried to understand RegExp.Pattern = "([xml\:l|\sl]ang="".{2,9}(?:""))\s" and went to the Microsoft help to work out this regular expression, but to no avail.
Does your program search for some of the strings I need? In some cases, I can see patterns like: if it finds ALT, it returns the text surrounded by quotes that follows, and I could change these to suit my needs. Also, in some cases I just need to search for strings of text: if it finds "onmouseover", it puts this in a column.
If it has these capabilities and is usable by someone with basic skills, I would appreciate it.
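For anyone else puzzling over that pattern, here is a reading of it piece by piece. Inside the VBA string every "" stands for one literal quote character. The part in square brackets, [xml\:l|\sl], is a character set, so it matches exactly one character out of x, m, l, colon, |, or whitespace (not the words "xml:l" or " l"). After that come the literal letters ang=, a quote, .{2,9} meaning any 2 to 9 characters, a closing quote, and finally \s, which requires whitespace after the closing quote. The sketch below shows a stricter alternative under my own assumptions; DemoLangPattern is a hypothetical name and this is not the pattern used in the posted macro.

```vba
' Sketch only. DemoLangPattern is a hypothetical name.
Sub DemoLangPattern()
    Dim Re As Object
    Dim Match As Object

    Set Re = CreateObject("VBScript.RegExp")
    Re.Global = True
    Re.IgnoreCase = True

    ' \b               start at a word boundary
    ' (?:xml:)?        optional "xml:" prefix (not captured)
    ' lang\s*=\s*      the attribute name, allowing spaces around "="
    ' ""([^""]{2,9})"" a quoted value of 2 to 9 non-quote characters, captured
    Re.Pattern = "\b(?:xml:)?lang\s*=\s*""([^""]{2,9})"""

    For Each Match In Re.Execute("<html lang=""en"" xml:lang=""pt-BR"">")
        Debug.Print Match.SubMatches(0)   ' prints en, then pt-BR
    Next Match
End Sub
```

SubMatches(0) returns only the captured value, so the output is the language code without the attribute name or quotes. Unlike the original, this variant does not need whitespace after the closing quote, so a lang attribute that is the last one in its tag would also be found.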
This thread has been very helpful! I was able to use Leith's code to extract data from a web page successfully when I use StartTag = <a single tag>. However, the ScrapeData subroutine stopped working when I tried to search for nested tags.
Example:
<div class="column secondary review-body">
<p class="review-text">
<span>
Need to retrieve this data.
</span>
Any thoughts on how I can get the code to work with nested tags?
Thanks!
LJ
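The ScrapeData subroutine is not shown in this part of the thread, so the following is not a fix for it, only a sketch of one alternative: describe the nested tags in a single regular expression and capture the text inside the innermost one. GetReviewText is a hypothetical name, and the pattern assumes the markup looks like the example above (a p element with class="review-text" directly containing a span).

```vba
' Sketch only. GetReviewText is a hypothetical helper.
' Returns the text inside <span> nested in <p class="review-text">,
' or an empty string when the structure is not found.
Function GetReviewText(ByVal Html As String) As String
    Dim Re As Object
    Dim Matches As Object

    Set Re = CreateObject("VBScript.RegExp")
    Re.Global = False          ' first match only
    Re.IgnoreCase = True
    ' [\s\S]*? matches any characters, including line breaks, as few as possible
    Re.Pattern = "<p\s+class=""review-text""[^>]*>\s*<span[^>]*>\s*([\s\S]*?)\s*</span>"

    Set Matches = Re.Execute(Html)
    If Matches.Count > 0 Then GetReviewText = Matches(0).SubMatches(0)
End Function

Sub DemoReviewText()
    Dim Html As String
    Html = "<div class=""column secondary review-body"">" & vbCrLf & _
           "<p class=""review-text"">" & vbCrLf & _
           "<span>" & vbCrLf & _
           "Need to retrieve this data." & vbCrLf & _
           "</span>"
    Debug.Print GetReviewText(Html)   ' Need to retrieve this data.
End Sub
```

A regular expression like this breaks down if a span contains another span, or if the attributes appear in a different order; for pages with deeper nesting, a real HTML parser would be the safer route.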