Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


vba - Extracting data to Excel from multiple website pages

I'm trying to get the data from:

"http://www.css.ethz.ch/en/services/css-partners.html?page=1" to "...page=180"

(2691 results across 180 pages) into Excel as three columns (name, country, description), as a one-off, so I can hold the same information locally and search it more quickly.

I figure VBA could do this easily, but I'm totally new to it and don't really know where to start :S Any pointers appreciated!


1 Answer


I've set up something like this at work.

I used this reference. I recommend you read it.

Preparation:

  1. Get the computed HTML of the section(s) of the page(s) you're targeting (e.g. with the F12 developer console) so you understand its structure.

    <div class="articleBox navigation">
      <!-- ... -->
      <article>
        <div>
          <a href="css-partners/partner.html/100775">Aarhus University (AU)<span class="icon"></span></a>
        </div>
        <div class="nav-hint bold author">Denmark</div>
        <div>Aarhus University (AU) is an academically diverse and research-oriented institution that works to solve the complex developmental challenges facing the world.</div>
      </article>
      <!-- ... -->
    </div>
    
  2. It helps if you already understand the Document Object Model and how to traverse it with JavaScript (query selectors, child nodes and so on), because the Microsoft IE interface largely mirrors it. For example, in JavaScript:

    var articles = document.querySelectorAll("div.articleBox.navigation > article")
    
  3. Add references to "Microsoft Internet Controls" and "Microsoft HTML Object Library" to your VB project.

The sub:

  1. Initialise and open Internet Explorer in memory.

    Dim ie As New InternetExplorer
    
  2. Navigate to the page.

    ie.Navigate "http://www.css.ethz.ch/en/services/css-partners.html?page=1"
    
  3. Wait until the page has loaded.

    Do While ie.ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
    
  4. Traverse the Document Object Model of the page and store relevant details as required.

    Dim articles As IHTMLDOMChildrenCollection
    Dim article As IHTMLElement
    Dim divs As IHTMLElementCollection
    
    ...
    
    Set articles = ie.Document.querySelectorAll("div.articleBox.navigation > article")
    Set article = articles(0)
    Set divs = article.Children
    
  5. Write the relevant details to a range.

    Range("A1") = divs(0).innerText
    Range("B1") = divs(1).innerText
    Range("C1") = divs(2).innerText
    
  6. Loop within article elements and loop pages (not shown).

  7. Close and destroy instance of Internet Explorer.

    ie.Quit
    Set ie = Nothing
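
The wait loop in step 3 keeps spinning forever if a page never finishes loading. One possible variation is a small helper with a timeout. This is only a sketch: the name `WaitForPage` and the 30-second default are made up for illustration and are not part of the steps above, and it assumes the two references from the preparation section are set.

```vba
' Hypothetical helper (not from the steps above): waits for the page to
' finish loading, but gives up after timeoutSeconds.
' Returns True if the page loaded, False if the timeout was reached.
Function WaitForPage(ByVal ie As InternetExplorer, _
                     Optional ByVal timeoutSeconds As Long = 30) As Boolean

    Dim deadline As Date
    deadline = DateAdd("s", timeoutSeconds, Now)

    ' Busy is checked as well as ReadyState, so the loop does not exit
    ' before the navigation has actually started.
    Do While ie.Busy Or ie.ReadyState <> READYSTATE_COMPLETE
        If Now > deadline Then Exit Function   ' WaitForPage stays False
        DoEvents
    Loop

    WaitForPage = True

End Function
```

It could be called right after `ie.Navigate`, for example `If Not WaitForPage(ie) Then Exit Sub`.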
    

Put together:

Sub GetSearchResults()

    Dim ie As New InternetExplorer
    Dim articles As IHTMLDOMChildrenCollection
    Dim article As IHTMLElement
    Dim divs As IHTMLElementCollection

    ' Load the first results page.
    ie.Navigate "http://www.css.ethz.ch/en/services/css-partners.html?page=1"

    ' Wait until the page has loaded.
    Do While ie.ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    ' Take the first <article> on the page and its child <div> elements.
    Set articles = ie.Document.querySelectorAll("div.articleBox.navigation > article")
    Set article = articles(0)
    Set divs = article.Children

    ' Write name, country and description to the active sheet.
    Range("A1") = divs(0).innerText
    Range("B1") = divs(1).innerText
    Range("C1") = divs(2).innerText

    ' Close and release the Internet Explorer instance.
    ie.Quit
    Set ie = Nothing

End Sub

I leave it as an exercise for you to work out how to loop within the article elements on the page, how to loop within all the pages you want to target, and how to write the information extracted to the appropriate Ranges in Excel.
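
As a starting point for that exercise, here is a minimal sketch of the two loops. It assumes every results page has the same structure as the HTML shown in the preparation section, that there are 180 pages as stated in the question, and that the output goes to the active worksheet with a header in row 1. The names `GetAllSearchResults`, `BASE_URL`, `LAST_PAGE` and `nextRow` are made up for illustration, and the same two library references are required.

```vba
' Sketch only: loops over all pages and all <article> elements per page.
' Assumes the page structure shown earlier and a worksheet as the active sheet.
Sub GetAllSearchResults()

    Const BASE_URL As String = "http://www.css.ethz.ch/en/services/css-partners.html?page="
    Const LAST_PAGE As Long = 180   ' page count quoted in the question

    Dim ie As New InternetExplorer
    Dim articles As IHTMLDOMChildrenCollection
    Dim article As IHTMLElement
    Dim divs As IHTMLElementCollection
    Dim ws As Worksheet
    Dim pageNo As Long, i As Long, nextRow As Long

    Set ws = ActiveSheet
    ws.Range("A1:C1").Value = Array("Name", "Country", "Description")
    nextRow = 2

    For pageNo = 1 To LAST_PAGE

        ie.Navigate BASE_URL & pageNo

        ' Wait until the page has loaded.
        Do While ie.Busy Or ie.ReadyState <> READYSTATE_COMPLETE
            DoEvents
        Loop

        Set articles = ie.Document.querySelectorAll("div.articleBox.navigation > article")

        ' One output row per <article>; the collection is zero-based.
        For i = 0 To articles.Length - 1
            Set article = articles.Item(i)
            Set divs = article.Children

            ' Skip anything that does not have the expected three <div>s.
            If divs.Length >= 3 Then
                ws.Cells(nextRow, 1).Value = Trim$(divs(0).innerText)
                ws.Cells(nextRow, 2).Value = Trim$(divs(1).innerText)
                ws.Cells(nextRow, 3).Value = Trim$(divs(2).innerText)
                nextRow = nextRow + 1
            End If
        Next i

    Next pageNo

    ie.Quit
    Set ie = Nothing

End Sub
```

Reusing a single Internet Explorer instance for all pages avoids starting the browser 180 times. Writing each row as it is found keeps the sketch simple; collecting the values in an array and writing them in one go would likely be faster.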

