Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


jquery - Scraping JavaScript-generated content using request in Node.js

I need to scrape some content from Google search results that only shows up in browsers (I suspect only when JavaScript is enabled), specifically their Knowledge Graph "People also search for" content.

I use a combination of request and cheerio to scrape, and I have already managed to force-load results from the .com domain. However, the Knowledge Graph box does not show up in the body of my results, probably because it is JavaScript-generated content.

Does anybody know if there's a setting I could add or another library I could use?

Here's my code below. Thank you!

var request = require('request');
var cheerio = require('cheerio');

// Keep cookies between requests and send the same browser-like User-Agent each time.
request = request.defaults({
    jar: true,
    headers: {
        'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16'
    }
});

// Visit /ncr first so the cookie jar keeps later results on the .com domain.
request('http://www.google.com/ncr', function (error) {
    if (error) {
        return console.error(error);
    }

    request('https://www.google.com/search?gws_rd=ssl&site=&source=hp&q=google&oq=google', function (error, response, body) {
        if (error) {
            return console.error(error);
        }

        // Parse the static HTML and print the text of every <li> element.
        var $ = cheerio.load(body);

        $('li').each(function () {
            console.log($(this).text());
        });
    });
});

1 Answer


You can't do this with Node's request module, because it merely downloads the static content and never executes the page's scripts. To render JavaScript-generated content you have to use a browser. Fortunately, there are headless browsers for exactly this purpose. I suggest PhantomJS.
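
As a rough illustration of the headless-browser approach, here is a minimal sketch of a PhantomJS script. It assumes PhantomJS is installed and that the script is run with the phantomjs binary rather than node. The file name scrape.js, the helper name collectListText, and the 2-second wait are assumptions made for this example; the URL and the li selector are taken from the question. Whether the "People also search for" box actually appears in the rendered page is not something this sketch can guarantee.

```javascript
// scrape.js (hypothetical file name): run with `phantomjs scrape.js`.
// This is a PhantomJS script, not a Node.js module.

// Hypothetical helper. It is executed inside the rendered page via
// page.evaluate, so it may only rely on page globals such as `document`.
function collectListText() {
    var items = document.querySelectorAll('li');
    return Array.prototype.map.call(items, function (li) {
        return li.textContent;
    });
}

// Only drive the browser when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // Same search URL as in the question.
    var url = 'https://www.google.com/search?gws_rd=ssl&site=&source=hp&q=google&oq=google';

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Failed to load ' + url);
            phantom.exit(1);
            return;
        }

        // Assumption: a 2-second pause gives the page's scripts time to run.
        setTimeout(function () {
            var texts = page.evaluate(collectListText);
            texts.forEach(function (text) {
                console.log(text);
            });
            phantom.exit(0);
        }, 2000);
    });
}
```

The difference from the request + cheerio version is that the li elements are read from the DOM after the page's scripts have run, instead of from the raw HTML response.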

