php - How to add scraped website data to a database?

I want to store:

  1. Product Name
  2. Category
  3. Subcategory
  4. Price
  5. Product Company.

These go in my table named products_data, with fields named PID, product_name, category, subcategory, product_price and product_company.

I am using PHP's curl_init() function to scrape the website first, and then I want to store the product data in my database table. Here is what I have done so far:

$sites[0] = 'http://www.babyoye.com/';

foreach ($sites as $site)
{
    $ch = curl_init($site);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);

    $title_start = '<div class="info">';

    $parts = explode($title_start,$html);
    foreach($parts as $part){
        $link = explode('<a href="/d/', $part);

        $link = explode('">', $link[1]);
        $url = 'http://www.babyoye.com/d/'.$link[0];

        // now for the title we need to follow a similar process:

        $title = explode('<h2>', $part);

        $title = explode('</h2>', $title[1]);

        $title = strip_tags($title[0]);

        // INSERT DB CODE HERE e.g.

        $db_conn = mysql_connect('localhost', 'root', '') or die('error');
        mysql_select_db('babyoye', $db_conn) or die(mysql_error());

        $sql = "INSERT INTO products_data(PID, product_name) VALUES ('".$url."', '".$title."')"

        mysql_query($sql) or die(mysql_error()); 

    }
}

I am a little confused about the database part: how do I insert this data into the table? Any help?


1 Answer


There are a number of things you may wish to consider in your design phase, prior to writing any code:

  • Generalise your solution as much as you can. If you have to write PHP code for every new scrape, then the development work required whenever a target site changes its layout may be too slow to keep up, and may disrupt the enterprise you are building. This is extra-important if you intend to scrape a large number of sites, since the odds of one of them restructuring are statistically greater.
  • One way to achieve this generalisation is to use off-the-shelf libraries that are already good at this. So, rather than using cURL directly, use Goutte or some other programmatic browser system. This will give you sessions for free, which on some sites are necessary to click from one page to another. You'll also get CSS selectors to specify which items of content you are interested in (see the first sketch after this list).
  • For tabular content, store a look-up table in your local database that converts a heading title to a database column name. For product grids, you could use a table that converts a CSS selector (relative to each grid cell, say) to a column. Either of these will make it easier to respond to changes in the format of your target site(s).
  • If you are extracting text from a site, at a minimum you need to run it through a proper escape system, otherwise a target site could, in theory, add content to their pages that injects SQL of their choosing into your database. In any case, an apostrophe on their side would certainly cause your query to fail, so you should use mysql_real_escape_string (see the second sketch after this list).
  • If you are extracting HTML from a site with view to re-displaying it, always remember to clean it properly first. This means stripping tags that you don't want, removing attributes that may be unwelcome, and ensuring the structure is well-nested. HTMLPurifier is good for this, I've found.
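
On the Goutte point, here is a minimal sketch of the idea, assuming Goutte is installed via Composer; the selectors div.info, a and h2 are taken from the question's markup and are assumptions about the real page structure:

<?php
// A minimal sketch, assuming Goutte is installed via Composer
// (composer require fabpot/goutte). The selectors mirror the
// question's markup and will likely need adjusting.
require 'vendor/autoload.php';

use Goutte\Client;

$client  = new Client();
$crawler = $client->request('GET', 'http://www.babyoye.com/');

// CSS selectors replace the fragile explode() string-chopping
$crawler->filter('div.info')->each(function ($node) {
    if ($node->filter('a')->count() === 0) {
        return; // skip cells without a product link
    }
    $url   = 'http://www.babyoye.com' . $node->filter('a')->attr('href');
    $title = trim($node->filter('h2')->text());
    // hand $url and $title to your storage layer here
});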

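On the escaping point, here is a minimal sketch using the question's own mysql_* API and table/column names (note that the connection is made once, outside any loop):

<?php
// Sketch of the escaping advice above, reusing the question's
// mysql_* calls. Connect once, not on every loop iteration.
$db_conn = mysql_connect('localhost', 'root', '') or die('error');
mysql_select_db('babyoye', $db_conn) or die(mysql_error());

// Escape every scraped value before it reaches the SQL string;
// a stray apostrophe (or deliberate injection) would otherwise
// break the query.
$sql = sprintf(
    "INSERT INTO products_data (PID, product_name) VALUES ('%s', '%s')",
    mysql_real_escape_string($url),
    mysql_real_escape_string($title)
);
mysql_query($sql) or die(mysql_error());

(Prepared statements via PDO or mysqli are the more robust route, if you can use them.)
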
When crawling, remember:

  • Be a good robot and define a unique USER_AGENT for yourself, so site operators can easily block you if they wish. It is poor etiquette to masquerade as a human using, say, Internet Explorer. Include a URL to a friendly help page in your user agent, like the GoogleBot does.
  • Don't crawl through proxies or other systems intended to hide your identity - crawl in the open.
  • Respect robots.txt; if a site wishes to block scrapers, they should be allowed to do so using respected conventions. If you are acting like a search engine, the odds of an operator wishing to block you are very low (don't most people want to be scraped by search engines?)
  • Always do some rate limiting, otherwise you will hammer the target server. On my development laptop over a slow connection, I can scrape a site at a rate of two pages a second, even without using multi_curl. On a real server, that's likely to be much faster - maybe 20? Either way, making that number of requests to one target IP/domain is a great way to find yourself in someone's blocklist. So, if you scrape, do it slowly (see the first sketch after this list).
  • I maintain a table of HTTP accesses, and have a rule that if I've made a request in the last 5 seconds, I "pause" this scrape, and scrape something else instead. I come back to paused scrapes once sufficient time has passed. I may be inclined to increase this value, and hold the concurrent state of a larger number of paused operations in memory.
  • If you are scraping a number of sites, one way to maintain performance without sleeping excessively is to interleave the requests you wish to make on a round-robin basis. So, do one HTTP operation each on 50 sites, retain the state of each scrape, and then go back to the first one.
  • If you implement the interleaving of many sites, you can use multi_curl to parallelise your HTTP requests (see the second sketch after this list). I wouldn't recommend using this on a single site, for the reasons already stated (the remote server may well limit the number of connections you can open to it separately anyway).
  • Be careful about basing your entire enterprise on the scraping of a single site. If they block you, you're fairly stuck. If your business model can rely on the scraping of many sites, then being blocked by one becomes less of a risk.
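
As a rough illustration of the user-agent and rate-limiting points, here is a minimal sketch; the user-agent string, the help URL and the 5-second delay are assumptions you would tune for your own crawler:

<?php
// A minimal sketch: a descriptive User-Agent plus a per-host
// "last fetched" table, waiting if a host was hit in the last
// 5 seconds. A real crawler would scrape another site instead
// of sleeping (the "pause and come back" approach above).
const MIN_DELAY = 5; // seconds between requests to the same host

function politeFetch($url, array &$lastAccess)
{
    $host = parse_url($url, PHP_URL_HOST);
    $wait = isset($lastAccess[$host])
        ? MIN_DELAY - (time() - $lastAccess[$host])
        : 0;
    if ($wait > 0) {
        sleep($wait);
    }
    $lastAccess[$host] = time();

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Identify yourself honestly, with a pointer to a help page
    curl_setopt($ch, CURLOPT_USERAGENT,
        'MyScraper/1.0 (+http://example.com/scraper-info)');
    $html = curl_exec($ch);
    curl_close($ch);

    return $html;
}

$lastAccess = array();
$html = politeFetch('http://www.babyoye.com/', $lastAccess);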

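And a sketch of the round-robin interleaving with multi_curl - one request per site, in parallel, never two at once to the same host; the site list here is purely illustrative:

<?php
// Issue one HTTP request per site in parallel via curl_multi,
// then parse the results and queue each site's next page.
$sites = array(
    'http://site-a.example.com/page1',
    'http://site-b.example.com/page1',
    'http://site-c.example.com/page1',
);

$mh = curl_multi_init();
$handles = array();
foreach ($sites as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all transfers to completion
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // avoid busy-waiting
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    // parse $html, save this site's state, queue its next page
}
curl_multi_close($mh);
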
Also, it may be cost-effective to install third-party scraping software, or to get a third-party service to do the scraping for you. My own research in this area has turned up very few organisations that appear to be capable (and bear in mind that, at the time of writing, I've not tried any of them). So, you may wish to look at these:

