Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Scraper not scraping field "Description"

I have a web scraper that was written for me using Scrapy.

I want to add an extra field from the website the scraper collects data from.

The column header "Description" is created in the CSV output, but no value is scraped for it.

# -*- coding: utf-8 -*-
import scrapy
from pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import csv,re
from scrapy import signals
class Rapid7(scrapy.Spider):
    name = 'vulns'
    allowed_domains = ['rapid7.com']
    main_url = 'https://www.rapid7.com/db/?q=&type=nexpose&page={}'
    #start_urls = ['https://www.rapid7.com/db/vulnerabilities']
    keys = ['Published','CVEID', 'Added', 'Modified', 'Related', 'Severity', 'CVSS', 'Created', 'Solution', 'References', 'Description', 'URL']
    def __init__(self):
        SignalManager(dispatcher.Any).connect(receiver=self._close, signal=signals.spider_closed)
    def start_requests(self):
        for i in range(1,10):
            url = self.main_url.format(i)
            yield scrapy.Request(url,callback=self.parse)
    def parse(self, response):
        flag = True
        temp = response.xpath('//div[@class="vulndb__intro-content"]/p/text()').extract_first()
        if temp:
            if temp.strip()=='An error occurred.':
                flag= False
        temp = [i for i in response.xpath('//*[@class="results-info"]/parent::div/p/text()').extract() if i.strip()]
        if len(temp)==1:
            flag= False
        if flag:
            for article in response.xpath('//*[@class="vulndb__results"]/a/@href').extract():
                yield scrapy.Request(response.urljoin(article), callback=self.parse_article, dont_filter=True)

    def parse_article(self,response):
        item=dict()
        item['Published'] = item['Added'] = item['Modified'] = item['Related'] = item['Severity'] = item['Description'] =''
        r=response.xpath('//h1[text()="Related Vulnerabilities"]/..//a/@href').extract()
        temp = response.xpath('//meta[@property="og:title"]/@content').extract_first()
        item['CVEID'] = ''
        try:
            temp2 = re.search(r'(CVE-.*-\d*)',temp).groups()[0]
            if ":" in temp2:
                raise KeyError
        except:
            try:
                temp2 = re.search('(CVE-.*):',temp).groups()[0]
            except:
                temp2 = ''
        if temp2:
            item['CVEID'] = temp2.replace(': Important',"").replace(')','')
        table = response.xpath('//section[@class="tableblock"]/div')
        for row in table:
            header = row.xpath('header/text()').extract_first()
            data = row.xpath('div/text()').extract_first()
            item[header]=data
        temp = [i for i in response.xpath('//div[@class="vulndb__related-content"]//text()').extract() if i.strip()]
        for ind,i in enumerate(temp):
            if "CVE" in i:
                temp[ind] = i.replace(' ','')

        item['Related']= ", ".join(temp) if temp else ""
        temp2= [i for i in response.xpath('//h4[text()="Solution(s)"]/parent::*/ul/li/text()').extract() if i.strip()]
        item['Solution'] =", ".join(temp2) if temp2 else ''
        temp3 = [i for i in response.xpath('//h4[text()="References"]/parent::*/ul/li/text()').extract() if i.strip()]
        item['References'] = ", ".join(temp3) if temp3 else ''
        temp4 = [i for i in response.xpath('//h4[text()="Description"]/parent::*/ul/li/text()').extract() if i.strip()]
        item['Description'] = ", ".join(temp4) if temp4 else ''
        item['URL'] = response.request.url
        new_item=dict()
        for key in self.keys:
            if key not in list(item.keys()):
                new_item[key] = ''
            else:
                new_item[key]=item[key]
        yield new_item

    def _close(self):
        print("Done Scraping")

Thanks.

"It looks like your post is mostly code; please add some more details." Sorry. :(

  asked by Davey Boy, translated from Stack Overflow


1 Answer

Waiting for an expert to answer.
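Until a confirmed answer is posted, one thing worth checking, offered as a guess rather than a diagnosis: the "Description" XPath in the question copies the shape used for "Solution(s)" and "References" (heading, then parent, then `ul/li`). If the page renders the description as a paragraph instead of a list, that XPath matches nothing and the column stays empty. The sketch below illustrates the effect on made-up markup. The `SAMPLE` fragment and the `texts_under_heading` helper are hypothetical, the real rapid7.com page structure is not known here, and the standard library's `xml.etree.ElementTree` stands in for Scrapy's selectors so the example runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: an assumption for illustration, not the real page.
SAMPLE = """
<div>
  <section>
    <h4>Description</h4>
    <p>A flaw allows remote attackers to read files.</p>
  </section>
  <section>
    <h4>Solution(s)</h4>
    <ul><li>upgrade-package-a</li><li>upgrade-package-b</li></ul>
  </section>
</div>
"""


def texts_under_heading(root, heading, child_path):
    """Collect non-empty text from elements at `child_path` under the
    parent of the <h4> whose text equals `heading`."""
    nodes = root.findall(f".//h4[.='{heading}']/../{child_path}")
    return [n.text.strip() for n in nodes if n.text and n.text.strip()]


root = ET.fromstring(SAMPLE)

# Heading -> parent -> ul/li works for a section rendered as a list.
solutions = texts_under_heading(root, "Solution(s)", "ul/li")

# The same shape finds nothing when the section is a paragraph...
description_as_list = texts_under_heading(root, "Description", "ul/li")

# ...while looking for <p> children under the same parent does.
description_as_paragraph = texts_under_heading(root, "Description", "p")

print(solutions)
print(description_as_list)
print(description_as_paragraph)
```

If the real page does hold the description in a paragraph, the matching change in the spider would be an XPath ending in `/p/text()` rather than `/ul/li/text()`. The heading might also not be an `h4` at all, so inspecting one article page in `scrapy shell` before editing the spider would confirm or rule out either possibility.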
