Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

web scraping - Python - Extracting data between specific comment nodes with BeautifulSoup 4

I'm looking to extract specific data from a website, such as prices and company info. Luckily, the site's designer has wrapped the sections in comment markers such as:

<!-- Begin Services Table -->
desired data
<!-- End Services Table -->

What code would I need for BS4 to return the strings between these comment markers? Here is my attempt so far:

import requests
from bs4 import BeautifulSoup

url = "http://www.100ll.com/searchresults.php?clear_previous=true&searchfor=" + "KPLN" + "&submit.x=0&submit.y=0"

response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

text_list = soup.find(id="framediv").find_all(text=True)
start_index = text_list.index(' Begin Fuel Information Table ') + 1
end_index = text_list.index(' End Fuel Information Table ')
for item in text_list[start_index:end_index]:
    print(item)

Here's the website in question:

http://www.100ll.com/showfbo.php?HashID=cf5f18404c062da6fa11e3af41358873


1 Answer

If you want to select the table element that follows a specific comment, you can select all the comment nodes, filter them by the desired text, and then select the next sibling table element:

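import requests
from bs4 import BeautifulSoup, Comment

# `url` is the page address defined in the question.
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

# Comment nodes are strings of type Comment, so filter on that type.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for comment in comments:
    if comment.strip() == 'Begin Services Table':
        # The first <table> that follows the comment at the same nesting level.
        table = comment.find_next_sibling('table')
        print(table)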
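
To try the sibling lookup without hitting the live site, here is a minimal offline sketch. The markup in `SAMPLE_HTML` and the helper name `table_after_comment` are made up for illustration, and the sketch assumes the table sits at the same nesting level as the comment, which may not hold on the real page. It uses the built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div id="framediv">
<!-- Begin Services Table -->
<table id="services"><tr><td>Fuel</td><td>Hangar</td></tr></table>
<!-- End Services Table -->
<table id="other"><tr><td>Unrelated</td></tr></table>
</div>
"""


def table_after_comment(soup, marker):
    """Return the first <table> sibling after the comment whose text equals marker, or None."""
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if comment.strip() == marker:
            return comment.find_next_sibling("table")
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = table_after_comment(soup, "Begin Services Table")

# Pull the cell text out of the matched table.
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)
```

Returning `None` when the marker is missing lets the caller decide how to handle a page whose layout has changed.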

Alternatively, if you want to get all data between those two comments, then you could find the first comment and then iterate over all the next siblings until you find the closing comment:

import requests
from bs4 import BeautifulSoup, Comment

# `url` is the page address defined in the question.
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

data = []

for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment.strip() == 'Begin Services Table':
        # Collect every following sibling until the closing comment.
        for node in comment.next_siblings:
            if isinstance(node, Comment) and node.strip() == 'End Services Table':
                break
            data.append(node)

print(data)
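
Since the question asks for the strings rather than the nodes, the sibling walk can be adapted to collect text only. The following is a rough sketch under the same assumptions as before: `SAMPLE_HTML` and the helper `strings_between_comments` are hypothetical, and both markers are assumed to be siblings under the same parent.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div>
<!-- Begin Services Table -->
<table><tr><td>100LL</td><td>$5.10</td></tr></table>
<p>Open 24 hours</p>
<!-- End Services Table -->
<p>Footer text</p>
</div>
"""


def strings_between_comments(soup, start, end):
    """Collect non-empty text found between two sibling comment markers."""
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if comment.strip() != start:
            continue
        found = []
        for node in comment.next_siblings:
            if isinstance(node, Comment):
                if node.strip() == end:
                    break
                continue  # skip unrelated comments inside the region
            if isinstance(node, Tag):
                # Gather all text inside the tag, whitespace-trimmed.
                found.extend(node.stripped_strings)
            elif node.strip():
                found.append(node.strip())
        return found
    return []  # start marker not present


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
strings = strings_between_comments(
    soup, "Begin Services Table", "End Services Table"
)
print(strings)
```

If the closing marker is never found, this version returns everything after the opening marker up to the end of the parent element, so it may be worth checking the result against what the page is expected to contain.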
