I am attempting to scrape a web page that contains several blocks with the following structure:
<p class="row">
<span>stuff here</span>
<a href="http://www.host.tld/file.html">Descriptive Link Text</a>
<div>Link Description Here</div>
</p>
I am scraping the webpage using curl:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>
I have done some research and found that I should not use a regex to parse the HTML returned by cURL, and that I should use PHP's DOM extension instead. This is how I have done it:
$newDom = new DOMDocument;
$newDom->preserveWhiteSpace = false; // set before loading, since it is a parse-time option
$newDom->loadHTML($html);
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    // nodeValue is the text content of the <p> element
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}
I won't pretend that I completely understand this, but I get the gist, and I do get the sections I want. The only issue is that I get just the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    // <extracted link> is a placeholder for the href I want to pull out of each section
    echo '<a href="<extracted link>">LINK</a> ' . $printString . "<br>";
}
As you can see, I cannot get the link because I only get the text of the web page and not its source. I know that curl_exec is returning the HTML, because I have tried using just that, so I believe the DOM is somehow stripping the HTML that I want.
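To make the goal concrete, here is a minimal sketch of the kind of result I am after. It assumes that the href can be read from each anchor with DOMElement::getAttribute() and that the markup of a node can be serialized with DOMDocument::saveHTML($node). The $html string is a hard-coded stand-in for the curl_exec() output, and the names $rowsMarkup and $links are only illustrative. The <div> from the sample is left out, since a <div> inside a <p> is not valid HTML and the parser may restructure it.

```php
<?php
// Stand-in for the string returned by curl_exec(); mirrors the structure shown above.
$html = '<html><body><p class="row">'
      . '<span>stuff here</span>'
      . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a>'
      . '</p></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$rowsMarkup = []; // serialized HTML of each <p>
$links = [];      // href and link text of each <a> found inside a <p>

foreach ($dom->getElementsByTagName('p') as $p) {
    // nodeValue only returns text; saveHTML($node) returns the node's markup
    $rowsMarkup[] = $dom->saveHTML($p);

    foreach ($p->getElementsByTagName('a') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->nodeValue),
        ];
    }
}

foreach ($links as $link) {
    // Escape values before writing them back into HTML
    echo '<a href="' . htmlspecialchars($link['href']) . '">LINK</a> '
       . htmlspecialchars($link['text']) . "<br>\n";
}
```

If this is roughly the right approach, the text-only output I am seeing would come from reading nodeValue rather than from the DOM discarding the markup.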