Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


powershell - Parse local HTML file

I can use PowerShell to parse an HTML page:

PS > $foo = Invoke-WebRequest http://example.com

PS > $foo.Links.Count
1

However, if I download the page

PS > Invoke-WebRequest -OutFile example.htm http://example.com

and then try to parse the downloaded page, it gives an unexpected result:

PS > $foo = Invoke-WebRequest file://$pwd/example.htm

PS > $foo.Links.Count
0

How can I parse the locally downloaded page?



1 Answer


It appears that Invoke-WebRequest loads file protocol URIs just fine but fails to parse them, even in PowerShell 4.0 (where it is officially supported).

An alternative that does not require setting up a website is to load the HTML directly into MSHTML and let it do the parsing:

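# Create an MSHTML document object (Windows only).
$html = New-Object -ComObject "HTMLFile"
# Read the whole file as a single string.
$source = Get-Content -Path "file.html" -Raw
$html.IHTMLDocument2_write($source)

# Number of links found in the parsed document.
$html.links.length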
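
The same steps can be wrapped in a small function. This is only a sketch: `Get-LocalHtmlLink` is a hypothetical name, and the fallback to `write()` with UTF-16 bytes is an assumption for systems where the `IHTMLDocument2_write` member is not exposed. It has not been verified against every Windows or PowerShell version.

```powershell
# Hypothetical helper (not part of the original answer): returns the href of
# every link in a local HTML file by parsing it with the MSHTML COM object.
# Windows only, since it relies on the "HTMLFile" COM class.
function Get-LocalHtmlLink {
    param([Parameter(Mandatory)][string]$Path)

    # Read the whole file as one string.
    $source = Get-Content -LiteralPath $Path -Raw
    $html = New-Object -ComObject 'HTMLFile'
    try {
        # Works where the IHTMLDocument2 interface members are exposed.
        $html.IHTMLDocument2_write($source)
    }
    catch {
        # Assumption: where that member is missing, write() accepts the
        # markup as UTF-16 bytes instead.
        $html.write([System.Text.Encoding]::Unicode.GetBytes($source))
    }
    $html.Close()

    # Emit one href per link in the parsed document.
    foreach ($link in $html.links) { $link.href }
}
```

It could then be called as `Get-LocalHtmlLink -Path .\example.htm`, and wrapping the call in `@( ... ).Count` gives the number of links.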

Note that when I tested the MSHTML approach, a single

<meta http-equiv="X-UA-Compatible" content="IE=edge" />

header prevented my HTML from parsing, and I have no idea why: the document had similar XHTML-style headers, and MSHTML had no issues with those.
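
If that tag gets in the way, one possible workaround is to strip it from the source string before handing the markup to MSHTML. The sketch below is an assumption rather than a confirmed fix: `Remove-UACompatibleMeta` is a hypothetical helper, and it has not been tested against the document that triggered the problem.

```powershell
# Hypothetical helper: removes any <meta http-equiv="X-UA-Compatible" ...> tag
# from an HTML string so the remaining markup can be handed to MSHTML.
function Remove-UACompatibleMeta {
    param([Parameter(Mandatory)][string]$Html)

    # (?i) makes the match case-insensitive; [^>]* keeps it inside one tag.
    $pattern = '(?i)<meta\b[^>]*http-equiv\s*=\s*["'']?X-UA-Compatible[^>]*>'
    return $Html -replace $pattern, ''
}
```

The cleaned string would then be written in place of the raw source, for example `$html.IHTMLDocument2_write((Remove-UACompatibleMeta -Html $source))`.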

