I am running Nutch 1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax right in the file NUTCH_ROOT/conf/regex-urlfilter.txt.
The site I want to crawl has a URL similar to this:
http://www.example.com/foo.cfm
On that page there are numerous links that match the following pattern:
http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976
I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following:
+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$
Nutch matches the first rule and crawls that page correctly, but it does not seem to pick up links through the second rule. How can I get Nutch to crawl URLs like the second example above?
I have tried the following with no luck:
+^http://www.example.com/foo.cfm/(.+)*$
+^http://www.example.com/foo.cfm/(.)*$
+^http://www.example.com/foo.cfm/.+$
+^http://www.example.com/foo.cfm/(.*)*$
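A quick check outside Nutch can help separate a regex problem from a rule-ordering problem. The sketch below is only an approximation in plain Java, under three assumptions: that the filter evaluates rules top to bottom, that the first rule whose pattern is found anywhere in the URL decides the outcome, and that the stock `-[?*!@=]` line from the default regex-urlfilter.txt is still present above the custom rules. The class name `UrlFilterCheck` and the method `firstMatch` are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {

    // Returns "+" or "-" for the first rule whose pattern is found in the URL,
    // or null when no rule applies. This mimics the ASSUMED first-match-wins
    // behaviour of the regex URL filter; it does not call any Nutch code.
    static String firstMatch(String[] rules, String url) {
        for (String rule : rules) {
            String sign = rule.substring(0, 1);                // leading '+' or '-'
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return sign;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976";

        // Only the two custom rules from the question.
        String[] customOnly = {
            "+^http://www.example.com/foo.cfm$",
            "+^http://www.example.com/foo.cfm/(.+)*$"
        };

        // Same rules, preceded by the stock "skip probable queries" rule
        // (ASSUMPTION: this line was left in place above the custom rules).
        String[] withStockRule = {
            "-[?*!@=]",
            "+^http://www.example.com/foo.cfm$",
            "+^http://www.example.com/foo.cfm/(.+)*$"
        };

        System.out.println(firstMatch(customOnly, url));    // prints "+"
        System.out.println(firstMatch(withStockRule, url)); // prints "-"
    }
}
```

If the second call prints `-`, the `=` in `ID=6976` would be rejected before any `+` rule is reached. Under that assumption, the position of the rules in the file, rather than the regex syntax, would be the thing to look at.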
In my NUTCH_ROOT/urls/nutch I have:
http://www.example.com/foo.cfm/