Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
web scraping - How to automate multiple requests to a web search form using R

I'm trying to learn how to use RCurl (or some other suitable R package if I'm wrong about RCurl being the right tool) to automate the process of submitting search terms to a web form and placing the search results in a data file. The specific problem I'm working on is as follows:

I have a data file giving license plate number (LPN) and vehicle identification number (VIN) for several automobiles. The California Department of Motor Vehicles (DMV) has a web page search form where you enter the LPN and the last five digits of the VIN, and it returns the vehicle license fee (VLF) payment for either 2010 or 2009 (there's a selector for that on the input form as well). (FYI: This is for a research project to look at the distribution of VLF payments by vehicle make, model and model year)

I could go through the tedious process of manually entering data for each vehicle and then manually typing the result into a spreadsheet. But this is the 21st Century and I'd like to try and automate the process. I want to write a script that will submit each LPN and VIN to the DMV web form and then put the result (the VLF payment) in a new VLF variable in my data file, doing this repeatedly until it gets to the end of the list of LPNs and VINs. (The DMV web form is here by the way: https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do).

My plan was to use getHTMLFormDescription() (in the RHTMLForms package) to find out the names of the input fields and then use getForm() or postForm() (in the RCurl package) to retrieve the output. Unfortunately, I got stuck at the very first step. Here's the R command I used and the output:

> forms = getHTMLFormDescription("https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do")
Error in htmlParse(url, ...) : 
  File https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do does not exist 

Unfortunately, being relatively new to R and almost completely new to HTTP and web-scraping, I'm not sure what to do next.

First, does anyone know why I'm getting an error on my getHTMLFormDescription() call? Alternatively, is there another way to figure out the names of the input fields?

Second, can you suggest some sample code to help me get started on actually submitting LPNs and VINs and retrieving the output? Is getForm() or postForm() the right approach or should I be doing something else? If it would help to have some real LPN-VIN combinations to submit, here are three:
LPN      VIN
5MXH018  30135
4TOL562  74735
5CWR968  11802

Finally, since you can see I'm a complete novice at this, do you have suggestions on what I need to learn in order to become adept at web scraping of this sort and how to go about learning it (in R or in another language)? Specific suggestions for web sites, books, listservs, other StackOverflow questions, etc. would be great.

Thanks for your help.


1 Answer


Adding to the suggestions by daroczig and Rguy, here is a short piece of code that automates the entire process of extracting the data into a data frame.

library(XML)

# construct sample data frame with lpn, vin and year
lpn  = rep(c('5MXH018', '4TOL562', '5CWR968'), 2)
vin  = rep(c('30135', '74735', '11802'), 2)
year = c(rep(2009, 3), rep(2010, 3))
mydf = data.frame(lpn, vin, year)

# construct function to extract data for one record
get_data = function(df){

  # root url, with the method and submit parameters already filled in
  root = 'http://www.dmv.ca.gov/wasapp/FeeCalculatorWeb/vlfFees.do?method=calculateVlf&submit=Determine%20VLF'

  # construct url by adding lpn, year and vin
  u = paste(root, '&vehicleLicense=', df$lpn, '&vehicleTaxYear=',
            df$year, '&vehicleVin=', df$vin, sep = "")

  # encode url correctly
  url = URLencode(u)

  # the VLF payment sits in the fifth table on the results page
  readHTMLTable(url)[[5]]
}
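If you want to sanity-check the query string before firing off a batch of requests, you can factor the URL construction out into its own little helper. This is just a sketch; the parameter names (`vehicleLicense`, `vehicleTaxYear`, `vehicleVin`) are the ones used in the code above, and `build_vlf_url` is a name I made up for illustration.

```r
# build the fee-calculator query URL for one vehicle
# (hypothetical helper; parameter names taken from the code above)
build_vlf_url = function(lpn, vin, year) {
  root = 'http://www.dmv.ca.gov/wasapp/FeeCalculatorWeb/vlfFees.do'
  paste(root,
        '?method=calculateVlf&submit=Determine%20VLF',
        '&vehicleLicense=', lpn,
        '&vehicleTaxYear=', year,
        '&vehicleVin=', vin, sep = '')
}

# inspect the URL for one record before scraping anything
build_vlf_url('5MXH018', '30135', 2009)
```

Printing the result for one record and pasting it into a browser is a quick way to confirm the form fields are being filled in the way the DMV page expects.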

# apply function to every row of mydf and return data frame of results
library(plyr)
mydata = adply(mydf, 1, get_data)

# strip the non-breaking-space junk from the column names
names(mydata) = gsub(':\302\240\302\240', '', names(mydata))
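On the getHTMLFormDescription() error: my guess (untested) is that htmlParse() in the XML package cannot fetch https URLs on its own. A common workaround is to download the page with RCurl, which does speak SSL, and parse the downloaded text instead of the URL. Something along these lines should work, assuming getHTMLFormDescription() accepts an already-parsed document in place of a URL:

```r
library(RCurl)
library(XML)
library(RHTMLForms)

# download the https page with RCurl (which handles SSL),
# then parse the text we already have rather than the URL
txt = getURL('https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do',
             ssl.verifypeer = FALSE)
doc = htmlParse(txt, asText = TRUE)

# describe the form's input fields from the parsed document
forms = getHTMLFormDescription(doc)
```

Note that `ssl.verifypeer = FALSE` skips certificate verification, which is fine for a quick look at the form but worth tightening up for production use.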
