(Windows 7 / R version 3.0.1)
Below the commands and the resulting error:
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:UsersRaffaelAppDataLocalTemp
RtmpS8Uql1pdfinfo167c2bc159f8': No such file or directory
How do I solve this issue?
EDIT I
(As suggested by Ben and described here)
I downloaded Xpdf copied the 32bit version to
C:Program Files (x86)xpdf32
and the 64bit version to
C:Program Filesxpdf64
The environment variables pdfinfo
and pdftotext
are referring to the respective executables either 32bit (tested with R 32bit) or to 64bit (tested with R 64bit)
EDIT II
One very confusing observation is that starting from a fresh session (tm not loaded) the last command alone will produce the error:
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:UsersRaffaelAppDataLocalTempRtmpKi5GnL
pdfinfode8283c422f': No such file or directory
I don't understand this at all because the function variable is not defined by tm.readPDF yet. Below you'll find the function pdf refers to "naturally" and to what is returned by tm.readPDF:
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0674bd8c>
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0c3d7364>
Apparently there is no difference - then why use readPDF at all?
EDIT III
The pdf file is located here: C:UsersRaffaelDocuments
> getwd()
[1] "C:/Users/Raffael/Documents"
EDIT IV
First instruction in pdf()
is a call to tm:::pdfinfo()
- and there the error is caused within the first few lines:
> outfile <- tempfile("pdfinfo")
> on.exit(unlink(outfile))
> status <- system2("pdfinfo", shQuote(normalizePath("C:/Users/Raffael/Documents/17214.pdf")),
+ stdout = outfile)
> tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
+ "Producer", "CreationDate", "ModDate", "Tagged", "Form",
+ "Pages", "Encrypted", "Page size", "File size", "Optimized",
+ "PDF version")
> re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
+ tags)), collapse = "|"))
> lines <- readLines(outfile, warn = FALSE)
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:UsersRaffaelAppDataLocalTempRtmpquRYX6pdfinfo8d419174450': No such file or direc
Apparently tempfile()
simply doesn't create a file.
> outfile <- tempfile("pdfinfo")
> outfile
[1] "C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6\pdfinfo8d437bd65d9"
The folder C:UsersRaffaelAppDataLocalTempRtmpquRYX6
exists and holds some files but none is named pdfinfo8d437bd65d9
.
See Question&Answers more detail:
os