Along with having a flexible toolkit, data science regularly requires out-of-the-box thinking (at least in my profession).
But, first, a thing about PDF files.
I don't think they are what you think they are. "Bold" (or "italic", etc.) isn't "metadata". You should spend some time reading up on PDF files because they are complex, nasty, evil things that you are likely to encounter often when working with data. Read https://stackoverflow.com/a/19777953/1457051 to see what finding bold text actually entails (follow the link to the 1.8.x Java pdfbox solution).
Back to our irregularly scheduled answering
While I'm one of the YUGEst proponents of R, not everything needs to be done or should be done in R. Sure, we'll use R to eventually get your bold text but we'll use a helper command-line utility to do so.
The pdftools package is based on the poppler library. It comes with the source so "I'm just an R user" folks likely don't have the full poppler toolset on their system.
Mac folks can use Homebrew (once you get Homebrew set up):
brew install poppler
Linux folks know how to do things. Windows folks are lost forever (there are poppler binaries for you, but your time would be better spent switching to a real operating system).
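If you're not sure whether the install took, here's a minimal sketch for checking from inside R. It assumes the usual poppler utility names (`pdftohtml`, `pdftotext`, `pdfinfo`); only `pdftohtml` is actually needed for this answer, and the `tools` and `found` names are just for illustration.

```r
# A few command-line tools that normally ship with poppler
# (assumption: your package manager installs them under these names)
tools <- c("pdftohtml", "pdftotext", "pdfinfo")

# Sys.which() returns "" for anything it can't find on the PATH
found <- Sys.which(tools)

# one row per tool, with where it lives and whether it was found
data.frame(
  tool = names(found),
  path = unname(found),
  installed = nzchar(found),
  stringsAsFactors = FALSE
)
```

If `installed` is `FALSE` for `pdftohtml`, the helper function further down will stop with an error rather than try to run it.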
Once you do that, you can use the below to achieve your goal.
First, we'll make a helper function with lots of safety bumpers:
#' Uses the command-line pdftohtml function from the poppler library
#' to convert a PDF to HTML and then read it in with xml2::read_html()
#'
#' @md
#' @param path the path to the file; [path.expand()] will be run on this value
#' @param extra_args extra command-line arguments to be passed to `pdftohtml`.
#'   They should be supplied as you would supply arguments to the `args`
#'   parameter of [system2()].
read_pdf_as_html <- function(path, extra_args=character()) {
  # make sure poppler/pdftohtml is installed
  pdftohtml <- Sys.which("pdftohtml")
  if (pdftohtml == "") {
    stop("The pdftohtml command-line utility must be installed.", call.=FALSE)
  }
  # make sure the file exists; use an absolute path since we change
  # the working directory below and a relative path would stop resolving
  path <- normalizePath(path.expand(path), mustWork = FALSE)
  stopifnot(file.exists(path))
  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")
  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })
  # we're going to do the conversion in a temp directory space
  td <- tempfile(fileext = "_dir")
  dir.create(td)
  # save our current working directory
  curwd <- getwd()
  # on exit: go back to where we started first, then remove the temp space
  on.exit(setwd(curwd), add=TRUE)
  on.exit(unlink(td, recursive=TRUE), add=TRUE)
  # move to the temp space
  setwd(td)
  file.copy(path, td)
  # collect the extra arguments
  c(
    "-i" # ignore images
  ) -> args
  # system2() does not quote args, so quote the file name in case it has spaces;
  # with the "r-doc" prefix pdftohtml writes the page content to r-docs.html
  args <- c(args, extra_args, shQuote(basename(path)), "r-doc")
  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")
  # we'll let stderr display so you can debug errors
  system2(
    command = pdftohtml,
    args = args,
    stdout = TRUE
  ) -> res
  # the last line of output looks like "Page-N"; keep just the N
  res <- gsub("^Page-", "", res[length(res)])
  message("Converted ", res, " pages")
  # this will need to be changed if poppler ever does anything different
  xml2::read_html("r-docs.html")
}
Now, we'll use it:
doc <- read_pdf_as_html("~/Data/Mulla__Indian_Contract_Act2018-11-12_01-00.PDF")
bold_tags <- html_nodes(doc, xpath=".//b")
bold_words <- html_text(bold_tags)
head(bold_words, 20)
## [1] "Preamble"
## [2] "WHEREAS it is expedient to define and amend certain parts of the law relating to contracts;"
## [3] "History"
## [4] "Ancient and Medieval Period"
## [5] "The Introduction of English Law Into India"
## [6] "Mofussal Courts"
## [7] "Legislation"
## [8] "The Indian Contract Act 1872"
## [9] "The Making of the Act"
## [10] "Law of Contract Until 1950"
## [11] "The Law of Contract after 1950"
## [12] "Amendments to This Act"
## [13] "Other Laws Affecting Contracts and Enforcement"
## [14] "Recommendations of the Indian Law Commission"
## [15] "Section 1."
## [16] "Short title"
## [17] "Extent, Commencement."
## [18] "Enactments Repealed."
## [19] "Applicability of the Act"
## [20] "Scheme of the Act"
length(bold_words)
## [1] 1939
No Java required at all and you've got your bold words.
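You'll likely want to tidy the result before using it, since bold runs can repeat (e.g. running headers) or contain nothing but whitespace. Here's a minimal sketch of that cleanup, run against a made-up HTML fragment that only loosely mimics `pdftohtml` output; it is not taken from the real PDF, and the `html` and `bold` names are just for illustration.

```r
library(xml2)

# A made-up fragment (assumption: real pdftohtml output is much messier)
html <- '<html><body>
<b>Preamble</b><br/>
WHEREAS it is expedient...<br/>
<i><b>History </b></i><br/>
<b> </b><br/>
<b>Preamble</b><br/>
</body></html>'

doc <- read_html(html)

# same XPath as above: every <b> element anywhere in the document,
# including ones nested inside other tags such as <i>
bold <- xml_text(xml_find_all(doc, ".//b"))

# tidy up: trim whitespace, drop empty runs, keep the first copy of each
bold <- trimws(bold)
bold <- unique(bold[nzchar(bold)])

bold
```

Whether de-duplicating is the right call depends on your document: if the same bold heading legitimately appears in several sections, skip the `unique()` step.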
If you do want to go the pdfbox-app route as Ralf noted, you can use this wrapper to make it easier to work with:
read_pdf_as_html_with_pdfbox <- function(path) {
  java <- Sys.which("java")
  if (java == "") {
    stop("Java binary is not on the system PATH.", call.=FALSE)
  }
  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(httr, warn.conflicts = FALSE, quietly = TRUE)
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })
  path <- path.expand(path)
  stopifnot(file.exists(path))
  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")
  # download the pdfbox "app" if not installed
  if (!dir.exists("~/.pdfboxjars")) {
    message("~/.pdfboxjars not found. Creating it and downloading pdfbox-app jar...")
    dir.create("~/.pdfboxjars")
    # central.maven.org has been retired; repo1.maven.org serves the same artifacts
    httr::GET(
      url = "https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox-app/2.0.12/pdfbox-app-2.0.12.jar",
      httr::write_disk(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar")),
      httr::progress()
    ) -> res
    httr::stop_for_status(res)
  }
  # we're going to write the converted HTML to a temp file
  tf <- tempfile(fileext = ".html")
  on.exit(unlink(tf), add=TRUE)
  # system2() does not quote args, so quote every path in case of spaces
  c(
    "-jar",
    shQuote(path.expand(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar"))),
    "ExtractText",
    "-html",
    shQuote(path),
    shQuote(tf)
  ) -> args
  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")
  system2(
    command = java,
    args = args
  ) -> res
  xml2::read_html(tf)
}