getHTMLLinks

Get links or names of external files in HTML document


Description

These functions allow us to retrieve either the links within an HTML document, or the collection of names of external files referenced in an HTML document. The external files include images, JavaScript and CSS documents.

Usage

getHTMLLinks(doc, externalOnly = TRUE, xpQuery = "//a/@href",
             baseURL = docName(doc), relative = FALSE)
getHTMLExternalFiles(doc, xpQuery = c("//img/@src", "//link/@href",
                                      "//script/@href", "//embed/@src"),
                     baseURL = docName(doc), relative = FALSE,
                     asNodes = FALSE, recursive = FALSE)

Arguments

doc

the HTML document as a URL, local file name, parsed document or an XML/HTML node

externalOnly

a logical value that indicates whether we should return only links to external documents and omit references to internal anchors/nodes within this document, i.e. those of the form #foo.

xpQuery

a character vector of XPath expressions that match the elements of interest

baseURL

the URL of the container document. This is used to resolve relative references/links.

relative

a logical value indicating whether to leave the references as relative to the base URL or to expand them to their full paths.

asNodes

a logical value that indicates whether we want the actual HTML/XML nodes in the document that reference external documents or just the names of the external documents.

recursive

a logical value that controls whether we recursively process the external documents we find in the top-level document, examining them in turn for their own external files.

Value

getHTMLLinks returns a character vector of the links.

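getHTMLExternalFiles returns a character vector of the names of the external files or, when asNodes is TRUE, the nodes in the document that reference them.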

Author(s)

Duncan Temple Lang

Examples

library(XML)

# The omegahat site is unreliable, so each call is wrapped in try()
try(getHTMLLinks("http://www.omegahat.net"))

try(getHTMLLinks("http://www.omegahat.net/RSXML"))

try(unique(getHTMLExternalFiles("http://www.omegahat.net")))
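
The examples above depend on a remote site being reachable. The following is a minimal offline sketch of how externalOnly and xpQuery affect the result of getHTMLLinks. The HTML fragment and the names html, doc, external, all_links and images are made up for illustration; the document is parsed from a string with htmlParse(asText = TRUE) and relative is left at its default.

```r
library(XML)

# Hypothetical in-memory page: one internal anchor, two ordinary links, one image
html <- '<html><body>
  <a href="#top">Top</a>
  <a href="page.html">Page</a>
  <a href="http://example.com/">Example</a>
  <img src="logo.png"/>
</body></html>'

# Parse the string rather than fetching a URL
doc <- htmlParse(html, asText = TRUE)

# Default: references of the form #foo are dropped
external <- getHTMLLinks(doc)

# Keep internal anchors as well
all_links <- getHTMLLinks(doc, externalOnly = FALSE)

# A different XPath expression selects other attributes, here image sources
images <- getHTMLLinks(doc, xpQuery = "//img/@src")

print(external)
print(all_links)
print(images)
```

Working from a parsed document avoids the network entirely, which makes it easier to check what a given xpQuery matches before pointing the same call at a live URL.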

XML

Tools for Parsing and Generating XML Within R and S-Plus

v3.99-0.10
BSD_3_clause + file LICENSE
Authors
CRAN Team [ctb, cre] (de facto maintainer since 2013), Duncan Temple Lang [aut] (<https://orcid.org/0000-0003-0159-1546>), Tomas Kalibera [ctb]
