blockCV: spatialBlock – R documentation

spatialBlock

Use spatial blocks to separate train and test folds

Description

This function creates spatially separated folds based on a pre-specified distance. It assigns blocks to the training and testing folds randomly, systematically or in a checkerboard pattern. The distance (theRange) should be in metres, regardless of the unit of the reference system of the input data (for more information see the details section). By default, the function creates blocks according to the extent and shape of the study area, assuming that the user has considered the landscape for the given species and case study. Alternatively, blocks can be created from the species spatial data alone. Blocks can also be offset so that the origin is not at the outer corner of the rasters. Instead of providing a distance, blocks can be created by specifying a number of rows and/or columns, which divides the study area into vertical or horizontal bins, as presented in Wenger & Olden (2012) and Bahn & McGill (2012). Finally, the blocks can be specified by a user-defined spatial polygon layer.

Usage

spatialBlock(
  speciesData,
  species = NULL,
  blocks = NULL,
  rasterLayer = NULL,
  theRange = NULL,
  rows = NULL,
  cols = NULL,
  k = 5L,
  selection = "random",
  iteration = 100L,
  numLimit = 0L,
  maskBySpecies = TRUE,
  degMetre = 111325,
  border = NULL,
  showBlocks = TRUE,
  biomod2Format = TRUE,
  xOffset = 0,
  yOffset = 0,
  progress = TRUE,
  verbose = TRUE
)

Arguments

`speciesData`	A simple features (sf) or SpatialPoints object containing species data (response variable).
`species`	Character (optional). The name of the column in which the species data (response variable, e.g. 0s and 1s) is stored. This argument is used to make folds with evenly distributed records. It only works with random fold selection and with binary or multi-class responses, e.g. species presence-absence/background or land cover classes for remote sensing image classification. If `species = NULL`, the response classes are treated the same and only training and testing records are counted and balanced.
`blocks`	An sf or SpatialPolygons object to be used as the blocks (optional). This can be a user-defined polygon layer and it must cover all the species points.
`rasterLayer`	A raster object for visualisation (optional). If provided, this will be used to specify the blocks covering the area.
`theRange`	Numeric value of the specified range by which blocks are created and training/testing data are separated. This distance should be in metres. The range can be explored with the `spatialAutoRange()` and `rangeExplorer()` functions.
`rows`	Integer value by which the area is divided into latitudinal bins.
`cols`	Integer value by which the area is divided into longitudinal bins.
`k`	Integer value. The number of desired folds for cross-validation. The default is `k = 5`.
`selection`	Type of assignment of blocks into folds. Can be random (default), systematic or checkerboard. The checkerboard does not work with user-defined spatial blocks.
`iteration`	Integer value. The number of attempts to create folds that fulfil the set requirement for the minimum number of points in each training and testing fold (for each response class, e.g. train_0, train_1, test_0 and test_1), as specified by the `species` and `numLimit` arguments.
`numLimit`	Integer value. The minimum number of points in each training and testing fold. If `numLimit = 0`, the most evenly dispersed number of records is chosen (given the number of iterations). This option no longer accepts NULL as input. If it is set to NULL, 0 is used instead.
`maskBySpecies`	Since version 1.1, this option is always set to `TRUE`.
`degMetre`	Integer. The number of metres per degree, used to convert `theRange` to degrees. See the details section for more information.
`border`	An sf or SpatialPolygons object used to clip the blocks (optional).
`showBlocks`	Logical. If TRUE, the final blocks with fold numbers are plotted with ggplot. A raster layer can be supplied in the `rasterLayer` argument to serve as the background.
`biomod2Format`	Logical. Creates a matrix of folds that can be directly used in the biomod2 package as a DataSplitTable for cross-validation.
`xOffset`	Numeric value between 0 and 1 for shifting the blocks horizontally. The value is the proportion of block size.
`yOffset`	Numeric value between 0 and 1 for shifting the blocks vertically. The value is the proportion of block size.
`progress`	Logical. If TRUE, shows a progress bar when `numLimit = 0` in random fold selection.
`verbose`	Logical. Whether to print the report of the records per fold.
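
The three `selection` options can be pictured with a toy grid of blocks. The sketch below is only an illustration in base R, not the package's internal code: the grid size, the variable names and the row-wise numbering of blocks are assumptions made for the example.

```r
# Toy illustration (not blockCV internals): assign a 4 x 6 grid of blocks to folds.
n_rows <- 4   # hypothetical number of block rows
n_cols <- 6   # hypothetical number of block columns
k <- 5        # number of folds
n_blocks <- n_rows * n_cols

# Random: shuffle a balanced vector of fold labels across the blocks.
set.seed(42)
random_folds <- sample(rep(seq_len(k), length.out = n_blocks))

# Systematic: cycle through the fold numbers in block order.
systematic_folds <- rep(seq_len(k), length.out = n_blocks)

# Checkerboard: two folds that alternate like the squares of a chessboard.
# expand.grid() varies `col` fastest, so blocks are numbered row by row.
grid_index <- expand.grid(col = seq_len(n_cols), row = seq_len(n_rows))
checker_folds <- (grid_index$row + grid_index$col) %% 2 + 1

# View each assignment as a grid (one cell per block).
matrix(systematic_folds, nrow = n_rows, byrow = TRUE)
matrix(checker_folds, nrow = n_rows, byrow = TRUE)
```

In this sketch the checkerboard pattern always yields two folds, whatever the value of `k`.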

Details

For consistency, all the functions use metres as their unit. In this function, when the input map has a geographic coordinate system (decimal degrees), the block size is calculated by dividing theRange by 111325 (the standard distance of a degree in metres, on the Equator) to change the unit to degrees. This value can be changed by the user via the degMetre argument.
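
As a quick check of the conversion described above, the following base R lines divide a range in metres by the metres-per-degree value. The variable names are chosen for this illustration; the arithmetic is the one stated in the text, not a copy of the package code.

```r
# Convert a block size from metres to decimal degrees, as described above.
the_range <- 70000     # desired block size in metres (value used in the Examples)
deg_metre <- 111325    # default metres per degree (the degMetre argument)

block_size_deg <- the_range / deg_metre
round(block_size_deg, 3)   # about 0.629 degrees
```

Passing a different value for `degMetre` changes the resulting block size proportionally.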

The xOffset and yOffset arguments can be used to change the spatial position of the blocks. They can also be used to assess the sensitivity of the analysis results to shifts in the blocking arrangement. These options are available when theRange is defined. By default the region is located in the middle of the blocks; setting the offsets shifts the blocks.
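
Because the offsets are proportions of the block size, the shift in map units is the offset multiplied by the block size. The sketch below shows that arithmetic for a hypothetical row of block edges along one axis. The starting coordinate, the direction of the shift and the variable names are assumptions, and the package's own placement of the grid may differ.

```r
# Illustration only: how a proportional offset moves block edges along one axis.
block_size <- 70000    # block width in metres
x_offset <- 0.3        # proportion of the block size (between 0 and 1)
x_min <- 0             # hypothetical left edge of the unshifted grid

shift <- x_offset * block_size        # shift expressed in metres
edges <- x_min + block_size * 0:4     # unshifted block edges
shifted_edges <- edges + shift        # every edge moves by the same amount

shift
shifted_edges
```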

Roberts et al. (2017) suggest that blocks should be substantially bigger than the range of spatial autocorrelation (in model residuals) to obtain realistic error estimates, while a buffer with the size of the spatial autocorrelation range would result in a good estimation of error. This is because of the so-called edge effect (O'Sullivan & Unwin, 2010), whereby points located on the edges of the blocks of opposite sets are not separated spatially. Blocking with a buffering strategy overcomes this issue (see buffering).

Value

An S3 object: a list containing:

folds - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
foldID - a vector of values indicating the number of the fold for each observation (each number corresponds to the same point in species data)
biomodTable - a matrix with the folds to be used in biomod2 package
k - number of the folds
blocks - SpatialPolygon of the blocks
range - the distance band separating training and testing folds, if provided
species - the name of the species (column), if provided
plots - ggplot object
records - a table with the number of points in each category of training and testing
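
To show how the `folds` and `foldID` elements relate, the sketch below builds a small stand-in list by hand with the layout described above and loops over it. The fold assignments and the `toy` object are invented for illustration; an object returned by `spatialBlock()` could be used the same way, assuming the structure documented here.

```r
# Hand-made stand-in with the documented layout (not produced by spatialBlock()).
fold_id <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1)   # invented fold number per observation
k <- max(fold_id)

# Each fold holds two vectors: training indices first, testing indices second.
folds <- lapply(seq_len(k), function(i) {
  list(which(fold_id != i), which(fold_id == i))
})
toy <- list(folds = folds, foldID = fold_id, k = k)

# Typical cross-validation loop over the folds.
for (i in seq_len(toy$k)) {
  train_idx <- toy$folds[[i]][[1]]
  test_idx  <- toy$folds[[i]][[2]]
  cat("fold", i, ":", length(train_idx), "training,",
      length(test_idx), "testing\n")
}
```

Inside the loop, `train_idx` and `test_idx` would normally be used to subset the modelling data before fitting and evaluating a model.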

References

Bahn, V., & McGill, B. J. (2012). Testing the predictive performance of distribution models. Oikos, 122(3), 321-331.

O'Sullivan, D., & Unwin, D. J. (2010). Geographic Information Analysis, 2nd ed. John Wiley & Sons.

Roberts et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40, 913-929.

Wenger, S. J., & Olden, J. D. (2012). Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods Ecol. Evol., 3, 260-267.

Examples

# load packages
library(blockCV)
library(sf)

# import raster data
awt <- raster::brick(system.file("extdata", "awt.grd", package = "blockCV"))
# import presence-absence species data
PA <- read.csv(system.file("extdata", "PA.csv", package = "blockCV"))
# make a sf object from data.frame
pa_data <- sf::st_as_sf(PA, coords = c("x", "y"), crs = raster::crs(awt))

# spatial blocking by specified range and random assignment
sb1 <- spatialBlock(speciesData = pa_data,
                    species = "Species",
                    theRange = 70000,
                    k = 5,
                    selection = "random",
                    iteration = 100,
                    numLimit = 0, # choose the most evenly dispersed folds
                    biomod2Format = TRUE,
                    xOffset = 0.3, # shift the blocks horizontally
                    yOffset = 0)

# spatial blocking by row/column and systematic fold assignment
sb2 <- spatialBlock(speciesData = pa_data,
                    species = "Species",
                    rasterLayer = awt,
                    rows = 5,
                    cols = 8,
                    k = 5,
                    selection = "systematic",
                    biomod2Format = TRUE)

blockCV

Spatial and Environmental Blocking for K-Fold Cross-Validation

v2.1.1

GPL-3

Authors

Roozbeh Valavi [aut, cre], Jane Elith [aut], José Lahoz-Monfort [aut], Gurutzeta Guillera-Arroita [aut]

Initial release

2020-02-16