Use spatial blocks to separate train and test folds
This function creates spatially separated folds based on a pre-specified distance. It assigns blocks to the training and
testing folds randomly, systematically or in a checkerboard pattern. The distance (theRange) should be in metres,
regardless of the unit of the reference system of the input data (for more information see the details section). By default,
the function creates blocks according to the extent and shape of the study area, assuming that the user has considered the
landscape for the given species and case study. Alternatively, blocks can be created based solely on the species spatial data.
Blocks can also be offset so that the origin is not at the outer corner of the rasters. Instead of providing a distance,
the blocks can also be created by specifying a number of rows and/or columns to divide the study area into vertical or
horizontal bins, as presented in Wenger & Olden (2012) and Bahn & McGill (2012).
Finally, the blocks can be specified by a user-defined spatial polygon layer.
spatialBlock(
  speciesData,
  species = NULL,
  blocks = NULL,
  rasterLayer = NULL,
  theRange = NULL,
  rows = NULL,
  cols = NULL,
  k = 5L,
  selection = "random",
  iteration = 100L,
  numLimit = 0L,
  maskBySpecies = TRUE,
  degMetre = 111325,
  border = NULL,
  showBlocks = TRUE,
  biomod2Format = TRUE,
  xOffset = 0,
  yOffset = 0,
  progress = TRUE,
  verbose = TRUE
)
speciesData |
A simple features (sf) or SpatialPoints object containing species data (response variable). |
species |
Character (optional). Indicating the name of the column in which species data (response variable e.g. 0s and 1s) is stored.
This argument is used to make folds with evenly distributed records. This option only works with random fold selection and with binary or
multi-class responses, e.g. species presence-absence/background or land cover classes for remote sensing image classification.
If |
blocks |
A sf or SpatialPolygons object to be used as the blocks (optional). This can be a user defined polygon and it must cover all the species points. |
rasterLayer |
A raster object for visualisation (optional). If provided, this will be used to specify the blocks covering the area. |
theRange |
Numeric value of the specified range by which blocks are created and training/testing data are separated.
This distance should be in metres. The range could be explored by |
rows |
Integer value by which the area is divided into latitudinal bins. |
cols |
Integer value by which the area is divided into longitudinal bins. |
k |
Integer value. The number of desired folds for cross-validation. The default is k = 5. |
selection |
Type of assignment of blocks into folds. Can be random (default), systematic or checkerboard. The checkerboard does not work with user-defined spatial blocks. |
iteration |
Integer value. The number of attempts to create folds that fulfil the set requirement for minimum number
of points in each training and testing fold (for each response class e.g. train_0, train_1, test_0
and test_1), as specified by |
numLimit |
Integer value. The minimum number of points in each training and testing fold.
If |
maskBySpecies |
Since version 1.1, this option is always set to |
degMetre |
Integer. The conversion rate of metres to degree. See the details section for more information. |
border |
A sf or SpatialPolygons object to clip the block based on it (optional). |
showBlocks |
Logical. If TRUE the final blocks with fold numbers will be created with ggplot and plotted. A raster layer could be specified
in |
biomod2Format |
Logical. Creates a matrix of folds that can be directly used in the biomod2 package as a DataSplitTable for cross-validation. |
xOffset |
Numeric value between 0 and 1 for shifting the blocks horizontally. The value is the proportion of block size. |
yOffset |
Numeric value between 0 and 1 for shifting the blocks vertically. The value is the proportion of block size. |
progress |
Logical. If TRUE shows a progress bar when |
verbose |
Logical. If TRUE, prints a report of the records per fold. |
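The three options for the selection argument can be pictured with a small base R sketch. This is only an illustration on a hypothetical 4 x 5 grid of blocks, not the package's internal code; the names `blocks`, `n_rows` and `n_cols` are invented here, and the two-fold checkerboard is an assumption of the sketch.

```r
set.seed(42)  # make the random assignment reproducible

n_rows <- 4
n_cols <- 5
k <- 5

# One row per block, ordered column by column within each row of the grid
blocks <- expand.grid(col = seq_len(n_cols), row = seq_len(n_rows))
n_blocks <- nrow(blocks)

# Random: shuffle a balanced vector of fold numbers across the blocks
blocks$random <- sample(rep(seq_len(k), length.out = n_blocks))

# Systematic: walk through the blocks in order and cycle through 1..k
blocks$systematic <- rep(seq_len(k), length.out = n_blocks)

# Checkerboard: two folds alternating like the squares of a chessboard
blocks$checkerboard <- (blocks$row + blocks$col) %% 2 + 1

print(head(blocks))
```

In this sketch every fold receives the same number of blocks under the random and systematic schemes, whereas the checkerboard pattern only ever produces two folds.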
For consistency, all the functions use metres as their unit. In this function, when the input map
has a geographic coordinate system (decimal degrees), the block size is calculated by dividing theRange by
111325 (the standard distance of a degree in metres, on the Equator) to change the unit to degrees. This value
can be changed by the user via the degMetre argument.
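As a rough illustration of this conversion, the sketch below computes a block size in decimal degrees from a range in metres. The variable names are illustrative only, and the arithmetic is simply the division described above rather than a copy of the package internals.

```r
# Range in metres, as it would be passed to theRange
the_range <- 70000

# Default conversion rate: metres per degree on the Equator
deg_metre <- 111325

# Block size in decimal degrees for an unprojected (longitude/latitude) input
block_size_deg <- the_range / deg_metre
print(round(block_size_deg, 4))
```

Because a degree of longitude covers less ground away from the Equator, a study at high latitude might justify a different degMetre value.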
The xOffset and yOffset arguments can be used to change the spatial position of the blocks. They can also be used to
assess the sensitivity of analysis results to shifts in the blocking arrangement. These options are available when theRange
is defined. By default the region is located in the middle of the blocks; setting the offsets shifts the blocks.
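To make the idea of an offset as a proportion of block size concrete, here is a one-dimensional sketch. The helper `block_edges` is hypothetical and is not part of blockCV. It anchors the grid at the minimum coordinate instead of centring the region, and the direction of the shift is an assumption made for the illustration.

```r
# Hypothetical helper: left edges of the blocks along one axis.
# `offset` is a proportion of the block size, between 0 and 1.
block_edges <- function(min_coord, max_coord, block_size, offset = 0) {
  stopifnot(offset >= 0, offset <= 1, block_size > 0)
  # Move the grid origin back by offset * block_size so that the
  # shifted grid still covers the minimum coordinate
  origin <- min_coord - offset * block_size
  seq(origin, max_coord, by = block_size)
}

edges_default <- block_edges(0, 100, block_size = 30)
edges_shifted <- block_edges(0, 100, block_size = 30, offset = 0.3)

print(edges_default)
print(edges_shifted)
```

Re-running an analysis over several offsets and comparing the resulting error estimates is one way to check how sensitive the results are to the position of the blocks.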
Roberts et al. (2017) suggest that blocks should be substantially bigger than the range of spatial
autocorrelation (in model residuals) to obtain realistic error estimates, while a buffer with the size of
the spatial autocorrelation range would result in a good estimation of error. This is because of the so-called
edge effect (O'Sullivan & Unwin, 2010), whereby points located on the edges of blocks belonging to opposite sets are
not separated spatially. Blocking with a buffering strategy overcomes this issue (see buffering).
An object of class S3. A list of objects including:
folds - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
foldID - a vector of values indicating the number of the fold for each observation (each number corresponds to the same point in species data)
biomodTable - a matrix with the folds to be used in biomod2 package
k - number of the folds
blocks - SpatialPolygon of the blocks
range - the distance band separating training and testing folds, if provided
species - the name of the species (column), if provided
plots - ggplot object
records - a table with the number of points in each category of training and testing
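The sketch below shows one way the folds element could drive a cross-validation loop. It uses a small mock list that mimics the documented structure (training indices first, testing indices second) so that it runs without the package; the object `sb_mock`, the toy data and the linear model are all made up for illustration.

```r
# Mock object mimicking the documented structure of the returned list.
# A real one would come from spatialBlock(); these indices are invented.
sb_mock <- list(
  folds = list(
    list(c(3L, 4L, 5L, 6L), c(1L, 2L)),
    list(c(1L, 2L, 5L, 6L), c(3L, 4L)),
    list(c(1L, 2L, 3L, 4L), c(5L, 6L))
  ),
  foldID = c(1L, 1L, 2L, 2L, 3L, 3L),
  k = 3L
)

# Toy data with one row per observation, in the same order as the species data
dat <- data.frame(
  y = c(0, 1, 0, 1, 1, 0),
  x = c(0.2, 0.9, 0.1, 0.8, 0.7, 0.3)
)

errors <- numeric(sb_mock$k)
for (i in seq_len(sb_mock$k)) {
  train_idx <- sb_mock$folds[[i]][[1]]  # first vector: training indices
  test_idx  <- sb_mock$folds[[i]][[2]]  # second vector: testing indices
  fit <- lm(y ~ x, data = dat[train_idx, ])
  pred <- predict(fit, newdata = dat[test_idx, ])
  errors[i] <- mean((dat$y[test_idx] - pred)^2)  # mean squared error per fold
}

print(errors)
```

In the mock object, foldID gives, for each observation, the fold in which it is used for testing, so it carries the same information as the testing vectors in folds.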
Bahn, V., & McGill, B. J. (2012). Testing the predictive performance of distribution models. Oikos, 122(3), 321-331.
O'Sullivan, D., Unwin, D.J., (2010). Geographic Information Analysis, 2nd ed. John Wiley & Sons.
Roberts et al., (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 40: 913-929.
Wenger, S.J., Olden, J.D., (2012). Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods Ecol. Evol. 3, 260-267.
spatialAutoRange and rangeExplorer for selecting block size; buffering and envBlock for alternative blocking strategies;
foldExplorer for visualisation of the generated folds.
For DataSplitTable see BIOMOD_cv in the biomod2 package.
# load packages
library(blockCV)
library(sf)

# load package data
awt <- raster::brick(system.file("extdata", "awt.grd", package = "blockCV"))

# import presence-absence species data
PA <- read.csv(system.file("extdata", "PA.csv", package = "blockCV"))

# make a sf object from data.frame
pa_data <- sf::st_as_sf(PA, coords = c("x", "y"), crs = raster::crs(awt))

# spatial blocking by specified range and random assignment
sb1 <- spatialBlock(speciesData = pa_data,
                    species = "Species",
                    theRange = 70000,
                    k = 5,
                    selection = "random",
                    iteration = 100,
                    numLimit = NULL,
                    biomod2Format = TRUE,
                    xOffset = 0.3, # shift the blocks horizontally
                    yOffset = 0)

# spatial blocking by row/column and systematic fold assignment
sb2 <- spatialBlock(speciesData = pa_data,
                    species = "Species",
                    rasterLayer = awt,
                    rows = 5,
                    cols = 8,
                    k = 5,
                    selection = "systematic",
                    biomod2Format = TRUE)