maskRangeR: Improving spatial and temporal accuracy of species distribution models

Cory Merow, Peter J. Galante, Cecina Babich Morrow, Jamie M. Kass, Alex Moore, Valentina Grisales-Betancur, Beth Gerstner, Jorge Velasquez, Robert P. Anderson, Mary E. Blair

2021-03-15

1 Introduction

The goal of masking range maps (or species distribution models) is to improve spatial (and possibly temporal) accuracy by incorporating different types of information. Often statistically-based SDMs are used to identify suitable environmental conditions, as the models detect locations with similar environmental conditions to where the species was observed. Often, these modeled maps are best interpreted as potential distributions since other processes besides abiotic conditions may restrict the distribution, such as dispersal, biotic interactions, social dynamics, historical contingencies, or human modification of habitat. In contrast, realized distributions describe those locations that are actually occupied by the species (at some spatial resolution) and may be smaller than potential distributions. By starting with potential distributions from any form of SDM, we can use other pieces of information to refine these models to approach realized distributions.

Here, we illustrate methods for estimating realized distributions based on different types of data. Additional information not included as a predictor in the SDM can be described by masks that differentiate potential habitat from non-habitat. They represent a conceptual model usually relating to a single variable, e.g., forest cover, occurrence of competing species, dispersal. When appropriate data are not available (e.g. climate data matching occurrence record collection dates), masks can be stacked on existing SDMs to determine places that both models designate as suitable habitat. Using a series of masks is a type of ensemble modeling where complete agreement is required between all component models to denote a location as suitable. Below, we illustrate cases based on forest cover, expert opinion, multiple environmental layers, and occurrence of competing species.

Throughout the package, we use the following definitions:

Potential distribution is used to describe maps used as input. The term ‘potential’ is used because the map is presumed to include possibly suitable locations. These maps may well be the result of a model that includes processes such as dispersal, biotic interactions, social dynamics, historical contingencies, or human modification of habitat, but here we assume that there is further information that can be used to refine the range.

Realized distribution is used to describe output maps resulting from masking of potential distributions. While the outputs may still include some ommission/commission errors, they aim to better estimate the realized distribution relative to the input maps by accounting for some additional source of information, e.g. land cover, expert opinion, biotic factors, etc.

In the following use cases, we describe distribution modeling scenarios that require post-processing of the potential distribution to better estimate a species’ realized distribution. This post-processing is carried out in the form of masks based on expert opinion or data. For each scenario, we list the components needed to carry out the analysis and provide examples of the process.

To download example data used in this vignette, click here

2 Data Driven: Land Cover Mask

2.1 Issue

When creating SDMs, best practice dictates using environmental data that temporally match the occurrence records. However, this is not always realistic. Often, many occurrence records are from museum specimens that do not have relevant environmental data, e.g. forest cover. In such cases, recent occurrence records may be temporally matched to recent environmental data in order create a mask of values outside of observed ranges. This mask can be applied to the SDM to remove those geographic areas that are unsuitable for the species.

2.2 Needs

These analyses require the following data:

  1. SDM: a previously generated species distribution model.
  2. Dated occurrence records: These can be dated yearly, monthly, or daily.
  3. Recent environmental data: These can be dated yearly, monthly, or daily.

2.3 Background

The Olinguito (Bassaricyon neblina) is a new species of carnivore described in 2013 from museum specimens that were previously identified as a different species [1]. The Olinguito lives in Northern Andean cloud forests in Colombia and Ecuador, and, according to experts, has strict tolerances for forest cover [2]. Recent occurrence records allow for very simple data-driven masking based on observed ranges of forest cover, but the number of occurrences (n = 18) is likely too few to use in model calibration. The data-driven determination of thresholds for masking using recent records represents a simple, conservative methodology that can be used for many species with limited recent records, and also will prove useful for processing of expert range maps or other pre-existing range estimates that do not take into account human modifications of the environment.

2.4 Methods

A 30 arc-second Bioclimatic-variable climate-based Maxent [1] SDM was tuned for the Olinguito using a mix of historical data paired with field notes, as well as contemporary citizen science records associated with photo-vouchers. The model was trained on 0.7 degree buffered circles around occurrence records and then projected to a 5-degree bounding box around occurrence records. We then removed areas from the SDM representing biogeographic regions where the species has not been documented, like the Eastern Cordillera of Colombia and other small non-contiguous areas.

Next, we used remotely sensed data of the annual percent forest cover for the region (MODIS Vegetation Continuous Fields; 250 m from the years matching the most recent occurrence records: 2006-2016). The MODIS rasters from different years were not all aligned with one another and had to be realigned to create a raster stack. A subset of the more recent occurrence records (2006-2016) were then temporally matched with the forest cover rasters to obtain values of forest cover per occurrence (one record was excluded because it was based on an observation that stemmed from a likely roadkilled specimen [2]). The forest cover values at occurrence records were then assessed across all years, and the 25th quantile forest cover level was used to generate a forest cover mask for the most recent year (2010). After resampling (e.g., making the resolutions match) the SDM to match the resolution of the MODIS data, we applied this forest cover mask to reduce the SDM’s spatial prediction to only those climatically suitable areas where forest cover is above the threshold. This reduced model shows a conservative estimation of the species range given current non-climatic variables that represent anthropogenic modification of land cover.

2.5 References

[1] Helgen, K.M., Miguel Pinto, C., Kays, R., Helgen, L.E., Tsuchiya, M. T. N., Quinn, A., Wilson, D.E., Maldonado, J.E. (2013) Taxonomic revision of the olingos (Bassaricyon), with description of a new species, the Olinguito. Zookeys, 324, 1-83. https://doi.org/10.3897/zookeys.324.5827.

[2] Saavedra-Rodríguez, C. & Rojas-Díaz, V. (2011) Bassaricyon gabbii Allen, 1876 (Carnivora: Procyonidae): new distribution point on western range of Colombian Andes. Check List 7(4), 505–507. http://dx.doi.org/10.15560/7.4.505.

[3] Phillips, S.J., Anderson, R.P., Schapire, R.E. (2006) Maximum entropy modeling of species geographic distributions. Ecological Modeling, 190, 231 - 259.

2.6 Example

By setting the pathway to where you saved our canned datasets, access that data by setting that as the working directory.

library(raster)
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.0.3
library(maskRangeR)
## To get started, see the demos with 
## vignette(package='maskRangeR')
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:raster':
## 
##     intersect, union
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## Remember that the 'path' is the directory in which you saved the demonstration data 
dataDir = paste0(path, '/dataDriven/olinguito/')

Read in the environmental and occurrence data. These can be dated yearly, monthly, or daily, but must match between dated occurrence records and dated recent environmental data.

dateScale <- "year" # or "month", "day"
env <- stack(list.files(path = dataDir, pattern = "olinguito_Modis", full.names = T))
# this should be a formal date object of class "POSIXct" "POSIXt"
envDates <- parse_date_time((c('2005','2006','2008','2009','2010')),orders = c("Y", "Ym"))
datedOccs <- read.csv(paste0(dataDir,'All_new_records_by_year.csv'))
sdm <- raster(paste0(dataDir,'olinguito_SDM.tif'))
# crop climate data to study region
env <- crop(env, extent(sdm))

NOTE: SDM was resampled to match the resolution of MODIS remotely sensed data.

The names of our env rasters reflect the dates (here, only years). Check that the rasters represent the same dates as envDates.

names(env)
## [1] "X2005_olinguito_Modis" "X2006_olinguito_Modis" "X2008_olinguito_Modis"
## [4] "X2009_olinguito_Modis" "X2010_olinguito_Modis"
envDates
## [1] "2005-01-01 UTC" "2006-01-01 UTC" "2008-01-01 UTC" "2009-01-01 UTC"
## [5] "2010-01-01 UTC"

Similarly, your dated occurrences must have a column named “date”. These will be converted to class ‘POSIXct’.

head(datedOccs)
##        long       lat date
## 1 -78.98300 -0.583000 2006
## 2 -79.00000 -0.417000 2006
## 3 -78.68400 -0.021000 2005
## 4 -78.73718 -0.014437 2013
## 5 -78.68832 -0.000220 2013
## 6 -78.71200  0.022600 2010

Prepare the data.

# convert dates to formal POSIXct objects
datedOccs$date <- parse_date_time(datedOccs$date,orders = c("Y", "Ym"))
# convert to spatial object
coordinates(datedOccs) <- c('long','lat')
projection(datedOccs) <- projection(env)

Ensure coordinate reference system, extent, and resolution are the same.

identical(crs(sdm), crs(env))
## [1] TRUE
identical(extent(sdm), extent(env))
## [1] TRUE
identical(res(sdm), res(env))
## [1] TRUE

To perform data-driven masking, first find the values of the environmental variable (forest cover in this example) at the occurrence point locations.

datedOccs <- maskRangeR::annotate(datedOccs, env, envDates, dateScale)
## Environmental layers were missing for date 2013
## Environmental layers were missing for date 2015
## Environmental layers were missing for date 2016
# Note the warnings:
# [1] "Environmental layers were missing for date 2013"
# [1] "Environmental layers were missing for date 2016"
# [1] "Environmental layers were missing for date 2015"
# These indicate that you have occurrence records that do not match some environmental data.
# You may want to look into this to determine whether you expected to provide environmental layers associated with these dates. But if you didn't mean to provide such layers (e.g., here we did not provide layers for 2013, 2015, and 2016) it is not an issue for using the package.

Then find a suitable lower limit for the mask based on the data.

## Take a look at the forest cover values at each occurrence point
sort(datedOccs$env)
## [1]  2.959184 46.836735 59.836735 62.653061 63.632652 71.000000
## Use quantiles to find a suitable lower limit
(bounds <- quantile(datedOccs$env,prob=c(0,.025,.25,.5,.75,.975,1),na.rm=T))
##        0%      2.5%       25%       50%       75%     97.5%      100% 
##  2.959184  8.443878 50.086735 61.244898 63.387754 70.079082 71.000000

Note that here, there is an occurrence record showing 4% forest cover. This occurrence record comes from a roadkilled specimen, so it is likely not representative of necessary forest cover. To simplify for this vignette, we use the 25% observed forest cover as the lower limit.

Create the mask and mask the potential distribution using the most recent year’s forest cover.

logicString <- paste0('maskLayers>',quantile(datedOccs$env,prob=.25,na.rm=T))
forest <- env[[5]]
names(forest) <- 'forest'
maskedDist <- maskRanger(initialDist=sdm,maskLayers=forest,
                                     logicString=logicString)

If you additionally wanted to include an upper bound on the masking layer, this can be acheived with another call to maskRanger. Although we do not have expert knowledge of an upper limit on forest cover for this species, we pretend that there is an upper bound at the .999 percentile to illustrate how the code would work. In this call, the new potentialDist is the output of the previous call to maskRanger.

logicString <- paste0('maskLayers<',quantile(datedOccs$env,prob=.999,na.rm=T))
forest <- env[[5]]
names(forest) <- 'forest'
output <- maskRanger(initialDist=maskedDist$refinedDist,
                                     maskLayers=forest,logicString=logicString)

Note that all maskRanger() outputs follow the convention that the output is a stack with the refined distribution, the intial distribution, and any layers used to add or subtract regions to obtain the refined distribution.

To be sure the masking worked, check that there are now fewer cells in the masked map.

(nOccupiedCells <- apply(values(maskedDist[[1:2]]),2,sum,na.rm=T))
## refinedDist initialDist 
##    8339.861   25767.904

Plot the results. The first map shows the percent forest cover across the study region in the most recent year. The second map shows regions where forest cover is above the 25th quantile threshold in green and the potential distribution from the SDM in grey. The third map compares the potential distribution (grey) to the masked distribution (red).

par(mfrow = c(1,3))
plot(forest,main = 'Raw Forest Data')
plot(output[['forestMask']],main = 'Forest Mask and Initial Distribution')
plot(sdm,add = T,col = c(grey(0,0),grey(.4,.7)))
plot(output$refinedDist,col = c(grey(.6),'red1'),main = 'Masked Distribution')

3 Expert Driven: Masking by expert maps

3.1 Issue

Range models may omit areas of presence, or include areas where the species does not exist. Expert knowledge can improve the realism of range models, either by adding suitable regions to a model or removing unsuitable ones.

3.2 Needs

These analyses require the following data:

  1. Expert defined area: This may be an area to add or remove to the species’ distribution.
  2. SDM: This may be from an IUCN assessment, a thresholded SDM, or similar binary range indicator.

3.3 Background

3.3.1 BioModelos

BioModelos is a web application for the collaborative development of species distribution models. In BioModelos, taxonomy/ecology/biogeography experts aid in

  1. cleaning occurrence data
  2. identifying species’ suitable land covers
  3. selecting adequate omission thresholds to create binary models
  4. identifying species’ accessible areas as well as areas of model over and underprediction
  5. qualitatively evaluating the biological realism of resulting model predictions.

A complete description of BioModelos is available here.

Here, we provide an example of 4. indentifying species’ accessible areas as well as areas of model over and underprediction. Using the web tools available in BioModelos, an expert may draw a polygon on top of a geographic viewer to indicate A) a species’ accessible area, B) an area that must be removed from a model, or C) an area that must be added to a model.

3.3.2 Howler Monkey Example

The Howler Monkey, Alouatta seniculus, exemplifies a situation in which expert opinion was gathered using BioModelos to make range estimates more realistic. This species occurs in Colombia, Ecuador, Peru, Venezuela and Brazil (IUCN 2018). In Colombia, howler monkeys are widely distributed, occurring from 0-3200 meters. In the Choco biogeographic region, however, they are absent. This pattern is likely due to the dominance of closed canopies in the Choco rather than to any barrier to A. seniculus dispersal (Defler, 2010) since the mantled howler Alouatta palliata is better adapted to closed canopies than is A. seniculus.

3.4 Methods

Records for A. seniculus in Colombia were curated independently by the Colombian Primatology Association and records from outside of Colombia were curated by the Primates Expert Group of BioModelos using the web app. A presence-only model for the distribution of this species was built using a background area spanning from north Peru to south Costa Rica and eastern Ecuador to Venezuela. This region was selected because it contains no known geographic barriers limiting the species’ dispersal.

The ENMeval package was used to select optimum feature classes/regularization values for the Maxent SDM. Records were partitioned using checkerboard2. The final model was run using all available occurrences and the regularization and feature class combination with the highest \(AUC_{test}\) in model testing. Experts in the primates group selected the minimum training presence threshold as the threshold that best represented the prevalence of suitable climatic conditions for the species in the country. Once the model was thresholded, they also identified areas of model over- and underprediction. Therefore models were post-processed, as described here.

The following steps are possible ways to improve the realism of a potential distribution based on expert opinion:

  1. Add polygon: in some cases an expert may consider that a model underpredicts the suitable area for a species. For example, when modeling the distribution of A. seniculus in Colombia, there were gaps in the Amazon that were judged by experts as errors. Those missing areas can be added to the model by directly adding a polygon onto a GIS, indicating that model values in the area should be 1, or present.
  2. Remove polygon (not shown here): in some other cases, experts may consider that the model is predicting an area where the species’ presence is not possible. Again, for A. seniculus, experts considered that predictions of the distribution of this species in the Guajira peninsula were unlikely. They opted to remove that area from the model by drawing a polygon on the map indicating that in the area, model values should be 0, or absent.
  3. Mask model: (not shown) models may also need to be masked to 1) a drawn polygon; 2) a user provided shapefile; 3) a user-provided list of land covers. For A. seniculus, we clipped first by a user-drawn map and later by national boundaries (as we were not interested in predictions beyond Colombia; not shown). Later (not shown), we clipped that map using a list of land covers suitable for the species, as defined by experts, using a national land cover map.

3.5 Example

By setting the pathway to where you saved our canned datasets, access that data by setting that as the working directory.

library(maskRangeR)
library(lubridate)
library(raster)
library(rgdal)
dataDir = paste0(path, '/dataDriven/Alouatta/')

Read in the original SDM and the expert polygon.

original = stack(paste0(dataDir, 'Alouatta_seniculus_0_mx.tif'))
expertAddedPoly = rgdal::readOGR(paste0(dataDir, 'Alouatta_seniculus_add_HO.shp'))

Plot the original SDM.

par(mfrow = c(1,1))
plot(original, main = "Range map")

Add the expert polygon to the SDM.

expertAddRaster <- rasterize(expertAddedPoly,original, 1)
expertAddRaster[is.na(expertAddRaster)] <- 0
final = original+expertAddRaster
# remove overlap
final[final>1] <- 1

Plot the new distribution model with the expert region added.

plot(final, main = "Range map with \n expert area added")
plot(expertAddedPoly,add = T)

Calculate the difference in range size.

cellStats(original,sum)
## [1] 5867203
cellStats(final,sum)
## [1] 5875066

If you were combining this step with a masking step in the previous example, it might be helpful to organize the output the same as all maskRanger() outputs. This follows the convention that the output is a stack with the refined distribution, the intial distribution, and any layers used to add or subtract regions to obtain the refined distribution.

output=stack(final,original,expertAddRaster)
names(output)=c("refinedDist", "initialDist","expertAdded")

4 Expert Driven: Multiple expert masks

4.1 Issue

Some species may not have published occurrence records from which to define environmental tolerances. In these cases, multiple environmental layers are thresholded to the observed tolerances for the species. A range map is then masked by the tolerances, thereby reducing the area of the range map to only those areas suitable based on all environmental data.

4.2 Needs

These analyses require the following data:

  1. Environmental layers: These need to be layers deemed important to the species’ biology.
  2. Environmental tolerances: These tolerances are used to threshold the environmental layers.
  3. A range map: This map is generally a simplified overestimation of the range of the species and will therefore be reduced through a series of masks.

4.3 Background

The “swamp forest crab” (Parathelphusa reticulata) is a freshwater crab endemic to the Nee Soon Swamp Forest, Singapore’s last remaining patch of freshwater swamp forest, and is currently listed as Critically Endangered by the IUCN. Like many freshwater crabs of the Southeast Asia, P. reticulata is primarily a scavenger that feeds on coarse organic matter, such as leaf litter. Given its endemic and endangered status, the ideal habitat parameters in which this species is found are highly restricted. However, available distribution information is quite limited: there is only one locality for this species available in GBIF, and the IUCN range map is a coarse overestimate. According to expert opinion, the environmental variables most closely associated with this species’ geographic range include distance to forest streams/swamps, pH, dissolved oxygen, temperature, elevation, and canopy cover. However, building a niche-based distribution model is not feasible due to considerable data and sample size limitations that likely violate the assumptions of the approach; the available high quality georeferenced occurrence data (N=1) do not capture the range of habitats occupied by the species.

4.4 Methods

In this use case, we will perform an expert-driven species range estimate using a series of informed masks that constrain the coarse IUCN range map to a more realistic area of suitable environmental conditions. Ideally, we would then constrain the range to: areas with aquatic pH ranging from 5.17-5.79, dissolved oxygen from 27.2-36.6%, daytime air temperature 26-34˚C, elevation 16-26m, and canopy cover 75-100%. These tolerances were developed following Chua et al. (2015)[1], a study that conducted fine-scale sampling in two of the nature reserves in Singapore known to have swamp forest crabs and analyzed the relationship between crab communities and environmental variables. However, modeling natural systems is often hindered by lack of data availability. Due to the species’ rarity, the occurrence data are not publicly available. Similarly, high resolution raster products representing all of these environmental variables may not exist or not be available. Because of these limitations, we did not include several layers in our analyses. In this example, we did not provide data pertaining to dissolved oxygen, aquatic pH, or standing or moving waterbodies.

However several of the environmental correlates described above are provided. Data layers include Mean Annual Temperature data from worldclim.org [2] as a substitute of daytime air temperature, elevation from LPDAAC [3], canopy cover from Hansen et al., 2013 [4]. Resolutions differ between layers, so rasters were either resampled or dis/aggregated to match the mask.

4.5 References

[1]Chua KWJ, Ng DJJ, Zeng Y, Yeo DCJ. 2015. Habitat characteristics of tropical rainforest freshwater crabs in Singapore. Journal of Crustacean Biology 35(4): 533-539. https://academic.oup.com/jcb/article/35/4/533/2547833
[2] Hijmans, R. J., Cameron, S. E., Parra, J. L., Jones, P. G., & Jarvis, A. (2005). Very high resolution interpolated climate surfaces for global land areas. International journal of climatology, 25(15), 1965-1978.

[3] NASA/METI/AIST/Japan Space Systems. (2009). ASTER global digital elevation model [data set]. NASA EOSDIS Land Processes DAAC.
[4] Hansen, M. C., Potapov, P. V., Moore, R., Hancher, M., Turubanova, S. A. A., Tyukavina, A., … & Kommareddy, A. (2013). High-resolution global maps of 21st-century forest cover change. science, 342(6160), 850-853.

4.6 Example

By setting the pathway to where you saved our canned datasets, access that data by setting that as the working directory.

library(maskRangeR)
library(raster)
library(rgdal)
## rgdal: version: 1.5-17, (SVN revision 1070)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.0.4, released 2020/01/28
## Path to GDAL shared files: C:/Users/pgalante/Documents/R/win-library/4.0/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 6.3.1, February 10th, 2020, [PJ_VERSION: 631]
## Path to PROJ shared files: C:/Users/pgalante/Documents/R/win-library/4.0/rgdal/proj
## Linking to sp version:1.4-4
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
dataDir = paste0(path, '/expertDriven/swampForestCrab/')

Read in the expert map and the environmental layers.

expertMap <- readOGR(paste0(dataDir,'IUCNshape/data_0.shp'))

#masks
maskListRaw <- list(treeCover = raster(paste0(dataDir,'/Hansen_percent_treecover.tif')), 
                    dem = raster(paste0(dataDir,'/DEM.tif')),
                    mat = raster(paste0(dataDir,'/MATsingapore.tif')))

# define the limits for each mask
maskBounds <- read.csv(paste0(dataDir,'/crabInfo.csv'))

Conduct data cleaning to fix names and units.

# Divide worldclim values for temperature by 10
maskListRaw$mat <- maskListRaw$mat / 10
# rename the limits to match the raster names in maskListRaw
maskBounds$Layer <- c('treeCover','dem', 'mat')

Crop rasters to the same extent as the expert map, since we do not need to build masks outside the expert map.

crt <- maskRangeR::cropResampleTrim(expertMap,maskListRaw)
## Resampling, which can be a little slow...

Create each mask, and mask the expert map.

expertRaster <- crt$expertRaster
maskStack <- crt$maskStack
out <- lotsOfMasks(expertRaster,maskStack,maskBounds)

Plot the realized distribution, along with each of masks.

plot(out)

We can zoom in to visualize what is left of the realized distribution after the masking process. The black polygon represents the original expert map and red locations indicate those that remain as suitable habitat after the masks have been applied.

plot(out$initialDist,col='grey')
plot(out$refinedDist,col = 'red',add=T)

Calculate the difference in range size.

cellStats(expertRaster,sum)
## [1] 32750
cellStats(out$refinedDist,sum)
## [1] 224

5 Data Driven: Masking by biotic interactions

5.1 Issue

In addition to abiotic factors such as forest cover, temperature, elevation, etc., species’ distributions are also affected by biotic interactions [1]. For example, putative competitor species often have adjacent but non-overlapping, i.e. parapatric, ranges. In these situations, range estimates based on predictive models may overestimate ranges if the distribution of competitor species is not taken into account. To create more realistic models, we can post-process SDMs for each species to consider biotic constraints, e.g. the presence of a supposed competitor.

5.2 Needs

These analyses require the following data:

  1. Occurrence points: georeferenced records for each species.
  2. SDMs for each species: If using a hybrid support vector machine (SVM), i.e. a classifier that accounts for the predicted suitability of habitat from the SDMs, SDM rasters of habitat suitability predictions for each species. The SDMs need to be projected to the same region and the resulting rasters need to have the same extent.

5.3 Background

The three-toed sloths (Bradypus spp.) comprise four species distributed across Central and South America. With the exception of B. pygmaeus, which is restricted to Isla Escudos de Veraguas, the other three Bradypus species range across mainland Central and South America: B. variegatus is widely distributed from eastern Honduras to northern Argentina, B. tridactylus is found in the Guiana shield region, and B. torquatus occurs only in the Brazilian Atlantic Forest. B. variegatus and B. tridactylus exhibit parapatry, which is likely caused by competition for resources and/or suitable habitat [2]. This situation of congeneric species replacing each other across space is common. B. variegatus and B. torquatus, on the other hand, are sympatric in regions of the latter’s habitat in the Brazilian Atlantic Forest [3]. This analysis delimits the distributions of these three species by masking out regions predicted to be more suitable for the other two species.

5.4 Methods

2.5 arc-minute Bioclimatic-variable climate-based Maxent SDMs were tuned for each species using occurrence data from literature thinned to a distance of 40 km. The SDMs were trained on a background region of 4-degree buffered circles around the thinned occurrence points and projected to a 2-degree bounding box around occurrence records.

After obtaining SDMs for each of the three species, we used the process below to create support vector machines (SVMs) to classify areas predicted to be more suitable for each of the three species. SVMs can be either spatial, using only the occurrence records of each species as predictors, or hybrid, using the habitat suitability from the SDMs as predictors in addition to the occurrence records. These SVMs are then used to mask the SDMs of each species to regions not predicted to be occupied by the other two species.

5.5 References

[1] Anderson, R.P. (2017) When and how should biotic interactions be considered in models of species niches and distributions? Journal of Biogeography, 44(1), 8-17.

[2] de Moraes-Barros, N., Giorgi (2010) Reevaluation of the Geographical Distribution of Bradypus tridactylus Linnaeus, 1758 and B. variegatus Schinz, 1825. Edentata, 11(1), 53-61.

[3] Hirsch, A., Chirello, A.G. (2012) The endangered maned sloth Bradypus torquatus of the Brazilian Atlantic forest: a review and update of geographical distribution and habitat preferences. Mammal Review, 42(1), 35-54.

5.6 Example

By setting the pathway to where you saved our canned datasets, access that data by setting that as the working directory.

library(maskRangeR)
library(raster)
library(maptools)
## Checking rgeos availability: TRUE

Read in the occurrence data and the SDMs for each species. Note that the SDMs must be projected to the same region for input into the hybrid SVMs. For the final step of masking, however, apply the mask to the SDM projected only to the region of biological interest for each species.

dataDir = paste0(path, '/dataDriven/bradypus/')

# Species occurrence coordinates
variegatus <- read.csv(paste0(dataDir,'variegatus_occ.csv'))
tridactylus <- read.csv(paste0(dataDir,'tridactylus_occ.csv'))
torquatus <- read.csv(paste0(dataDir,'torquatus_occ.csv'))

# SDMs projected to the same extent
var_sdm <- raster(paste0(dataDir,'variegatus_sdm.tif'))
tri_sdm <- raster(paste0(dataDir,'tridactylus_sdm.tif'))
tor_sdm <- raster(paste0(dataDir,'torquatus_sdm.tif'))

# Crop tridactyus SDM to a 2-degree bounding box
ext_tri <- extent(c(min(tridactylus$longitude)-2, max(tridactylus$longitude)+2, min(tridactylus$latitude)-2, max(tridactylus$latitude)+2))
tri_sdm_mask <- mask(tri_sdm, as(ext_tri, "SpatialPolygons"))

# Crop torquatus SDM to a 2-degree bounding box
ext_tor <- extent(c(min(torquatus$longitude)-2, max(torquatus$longitude)+2, min(torquatus$latitude)-2, max(torquatus$latitude)+2))
tor_sdm_mask <- mask(tor_sdm, as(ext_tor, "SpatialPolygons"))

The following are the SDMs for each species, projected to the same extent and visualized on a map of South America:

# Get world map
data("wrld_simpl")
# Create list of South American countries for visualization
country_list <- c("Honduras", "Nicaragua", "Costa Rica", "Panama", "Colombia",
                  "Ecuador", "Peru", "Bolivia", "Chile", "Argentina", "Uruguay",
                  "Paraguay", "Brazil", "French Guiana", "Suriname", "Guyana",
                  "Venezuela")

# Create 3-panel figure
par(mfrow = c(1,3))

# Plot B. variegatus SDM
plot(var_sdm, xlim = c(-90, -30), ylim = c(-27, 16))
plot(wrld_simpl[wrld_simpl$NAME %in% country_list,], add = TRUE)
# Plot B. tridactylus SDM
plot(tri_sdm_mask, xlim = c(-70, -49), ylim = c(-6, 11))
plot(wrld_simpl[wrld_simpl$NAME %in% country_list,], add = TRUE)
# Plot B. torquatus SDM
plot(tor_sdm_mask, xlim = c(-46, -34), ylim = c(-24, -8))
plot(wrld_simpl[wrld_simpl$NAME %in% country_list,], add = TRUE)

5.6.1 Spatial SVMs

Spatial SVMs are created using a classifier trained with the occurrence points of each species, without accounting for habitat suitability according to the SDM.

Tune the model and use it to predict ranges for each of the three species.

set.seed(1234) # to reproduce our exact results
# Create SVM
svmSP <- rangeSVM(variegatus[,2:3], tridactylus[,2:3], 
                  torquatus[,2:3], nrep = 8, mc.cores=mc.cores)

# Use SVM to create a raster of predicted regions
sloth_svmSP <- rangeSVM_predict(svm = svmSP, r = var_sdm)

Plot the SVM results. The grey areas indicate regions more suitable for B. variegatus, the yellow areas indicate regions more suitable for B. tridactylus, and the green areas indicate regions more suitable for B. torquatus. Black points show the occurrences of B. variegatus, red show B. tridactylus, and blue show B. torquatus.

plot(sloth_svmSP, col=c("yellow","pink","lightblue"))
points(variegatus[,2:3], pch = 20, cex = 0.75, col = "orange")
points(tridactylus[,2:3], pch = 20, cex = 0.75, col = "red")
points(torquatus[,2:3], pch = 20, cex = 0.75, col = "blue")

We can use this SVM as a mask over our original SDM predictions.

# masked SDM predictions for variegatus
var_svmSP <- sloth_svmSP == 1
var_svmSP[var_svmSP == 0] <- NA
var_svmSP_mask <- mask(var_sdm, var_svmSP)
out.var=stack(var_svmSP_mask,var_sdm,var_svmSP)
names(out.var)=c("refinedDist", "initialDist","var_svmSpMask")

# masked SDM predictions for tridactylus
tri_svmSP <- sloth_svmSP == 2
tri_svmSP[tri_svmSP == 0] <- NA
tri_svmSP_mask <- mask(tri_sdm_mask, tri_svmSP)
out.tri=stack(tri_svmSP_mask,tri_sdm,tri_svmSP)
names(out.tri)=c("refinedDist", "initialDist","tri_svmSpMask")

# masked SDM predictions for torquatus
tor_svmSP <- sloth_svmSP == 3
tor_svmSP[tor_svmSP == 0] <- NA
tor_svmSP_mask <- mask(tor_sdm_mask, tor_svmSP)
out.tor=stack(tor_svmSP_mask,tor_sdm,tor_svmSP)
names(out.tor)=c("refinedDist", "initialDist","tor_svmSpMask")

Plot the predicted realized distributions for each species.

# Create 3-panel figure
par(mfrow = c(1,3))

plot(var_svmSP_mask, xlim = c(-90, -30), ylim = c(-27, 16))
plot(wrld_simpl[wrld_simpl$NAME %in% country_list,], add = TRUE)
plot(tri_svmSP_mask, xlim = c(-70, -49), ylim = c(-6, 11))
plot(wrld_simpl[wrld_simpl$NAME %in% country_list,], add = TRUE)
plot(tor_svmSP_mask, xlim = c(-46, -34), ylim = c(-24, -8))
plot(wrld_simpl[wrld_simpl$NAME %in% country_list,], add = TRUE)

5.6.2 Hybrid SVMs

Hybrid SVMs take into account the occurrence points of all species as well as the environmental suitability predicted by the species’ SDMs.

set.seed(1234)
# Create SVM: we used a few more reps here to be sure a set of optimal parameters was found
svmHYB <- rangeSVM(variegatus[,2:3], tridactylus[,2:3], torquatus[,2:3], 
                   sdm = raster::stack(var_sdm, tri_sdm, tor_sdm), 
                   nrep = 8, mc.cores=mc.cores)

# Use SVM to create a raster of predicted regions
sloth_svmHYB <- rangeSVM_predict(svm = svmHYB, r = var_sdm, 
                                 sdm = raster::stack(var_sdm, tri_sdm, tor_sdm))

Plot the SVM results.

plot(sloth_svmHYB, col=c("yellow","pink","lightblue"))
points(variegatus[,2:3], pch = 20, cex = 0.75, col = "orange")
points(tridactylus[,2:3], pch = 20, cex = 0.75, col = "red")
points(torquatus[,2:3], pch = 20, cex = 0.75, col = "blue")

Use the hybrid SVM as a mask over our SDM predictions.

# masked SDM predictions for variegatus
var_svmHYB <- sloth_svmHYB == 1
var_svmHYB[var_svmHYB == 0] <- NA
var_svmHYB_mask <- mask(var_sdm, var_svmHYB)

# masked SDM predictions for tridactylus
tri_svmHYB <- sloth_svmHYB == 2
tri_svmHYB[tri_svmHYB == 0] <- NA
tri_svmHYB_mask <- mask(tri_sdm_mask, tri_svmHYB)

# masked SDM predictions for torquatus
tor_svmHYB <- sloth_svmHYB == 3
tor_svmHYB[tor_svmHYB == 0] <- NA
tor_svmHYB_mask <- mask(tor_sdm_mask, tor_svmHYB)

Plot the predicted realized distributions for each species.

# Create 3-panel figure
par(mfrow = c(1,3))

plot(var_svmHYB_mask, xlim = c(-90, -30), ylim = c(-27, 16))
plot(wrld_simpl[wrld_simpl$NAME %in% country_list,], add = TRUE)
plot(tri_svmHYB_mask, xlim = c(-70, -49), ylim = c(-6, 11))
plot(wrld_simpl[wrld_simpl$NAME %in% country_list,], add = TRUE)
plot(tor_svmHYB_mask, xlim = c(-46, -34), ylim = c(-24, -8))
plot(wrld_simpl[wrld_simpl$NAME %in% country_list,], add = TRUE)

After creating masks such as these based on biotic factors, one could further mask by forest cover, accessible habitat, or other factors.

6 Sensitivity checking

6.0.1 Issue

Thresholds used for masking may not be optimally suited for your species. Testing the effects that different thresholds may have on the predicted area can allow you to select the best threshold, manually.

6.0.2 Needs

Sensitivity analysis requires the following data: 1. Potential thresholds to use 2. Outputs from previous funtions in this vignette

6.0.3 Methods

  1. thresholdSensitivity: Testing the sensitivity of using different thresholds on the mask layer. For example, using a range of forest cover percentages to examine the differences in predicted area.
  2. manyMaskSensitivity: Tests the sensitivity of using different masks and different combinations of masks. For example, testing all combinations of the layers.

6.1 Example 1

6.1.1 Threshold sensitivity

Set the pathway to where you saved our canned datasets.

library(raster)
library(maskRangeR)
library(lubridate)
## Remember that the 'path' is the directory in which you saved the demonstration data 
dataDir = paste0(path, '/dataDriven/olinguito/') 
dataDir = "C:/Users/pgalante/Git/maskRangeR"

Read in the environmental and occurrence data.

dateScale <- "year" # or "month", "day"
env <- stack(list.files(path = paste0(dataDir), pattern = "olinguito_Modis.tif$", full.names = T))
# this should be a formal date object of class "POSIXct" "POSIXt"
envDates <- parse_date_time((c('2005','2006','2008','2009','2010')),orders = c("Y", "Ym"))
datedOccs <- read.csv(paste0(dataDir,'All_new_records_by_year.csv'))
sdm <- raster(paste0(dataDir,'olinguito_SDM.tif'))
# crop climate data to study region
env <- crop(env, extent(sdm))

Prepare the data.

# convert dates to formal date objects
datedOccs$date <- parse_date_time(datedOccs$date,orders = c("Y", "Ym"))
# convert to spatial object
coordinates(datedOccs) <- c('long','lat')
projection(datedOccs) <- projection(env)

To perform data-driven masking, first find the values of the environmental variable (forest cover in this example) at the occurrence point locations.

datedOccs <- maskRangeR::annotate(datedOccs,env,envDates,dateScale)
## Environmental layers were missing for date 2013
## Environmental layers were missing for date 2015
## Environmental layers were missing for date 2016
# Note the warnings:
# [1] "Environmental layers were missing for date 2013"
# [1] "Environmental layers were missing for date 2016"
# [1] "Environmental layers were missing for date 2015"
# These indicate that you have occurrencee records that do not match some environmental data.
# You may want to look into this to determine whether you expected to provide environmental layers associated with these dates. But if you didn't mean to provide such layers (e.g., here we did not provide layers fro 2013, 2015, and 2016) it is not an issue for using the package.

Then find a suitable lower limit for the mask based on the data.

## Take a look at the forest cover values at each occurrence point
sort(datedOccs$env)
## [1]  2.959184 46.836735 59.836735 62.653061 63.632652 71.000000
## Use quantiles to find a suitable lower limit
(bounds <- quantile(datedOccs$env,prob=c(0,.025,.25,.5,.75,.975,1),na.rm=T))
##        0%      2.5%       25%       50%       75%     97.5%      100% 
##  2.959184  8.443878 50.086735 61.244898 63.387754 70.079082 71.000000

Note that here, there is an occurrence record showing 4% forest cover. This occurrence record comes from a roadkilled specimen, so it is likely not representative of necessary forest cover. To simplify for this vignette, we use the 25% observed forest cover as the lower limit.

Create the mask and mask the potential distribution using the most recent year’s forest cover.

logicString <- paste0('maskLayers>',quantile(datedOccs$env,prob=.25,na.rm=T))
forest <- env[[5]]
names(forest) <- 'forest'
maskedDist <- maskRanger(initialDist=sdm,maskLayers=forest,
                                     logicString=logicString)

thresholdSensitivity() runs the ‘maskRangeR()’ function over a series of thresholds and plots these as well as the areas.

ThreshSens <- thresholdSensitivity(datedOccs = datedOccs, maskLayer=forest, maskClass = "top5", sdm = sdm)
threshold.stack <- ThreshSens$Sensitivity_Rasters
e <- raster::extent(c(-79.76875, -74.62083, -3.59375, 8.072917))
threshold.stack <- raster::crop(threshold.stack, e)

Take a look at the SDMs when each of the thresholds of interest are applied.

par(mfrow = c(3,2))
plot(threshold.stack[[1]])
plot(threshold.stack[[2]])
plot(threshold.stack[[3]])
plot(threshold.stack[[4]])
plot(threshold.stack[[5]])

Plot how the area of the SDM can be affected by the threshold used

sensitivityAreas <- ThreshSens$Areas
# set a color palette
colPal <- grDevices::rainbow(5)
# Organize the threshold values
#nums <- gsub(".*= ", "", ThreshSens$Areas)
#nums <- gsub( ").*$", "", nums)
nums <- ThreshSens$Areas$sensitivityAreas
nums <- format(round(unlist(lapply(nums, function(x) as.numeric(as.character(x)))), 3))

# plot
graphics::plot(rev(ThreshSens$Areas$maskValues), ThreshSens$Areas$sensitivityAreas, ylab = "Area (square km)", xlab = "Mask values", main = "Mask Threshold Area Sensitivity", type= "l")
graphics::points(rev(ThreshSens$Areas$maskValues), ThreshSens$Areas$sensitivityAreas, ylab = "Area (square km)", xlab = "Mask values", main = "Mask Threshold Area Sensitivity", pch = c(15, 13, 17,18,19))
#graphics::axis(labels=NA,side=1,tck=-0.015,at=maskValues)
graphics::text(rev(ThreshSens$Areas$maskValues), ThreshSens$Areas$sensitivityAreas, labels = nums, cex= 0.5, pos=1, offset = 0.09)
# Add a legend
graphics::legend(x = "bottomleft", legend = substr(names(ThreshSens$Sensitivity_Rasters), 1, 17), pch = c(15, 13, 17, 18, 19), cex=1)

6.2 Example 2

6.2.1 Masks sensitivity

Set the pathway to where you saved our canned datasets.

library(maskRangeR)
library(raster)
library(rgdal)
dataDir = paste0(path, '/expertDriven/swampForestCrab/')

Read in the expert map and the environmental layers.

expertMap <- readOGR(paste0(dataDir,'IUCNshape/data_0.shp'))

#CM: pH and Water bounds are mutually exclusive, so we can only choose one, otherwise there is no suitable habitat. as I understand it, the soil grids shouldn't be taken too seriously at fine grain
#masks
maskListRaw <- list(treeCover = raster(paste0(dataDir,'/Hansen_percent_treecover.tif')), 
                    dem = raster(paste0(dataDir,'/DEM.tif')),
                    mat = raster(paste0(dataDir,'/MATsingapore.tif')))
# define the limits for each mask
maskBounds <- read.csv(paste0(dataDir,'/crabInfo.csv'))

Conduct data cleaning to fix names and units.

# rename the limits to match the raster names in maskListRaw
maskBounds$Layer <- c('treeCover','dem','mat')
# make sure the units match between the layers and the bounds
maskListRaw$mat <- maskListRaw$mat/10

Crop rasters to the same extent as the expert map, since we do not need to build masks outside the expert map.

crt <- maskRangeR::cropResampleTrim(expertMap,maskListRaw)
## Resampling, which can be a little slow...

Create each mask, and mask the expert map.

expertRaster <- crt$expertRaster
maskStack <- crt$maskStack

manyMaskSensitivity compares the predicted areas from all possible combinations of predictor variables.

manyMaskSensitivity(crt, maskBounds = maskBounds, expertRaster = expertRaster)

##                           Area          Layer
## c("treeCover", "mat") 49.97131 treeCover, mat
## c("treeCover", "dem") 86.82694 treeCover, dem
## c("dem", "mat")       238.5139       dem, mat
## treeCover             320.3625      treeCover
## dem                   377.1952            dem
## mat                   661.3509            mat