Make sure you have the following R packages installed
library(igraph)
library(vegan)
library(raster)
library(rworldmap)
library(betapart)
library(RColorBrewer)
library(stringr)
library(maps)
# function to colorize https://gist.github.com/fgabriel1891/2fb219c376ec3e6b33402dc099a7e294
f <- function(x,n=10, pal, rev = F){
if(rev == F){
rev(RColorBrewer::brewer.pal(n, pal))[cut(x,n)]
}else{
(RColorBrewer::brewer.pal(n, pal))[cut(x,n)]
}
}
Remember that you can install packages in R with install.packages("package name")
The characterization of geographical areas in terms of biodiversity.
Areas with distinct evolutionary stories.
Comprehensive understanding of the drivers of biological organization
Since early societies, people have been interested to study the distribution of life in the planet. In the illustration age, Carl Linnaeus and other natural historians like Humboldt, Wallace, and Darwin sistematized the study of biodiversity distribution. Since then, centuries of scientific research produced invaluable information on biodiversity. Nowadays, this corpus of biodiversity knowledge is being aggregated into large-scale datasets which are freely available in the internet. For example, digital information infrastructures such as the Global Biodiversity Information Facility (GBIF) increases the accesiblity of data on the distribution of life across the planet. These, digitally available data have various origins such as biodiversity inventories, field research reports, field observations, citizen science, etc.
There are two main ways how the geographic distribution of species is digitally stored in databases.
This is a simple format to store information on species distribution. At the minimal level a geocoded point involves tree variables which are the 2 spatial coordinates and the species’ (or other taxonomic category) name. This format is probably the most common way which information in the geography of biodiversity is stored in the globe. This simple format make it easy to store and takes little space in memory. However, relying only on geocoded points has is limitations. For instance the spatial patterns can be strongly because of sampling effort. For poorly sampled species, a single new observation can dramatically increase the extent of the species geographic distribution. Similarly, the sparsity of observations may influence community similarity measures. This problem is particularly accentuated in biodiversity rich and/or poorly sampled regions of the globe such as the tropics, mountains, or near the poles.
The delineation of species geographical ranges is second commonly used format to store information on species distribution. Instead of points, species ranges represent a portion of any representation of the geographical surface of the planet. The delineation is often done based the information on species point data + climate models + expert knowledge. Digital information of species ranges usually come stored as vector polygons or rasters. For instance, a common format is the convex hull, which is the smallest set in space that includes all the species points. More sophisticated methods, such as species distribution modelling uses climatic variation to infer species environmental niches by modelling probabilities of species occurrences in space. Conservation agencies such as the IUCN use this species distribution models along taxonomic and local experts to delimit a more continuous surface to represent the spatial range of a species. Species ranges hold more information than geocoded points at the species level, however biodiversity information at this level of detail is only available for a limited number of taxa.
x | y | sp |
---|---|---|
36 | -15 | Sp 5 |
35 | -22 | Sp 4 |
23 | -6 | Sp 2 |
32 | -10 | Sp 2 |
5 | -36 | Sp 2 |
7 | -32 | Sp 5 |
31 | -14 | Sp 2 |
14 | -23 | Sp 4 |
39 | -7 | Sp 4 |
29 | -13 | Sp 2 |
The region of interest corresponds to the space encompassing the sum of the biodiversity total variance. The ROI is the space which we are interested to classify into definable regions based on its biodiversity.
Depending of the research question or because information limitation, researchers often limit the region of interest to an extent. Boundaries for a region of interest can be placed in geographical space, climatic space, across a temporal axis, and to particular taxonomic groups.
This matrix holds information on the presence/absence or relative abundances of a set of species throughout a series of localities (sites). In our tutorial, the sites correspond to each of the 1x1 degree grids. For this tutorial we will focus only on the aspect of biodiversity change that comes from the variance of species presences and absences in the ROI. Therefore, entries in the species-site matrix are coded as 1 if a species \(j\) is present at a grid \(n\), and 0 otherwise. Row vectors of this matrix represent the spatial variation of each species across the ROI. Column vectors represent the composition of a site in terms of species.
\[\begin{bmatrix} & site1 & site2 & ... & site_n \\Sp1 & 1 & 0 & ...& 1 \\Sp2 & 0 & 1 & ...& 1 \\Sp3 & 1 & 1 & ...& 0 \\Sp4 & 1 & 0 & ...& 1 \\\vdots & \vdots & \vdots & \vdots & \vdots \\Sp_j & 1 & 1 & 1 & 0 \end{bmatrix}\]
Resolution and scale is important to consider before we start to examine biodiversity variation in space. Importantly, because of data limitation at larger scales often is necessary to aggregate point pattern data. The chosen grid resolution both at the area and limits can influence the overall patterns we may be able to identify.
Discussion point:
How different choices of grid resolution affect our observations of spatial patterns in biodiversity?
What is the relationship between resolution and scale?
This differences in species composition is commonly known as species turnover and it is a separate component (ß) of biodiversity. We will compute pairwise dissimilarities in grid species composition. Pairwise dissimilarities in species composition among communities are in function of the number of shared species and the sum of the species diversity of a pairwise set of communities. There are several metrics of community dissimilarity. I encourage you to explore that further in your own. In this tutorial we will use the Simpson dissimilarity index as a measure of species turnover.
\[ S_i = \frac{a}{a + min(b,c)} \]
being \(a\) the number of shared species and \(min(b,c)\) the minimum number of species between two communities.
There are several advantages of Simpson dissimilarity index relative to other indexes. For instance Simpsons uses only the poorest community in the denominator to account for large differences in richness between communities. This means that the index is not biased by strong richness differences between sites, therefore removing the contribution of nestedness to the pairwise dissimilarities
High nestedness
\[\begin{bmatrix} & site1 & site2 & site_3 \\Sp1 & 1 & 1 & 1 \\Sp2 & 1 & 1 & 1 \\Sp3 & 1 & 1 & 0 \\Sp4 & 1 & 1 & 0 \\Sp5 & 1 & 1 & 0 \\Sp6 & 1 & 0 & 0 \\Sp7 & 1 & 0 & 0 \\Sp8 & 1 & 0 & 0 \\Sp9 & 1 & 0 & 0 \end{bmatrix}\]
High turnover
\[\begin{bmatrix} & site1 & site2 & site_3 \\Sp1 & 1 & 1 & 1 \\Sp2 & 1 & 1 & 1 \\Sp3 & 1 & 1 & 0 \\Sp4 & 1 & 1 & 0 \\Sp5 & 0 & 1 & 0 \\Sp6 & 0 & 1 & 0 \\Sp7 & 0 & 1 & 0 \\Sp8 & 0 & 0 & 1 \\Sp9 & 0 & 0 & 1 \end{bmatrix}\]
Biogeographic regions are defined by the variance of the species dissimilarities among sites. There are two main ways how we can approach this problem of linearizing multidimensional variation in community dissimilarities to define biogeographic regions. We can classify or communities into discrete categories or continuously ordinate them in fewer multivariate dimensions.
K-means is a fast clustering algorithm. The k-means algorithm searches to partition the variance of a matrix into groups of k clusters centered around k means.
This is common methodology for un-supervised classification tasks. That is, we let the data to tell us the best partition into separate classes.
Standardizing distances
K-means algorithm assumes euclidean distances to compute k clusters. Simpson dissimilarity index index does not project linearly into euclidean space. The Hellinger standardization is a workaround which let’s us use euclidean metrics with community dissimilarity metrics.
\[y_{i,j} = \sqrt{\frac{y_{i,j}}{y_i}}\]
The main limitation of the k-means algorithm is that the cluster association vector is always dependent on the values of k. How do we know the best k then?. This therefore becomes an optimization problem for the value of k.
A solution to find the best k is to iterate the k-means algorithm for a series of k within a range of reasonable estimate for the number of clusters given our community matrix. (e.g k << S or k < S not k = S or k > S)
We optimize the ratio between the variance within k given clusters in relation with the total amount of variation in the matrix. We select the k parameter which the sum of squares of the distances between k clusters (i.e. the distances of the k-centroids to its mean centroid) is closest to the sum of squares of the whole matrix.
Dimensionality reduction comprise various statistical techniques which looks to fit as much variance possible into the first few of a set of orthogonal linear multivariate vectors. There are various parametric techniques for dimensionality reduction, being Principal component analysis (PCA) among the most widely use. Non-parametric techniques also exist such as the Non-Metric Multidimensional (NMDS), which finds the most parsimonious arrangement into planar space that preserve the community wide pattern of pairwise distances among entries in a matrix.
Principal component analysis
http://ordination.okstate.edu/overview.htm
https://wiki.qcbs.ca/r_workshop9
Non-metric multidimensional scaling NMDS:
The species-site matrix can also be conceptualized as a bipartite graph. In this graph, Vertices (nodes) are either species or sites. Edges (links) connect vertices based on the occurrence of a given species in a site. This is called a bipartite graph because edges connect nodes that belong to a different set. That is, edges in this graph exist between species and sites but not among species, nor sites. A graph representation opens up various possibilities to examine the structure of species turnover across space.
Incidency matrix
\[ G = \begin{bmatrix} & site1 & site2 & site3 & site_3 \\Sp1 & 1 & 0 & 1& 1 \\Sp2 & 0 & 1 & 0& 1 \\Sp3 & 1 & 1 & 1& 0 \\Sp4 & 1 & 0 & 1 & 1 \\Sp_5 & 1 & 1 & 1 & 0 \end{bmatrix}\]
Bipartite graph
## Warning in igraph::layout.bipartite(grah): vertex types converted to logical
Adjacency matrix = A
Species1 | Species3 | Species4 | Species5 | Species2 | Site1 | Site2 | Site3 | Site4 | |
---|---|---|---|---|---|---|---|---|---|
Species1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
Species3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
Species4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
Species5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
Species2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
Site1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
Site2 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
Site3 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
Site4 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
Degree matrix = D
Species1 | Species2 | Species3 | Species4 | Site1 | Site2 | Site3 | Site4 | Site5 | |
---|---|---|---|---|---|---|---|---|---|
Site1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Site2 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Site3 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
Site4 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
Site5 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
Species1 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
Species2 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
Species3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
Species4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
Laplacian matrix = L
\[ L = D-A \]
Species1 | Species3 | Species4 | Species5 | Species2 | Site1 | Site2 | Site3 | Site4 | |
---|---|---|---|---|---|---|---|---|---|
Species1 | 3 | 0 | 0 | 0 | 0 | -1 | 0 | -1 | -1 |
Species3 | 0 | 3 | 0 | 0 | 0 | -1 | -1 | -1 | 0 |
Species4 | 0 | 0 | 3 | 0 | 0 | -1 | 0 | -1 | -1 |
Species5 | 0 | 0 | 0 | 3 | 0 | -1 | -1 | -1 | 0 |
Species2 | 0 | 0 | 0 | 0 | 2 | 0 | -1 | 0 | -1 |
Site1 | -1 | -1 | -1 | -1 | 0 | 4 | 0 | 0 | 0 |
Site2 | 0 | -1 | 0 | -1 | -1 | 0 | 3 | 0 | 0 |
Site3 | -1 | -1 | -1 | -1 | 0 | 0 | 0 | 4 | 0 |
Site4 | -1 | 0 | -1 | 0 | -1 | 0 | 0 | 0 | 3 |
Spectral profile of a graph
\[ Av= \lambda v\]
Bryophytes are non-vascular plants that occur in many habitats across the globe. Bryophytes include liverworts, mosses, and hornworts. Bryophytes have been sampled extensively in various regions of the Atlantic and Pacific zones of Canada and the United States. For this tutorial we will quantify the spatial turnover of bryophyte communities across the US and Canada at 1ª resolution. Recently, the Beaty Museum of the University of British Columbia made available the information on their entire collection of bryophyte occurrences. The dataset is freely available through GBIF.
This dataset comes from the bryophyte collections of the University of British Columbia.
https://beatymuseum.ubc.ca/research-2/collections/herbarium/herbarium-bryophytes/
Let’s load the dataset, which is currently in the data folder stored as a .csv file.
briofCan <- read.csv("data/BriophitesCAN.csv",
header = T,
sep = "\t") # note we are specifing a tab-delimited file.
head(briofCan) # first rows
## gbifID datasetKey
## 1 1987794314 4edd9396-59df-4b01-9e29-dc21a59f9963
## 2 1987792636 4edd9396-59df-4b01-9e29-dc21a59f9963
## 3 1987792129 4edd9396-59df-4b01-9e29-dc21a59f9963
## 4 1987793965 4edd9396-59df-4b01-9e29-dc21a59f9963
## 5 1987793938 4edd9396-59df-4b01-9e29-dc21a59f9963
## 6 1987792195 4edd9396-59df-4b01-9e29-dc21a59f9963
## occurrenceID kingdom phylum
## 1 A77D8408-E621-B14B-ACC8-022FA5867C3B Plantae Bryophyta
## 2 A7856BA9-045B-D940-9EF8-86F982F61919 Plantae Bryophyta
## 3 A78F39BC-D010-F747-B245-24B9E067AFD3 Plantae Bryophyta
## 4 A796CD2C-24D6-2C46-89DD-E4A5E9B51738 Plantae Bryophyta
## 5 A798A3E3-65C6-1948-A621-B5ACA58AC3D7 Plantae Marchantiophyta
## 6 A79A61B8-6810-4243-9CA6-A7EC4872D999 Plantae Bryophyta
## class order family genus
## 1 Bryopsida Grimmiales Grimmiaceae Dryptodon
## 2 Andreaeopsida Andreaeales Andreaeaceae Andreaea
## 3 Bryopsida Grimmiales Grimmiaceae Dryptodon
## 4 Bryopsida Dicranales Dicranaceae Pilopogon
## 5 Jungermanniopsida Jungermanniales Plagiochilaceae Plagiochila
## 6 Bryopsida Pottiales Pottiaceae Hymenostylium
## species infraspecificEpithet taxonRank
## 1 Dryptodon patens SPECIES
## 2 Andreaea rupestris rupestris SUBSPECIES
## 3 Dryptodon patens SPECIES
## 4 Pilopogon guadalupensis SPECIES
## 5 Plagiochila schofieldiana SPECIES
## 6 Hymenostylium recurvirostrum SPECIES
## scientificName
## 1 Dryptodon patens Bridel, 1826
## 2 Andreaea rupestris subsp. rupestris
## 3 Dryptodon patens Bridel, 1826
## 4 Pilopogon gracilis (Hook.) Brid.
## 5 Plagiochila schofieldiana Inoue
## 6 Bryoerythrophyllum recurvirostrum (Hedw.) P.C.Chen
## verbatimScientificName
## 1 Dryptodon patens (Dicks. ex Hedw.) Brid.
## 2 Andreaea rupestris Hedw. subsp. rupestris
## 3 Dryptodon patens (Dicks. ex Hedw.) Brid.
## 4 Pilopogon gracilis (Hook.) Brid.
## 5 Plagiochila schofieldiana H. Inoue
## 6 Bryoerythrophyllum recurvirostre (Hedw.) Chen
## verbatimScientificNameAuthorship countryCode
## 1 (Dicks. ex Hedw.) Brid. CA
## 2 CA
## 3 (Dicks. ex Hedw.) Brid. US
## 4 (Hook.) Brid. PA
## 5 H. Inoue US
## 6 (Hedw.) Chen CA
## locality
## 1 Paulson Bridge on Rte 3 between Castlegar and Christina Lake
## 2 Princess Royal Island, Chapele Inlet; stream canyon near mouth, E side
## 3 Adak Quadrangle, Adak Is., Aleutian Islands, Mt. Reed; Co:
## 4 cerro colorado, 4.3 i. above chami camp. roadside along the ridge, ca 3 mi.
## 5 NW of Alexai Point, S slope of Gilbert Ridge, Attu Island, Aleutian Is.
## 6 Canoe Lake, S slope of ridge E of Canoe Lake; Co:
## stateProvince occurrenceStatus individualCount
## 1 British Columbia PRESENT NA
## 2 British Columbia PRESENT NA
## 3 Alaska PRESENT NA
## 4 Bocas Del Toro PRESENT NA
## 5 Alaska PRESENT NA
## 6 Northwest Territories PRESENT NA
## publishingOrgKey decimalLatitude decimalLongitude
## 1 b542788f-0dc2-4a2b-b652-fceced449591 49.2 -118.1
## 2 b542788f-0dc2-4a2b-b652-fceced449591 52.9 -129.1
## 3 b542788f-0dc2-4a2b-b652-fceced449591 51.8 -176.7
## 4 b542788f-0dc2-4a2b-b652-fceced449591 8.6 -81.8
## 5 b542788f-0dc2-4a2b-b652-fceced449591 52.9 173.2
## 6 b542788f-0dc2-4a2b-b652-fceced449591 68.2 -135.9
## coordinateUncertaintyInMeters coordinatePrecision elevation elevationAccuracy
## 1 7303 NA NA
## 2 6854 NA NA
## 3 7006 NA NA
## 4 11024 NA NA
## 5 6854 NA NA
## 6 4170 NA NA
## depth depthAccuracy eventDate day month year taxonKey speciesKey
## 1 NA NA 1977-08-05T00:00:00 5 8 1977 5281814 5281814
## 2 NA NA 1986-08-09T00:00:00 9 8 1986 7347162 5283962
## 3 NA NA 1986-08-06T00:00:00 6 8 1986 5281814 5281814
## 4 NA NA 1986-06-21T00:00:00 21 6 1986 8232977 2675529
## 5 NA NA 2000-08-12T00:00:00 12 8 2000 8003542 8003542
## 6 NA NA 1965-06-20T00:00:00 20 6 1965 2670498 7347186
## basisOfRecord institutionCode collectionCode catalogNumber recordNumber
## 1 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B18100 67210
## 2 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B107037 86589
## 3 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B207029 1986-187
## 4 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B110795 5341
## 5 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B185157 115694
## 6 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B81228 plot no. 31-6
## identifiedBy dateIdentified license rightsHolder
## 1 CC0_1_0 University of British Columbia
## 2 B.M. Murray,1988 CC0_1_0 University of British Columbia
## 3 W.B. Schofield CC0_1_0 University of British Columbia
## 4 CC0_1_0 University of British Columbia
## 5 W.B. Schofield CC0_1_0 University of British Columbia
## 6 O. Lee, 1985 CC0_1_0 University of British Columbia
## recordedBy typeStatus establishmentMeans
## 1 W.B. Schofield, G.F. Otto
## 2 W.B. Schofield
## 3 Stephen S. Talbot
## 4 B.H. Allen
## 5 W.B. Schofield, S.S. Talbot
## 6 J. Lambert, D. Morrison
## lastInterpreted mediaType
## 1 2020-12-15T22:17:01.559Z
## 2 2020-12-15T22:17:01.559Z
## 3 2020-12-15T22:17:01.559Z
## 4 2020-12-15T22:17:01.560Z
## 5 2020-12-15T22:17:01.560Z
## 6 2020-12-15T22:17:01.560Z
## issue
## 1 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 2 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 3 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 4 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 5 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 6 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID;TAXON_MATCH_FUZZY
## [1] 4517
## [1] 6012
Let’s observe the geographic information contained in the dataset
The raw dataset is extensive, and there are some data which is obviously misplaced.
We must have a clean version of the dataset before we start doing any analyses on it.
Let’s start by removing all fields that have no associated geographic information.
coords <- data.frame("x"=briofCan$decimalLongitude,
"y"=briofCan$decimalLatitude)
# only georeferenced records
briofCan <- briofCan[complete.cases(coords),]
Second, let’s set our ROI within the continental political boundaries of Canada and the United States of America to do some more housekeeping.
# subset coords
# only records coded to CA and US
US_CAN <- countriesLow[countriesLow$SOVEREIGNT %in% c("Canada", "United States of America"),]
# remove empty records
coords <- coords[complete.cases(coords),]
# remove areas with no geographic information
code <- over(SpatialPoints(coords,
proj4string = crs(US_CAN)
),US_CAN)$POSTAL
# add points to the dataset
coords$code <- code
# Subset the dataset based on coords
briofCanClean <- briofCan[coords$code %in% c("CA", "US"),]
# delete unusued levels and subset only based on phylum Bryophyta
briofCanClean <- droplevels(briofCanClean[briofCanClean$phylum == "Bryophyta",])
# clean hawai (to keep records only at the continental shelf)
briofCanClean <- briofCanClean[briofCanClean$decimalLatitude > 30 & briofCanClean$decimalLongitude <0,]
# How many observations are we left with?
length(unique(briofCanClean$speciesKey))
## [1] 1205
## [1] 1681
# let's plot the geographic distribution of the clean data
plot(decimalLatitude~ decimalLongitude,
xlab = "Longitude",
ylab = "Latitude",
pch = 15,
cex = 0.1,
col = "red",
data = briofCanClean,
frame = F)
maps::map(regions = c("USA", "Canada"),fill = F,
col = "black",
add = T)
These are some sources of information on planetary scale geographical limits.
For this tutorial we will define biogeographic regions using a spatial resolution of 1x1 degree. This means that we will consider all species in a grid as separate biological “communities”. It is important to discuss the ecological implications of this assumption. For instance, we can question whether the variation in grids represents accurate representations of the ecological forces defining the coexistence of species into communities. Similarly, we can question whether the biodiversity variance is homogeneous across all grids. For example, within grid variance is expected to be high in biodiversity rich areas such as the tropical rain forest in contrast to biodiversity poor regions such as desserts. For continental scale studies 1x1 degree is often a standard.
## [1] 30.18 82.50
# lets round to 1 degree... very simple way.
briofCanClean$latRound <- round(briofCanClean$decimalLatitude)
briofCanClean$lonRound <- round(briofCanClean$decimalLongitude)
# get a unique id to identify each grid
briofCanClean$unID <- paste(briofCanClean$latRound,
briofCanClean$lonRound,
sep = "_")
# Plot results
plot(briofCanClean$lonRound,
briofCanClean$latRound,
xlab = "Longitude",
ylab = "Latitude",
pch = 15,
cex = 0.4,
col = "firebrick",
frame = F)
maps::map(regions = c("USA", "Canada"),
fill = F,
col = "black",
add = T)
# obtain the residuals from a linear model of the richness as a function of effort
res <- residuals(lm(c(gridTab)~c(gridTab2)))
# reconstitude the sqared matrix with the vector of residuals
resMat <- matrix(res,
nrow = nrow(gridTab),
ncol = ncol(gridTab))
# add names the columns and rows
colnames(resMat) <- colnames(gridTab)
rownames(resMat) <- rownames(gridTab)
plotAmap <- function(gridTab, main ){
dat <- data.frame(
"div"= rowSums(gridTab),
data.frame(
apply(stringr::str_split(rownames(gridTab),
"_", simplify= T),
2,
as.numeric)
)
)
plot(dat$X2, dat$X1,
main = main,
pch = ".",
yaxt = "n",
xaxt = "n",
frame= F,
xlab = "Long",
ylab= "Lat",
col = f(dat$div, 7, "Reds", T),
cex = log(abs(dat$div)))
maps::map(regions = c("USA", "Canada"),
fill = F,
col = "black",
add = T)
}
par(mfrow = c(1,2), mar = c(0,0,4,0))
plotAmap(gridTab, main = "Richness distribution")
plotAmap(resMat, "Richness corrected by effort")
Let’s define a function to compute this Simpson similarity index and apply it to our data to obtain a dissimilarity matrix .
# Function to compute the simpson dissimilarity index among sites in our matrix and return a dissimilarity matrix
# param: SpTab = species-site matrix
getDisMat <- function(SpTab){
# How to measure similarity?
# presence-absence
SpTabPA <-SpTab
SpTabPA[SpTabPA>1] <- 1
# Simpson dissimilarity index
# Calcualate "simpson dissimilarity index" (Bsim part of sorensen dissimilarity)
part <- betapart::beta.pair(SpTabPA,
"sorensen")
# summary(part$beta.sim) ## access the portion of the object that correspond to the Simpson index of dissmilarity
distMat <- part$beta.sim
return(distMat)
}
# apply the function and get thde dissimilarity of the species site matrix
distMat <- getDisMat(gridTab)
Lets observe the variance of the dissimilarity matrix, how many clusters can you observe by eyeballing a heatmap?
# visualize single dimensions in space.
# 1) compute the variance of each grid (i.e matrix rows)
varVec <- sapply(1:nrow(as.matrix(distMat)),
function(x) var(as.matrix(distMat)[,x]))
# 2) define a function that selects the vector of variances as a sampling probability vector to sample rows from the distance matrix and plot the results as a map
plotFocalPoint <- function(distMat, varVec){
# Sample a random focal point from the distance matrix
focalPoint <- data.frame("FP"=
as.matrix(distMat)[,which(varVec == sample(varVec,1))])
# make a dataframe with the distance row-vector and the spatial coordiantes of the grids
distPlot <- data.frame(stringr::str_split(
rownames(focalPoint),
"_",
simplify = T),
"dist" = focalPoint[,1])
# add names to the dataframe
names(distPlot) <- c("lat", "lon", "val")
# make a dataframe with the values of distPlot transformed into numeric
distPlot <- data.frame(
apply(distPlot,2,as.numeric)
)
# plot results
plot(distPlot$lat~distPlot$lon,
xlim = c(-170,-50),
ylim = c(20,90),
xaxt = "n",
yaxt = "n",
xlab = "",
ylab = "",
pch = 15, cex = 0.5)
maps::map(regions = c("USA","Canada"),
fill = T,
col = "white",
add = T)
points(distPlot$lat~distPlot$lon,
pch = 16,
cex = 0.7,
col = scales::alpha(
f(log(distPlot$val+1),
9,
"YlOrRd", T),
1-log(distPlot$val+1)))
maps::map(regions = c("USA", "Canada"),
fill = F,
col = "black",
add = T)
}
# 3) Apply the function iteratively to select i=8 random focal points and plot the results
par(mfrow = c(4,2), mar = c(0,0,0,0), bg = "gray")
for(i in 1:8){
plotFocalPoint(distMat,varVec)
}
We must remember that we deal with incomplete records so there is the potential for non-ecologically relevant sources of variation. For instance, poorly sampled localities may inflate the total sum of pairwise dissimilarities, reducing our ability to disentangle patterns. There is not a magic number of grids or species to preserve but it rather depends to the study objectives, dataset characteristics, and study scale and resolution.
# standardize distances and apply k-means for a range of k.
cascadeKM <- vegan::cascadeKM(vegan::decostand(distMat2,
"hellinger"), # standardization of distances
inf.gr = 2, # lowest k tested
sup.gr = 10) # highest k tested
Lets observe the change on the sum of squares and the similar Calinski criterion after applying the k-means for a range of \(k={2,3,4,5,6,7,8,9,10}\)
KM2 <- cascadeKM$partition[,1]
KM3 <- cascadeKM$partition[,2]
KM4 <- cascadeKM$partition[,3]
KM <- data.frame(KM2,
KM3,
KM4,
apply(stringr::str_split(names(KM2),
pattern = "_",
simplify = T),
2, as.numeric)
)
head(KM)
## KM2 KM3 KM4 X1 X2
## 31_-90 1 3 3 31 -90
## 32_-106 1 3 3 32 -106
## 33_-106 1 3 3 33 -106
## 33_-107 1 3 3 33 -107
## 33_-108 1 2 2 33 -108
## 35_-83 1 3 3 35 -83
par(mfrow = c(2,2), mar = c(1,1,1,1))
plot(KM$X2, KM$X1,
col = f(KM$KM2, 2, "Accent"),
pch = 1,
xaxt = "n",
yaxt = "n",
main = "k = 2",
frame = F,
ylab = "Lat",
xlab = "Long",
cex = 0.35)
## Warning in RColorBrewer::brewer.pal(n, pal): minimal value for n is 3, returning requested palette with 3 different levels
maps::map(regions = c("USA", "Canada"),
fill = F,
col = "black",
add = T )
plot(KM$X2, KM$X1,
col = f(KM$KM3, 3, "Accent"),
pch = 1,
xaxt = "n",
yaxt = "n",
frame = F,
main = "k = 3",
ylab = "Lat",
xlab = "Long",
cex = 0.35)
maps::map(regions = c("USA", "Canada"),
fill = F,
col = "black",
add = T )
plot(KM$X2, KM$X1,
col = f(KM$KM4, 4, "Accent"),
pch = 1,
frame = F,
xaxt = "n",
yaxt = "n",
main = "k = 4",
ylab = "Lat",
xlab = "Long",
cex = 0.35)
maps::map(regions = c("USA", "Canada"),
fill = F,
col = "black",
add = T )
## Run 0 stress 0.3016674
## Run 1 stress 0.3023848
## Run 2 stress 0.3021204
## ... Procrustes: rmse 0.01913395 max resid 0.1064816
## Run 3 stress 0.302536
## Run 4 stress 0.3018606
## ... Procrustes: rmse 0.01759757 max resid 0.1179091
## Run 5 stress 0.3029357
## Run 6 stress 0.3052974
## Run 7 stress 0.3023293
## Run 8 stress 0.3021601
## ... Procrustes: rmse 0.01865829 max resid 0.113893
## Run 9 stress 0.3025779
## Run 10 stress 0.3017371
## ... Procrustes: rmse 0.01062475 max resid 0.09998755
## Run 11 stress 0.3021437
## ... Procrustes: rmse 0.01951051 max resid 0.1241724
## Run 12 stress 0.3024356
## Run 13 stress 0.3023524
## Run 14 stress 0.3021527
## ... Procrustes: rmse 0.02008749 max resid 0.09915881
## Run 15 stress 0.3028218
## Run 16 stress 0.3025918
## Run 17 stress 0.3022332
## Run 18 stress 0.3024596
## Run 19 stress 0.3018232
## ... Procrustes: rmse 0.01452104 max resid 0.1199656
## Run 20 stress 0.302813
## *** No convergence -- monoMDS stopping criteria:
## 4: no. of iterations >= maxit
## 16: stress ratio > sratmax
# Extracting scores
## Scores from PCA
Scor <- vegan::scores(vrda)$sites
# recreate coordinate names
nam <- apply(stringr::str_split(rownames(scores(vrda)$sites),
"_", simplify =T),2,as.numeric)
# build the dataframe
mdsScor <- data.frame(nam, Scor)
par(mfrow = c(1,2), mar = c(0,0,3,0))
cex = 0.7
plot(mdsScor$X1~ mdsScor$X2,
cex = cex,
pch = 15,
xaxt = "n",
yaxt = "n",
xlab = "",
main = "PCA1",
ylab = "",
col = f(log(mdsScor$PC1+2), 5, "Set2"))
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)
plot(mdsScor$X1~ mdsScor$X2,
cex = cex,
xaxt = "n",
yaxt = "n",
xlab = "",
ylab = "",
main = "PCA2",
pch = 15, col = f(log(mdsScor$PC2+7), 5, "Set2"))
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)
# Create a graph
SpGraph <- igraph::graph_from_incidence_matrix(SpTab2)
# Spectrum of adjacency matrix
eiVal <- igraph::spectrum(
SpGraph,
which = list(pos = "LM",
howmany = 3)
)
eiVal <- eiVal$vectors
par(mfrow = c(2,2), mar = c(4,4,2,2))
plot(eiVal[,1], eiVal[,3],
pch = ".", cex = 2,
col = c(rep("red", 466), rep("green", 903)))
plot(eiVal[,1], eiVal[,2],
pch = ".", cex = 2,
col = c(rep("red", 466), rep("green", 903)))
plot(eiVal[,2], eiVal[,3],
pch = ".", cex = 2,
col = c(rep("red", 466), rep("green", 903)))
# Graph laplacian
LapMat <- igraph::laplacian_matrix(SpGraph)
# Laplacian spectra
LapEig <- eigen(LapMat)
dfV <- data.frame(mdsScor, LapEig$vectors[1:dim(SpTab2)[1],][,c(1,2,dim(SpTab2)[1]-1)])
names(dfV) <- c("X1", "X2", "PC1", "PC2", "Lap1", "Lap2", "Lap-2")
par(mfrow = c(2,2), mar = c(1,1,1,1))
cex = 0.35
col <- sign(dfV$`Lap-2`)+2
plot(mdsScor$X1~ mdsScor$X2,
xaxt = "n",
yaxt = "n",
xlab = "",
ylab = "",
frame = F,
main = "Symmetry",
col = col,
pch = 1, cex = cex)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)
plot(mdsScor$X1~ mdsScor$X2,
xaxt = "n",
yaxt = "n",
xlab = "",
ylab = "",
frame = F,
main = "Connectivity",
col = f(dfV$`Lap-2`, 7, "Dark2"),
pch = 1, cex = cex)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)
plot(mdsScor$X1~ mdsScor$X2,
xaxt = "n",
yaxt = "n",
xlab = "",
ylab = "",
frame = F,
main = "Generality",
col = f(log(abs(dfV$Lap1)), 8, "PuOr"),
pch = 1, cex = cex)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)
plot(mdsScor$X1~ mdsScor$X2,
xaxt = "n",
yaxt = "n",
xlab = "",
ylab = "",
frame = F,
main = "Centrality",
col = f(log(abs(dfV$Lap2)), 8, "PuOr"),
pch = 1, cex = cex)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)
WORLDCLIM <-raster::getData("worldclim",
var="bio",
res=10 )
BIO1 <- (raster::intersect(WORLDCLIM$bio1,
SpatialPoints(data.frame(mdsScor$X2,
mdsScor$X1),
proj4string = WORLDCLIM$bio1@crs)))
BIO12 <- (raster::intersect(WORLDCLIM$bio12,
SpatialPoints(data.frame(mdsScor$X2,
mdsScor$X1),
proj4string = WORLDCLIM$bio1@crs)))
climSpace = data.frame(coordinates(BIO1),
"BIO1" = getValues(BIO1))
climSpace2 = data.frame(coordinates(BIO12),
"BIO12" = getValues(BIO12))
climSpace$unID <- paste(round(climSpace$x),
round(climSpace$y),
sep = "_")
climSpace2$unID <- paste(round(climSpace2$x),
round(climSpace2$y),
sep = "_")
climSpace$Bio12 <- climSpace2$BIO12[match(climSpace$unID, climSpace2$unID)]
climSpace <- droplevels(climSpace[complete.cases(climSpace),])
colKM = (KM$KM4[match(climSpace$unID,
paste(KM$X2,
KM$X1,
sep = "_"))])
colCen = mdsScor$E3[match(climSpace$unID,
paste(KM$X2,
KM$X1,
sep = "_"))]
par(mfrow = c(1,2))
plot(scale(climSpace$BIO1),scale(climSpace$Bio12),
col = scales::alpha(f((colKM), 4,"Dark2"),0.5),
cex = 0.7,
pch = "+",
main = "K-groups",
xlab = "Mean annual temperature",
ylab = "Mean annual precipitation",
frame = F)
plot(scale(climSpace$BIO1),scale(climSpace$Bio12),
col = scales::alpha(f((colCen), 7,"PuOr"),0.5),
cex = 0.7,
pch = "+",
main = "3 Eigenvector",
xlab = "Mean annual temperature",
ylab = "Mean annual precipitation",
frame = F)