0.1 Before you start

Make sure you have the following R packages installed

library(igraph)
library(vegan)
library(raster)
library(rworldmap)
library(betapart)
library(RColorBrewer)
library(stringr)
library(maps)


# function to colorize https://gist.github.com/fgabriel1891/2fb219c376ec3e6b33402dc099a7e294

f <- function(x,n=10, pal, rev = F){
  if(rev == F){ 
    rev(RColorBrewer::brewer.pal(n, pal))[cut(x,n)]
  }else{
    (RColorBrewer::brewer.pal(n, pal))[cut(x,n)]
  }
}

Remember that you can install packages in R with install.packages("package name")

0.2 Tutorial objectives

Understand the basis of biogeographical regionalization
Quantify spatial turnover of communities
Deliniate biogeographic regions
Model climate envelopes* (advanced)

1 Biodiversity

Diversity of life in the planet

Dimensions of biodiversity

2 Biogeographic regions

The characterization of geographical areas in terms of biodiversity.
Areas with distinct evolutionary stories.
Comprehensive understanding of the drivers of biological organization

Biogeographic realms

3 Global biodiversity databases

Since early societies, people have been interested to study the distribution of life in the planet. In the illustration age, Carl Linnaeus and other natural historians like Humboldt, Wallace, and Darwin sistematized the study of biodiversity distribution. Since then, centuries of scientific research produced invaluable information on biodiversity. Nowadays, this corpus of biodiversity knowledge is being aggregated into large-scale datasets which are freely available in the internet. For example, digital information infrastructures such as the Global Biodiversity Information Facility (GBIF) increases the accesiblity of data on the distribution of life across the planet. These, digitally available data have various origins such as biodiversity inventories, field research reports, field observations, citizen science, etc.

GBIF

3.1 Format and standards

There are two main ways how the geographic distribution of species is digitally stored in databases.

geocoded species observations

This is a simple format to store information on species distribution. At the minimal level a geocoded point involves tree variables which are the 2 spatial coordinates and the species’ (or other taxonomic category) name. This format is probably the most common way which information in the geography of biodiversity is stored in the globe. This simple format make it easy to store and takes little space in memory. However, relying only on geocoded points has is limitations. For instance the spatial patterns can be strongly because of sampling effort. For poorly sampled species, a single new observation can dramatically increase the extent of the species geographic distribution. Similarly, the sparsity of observations may influence community similarity measures. This problem is particularly accentuated in biodiversity rich and/or poorly sampled regions of the globe such as the tropics, mountains, or near the poles.

Species ranges

The delineation of species geographical ranges is second commonly used format to store information on species distribution. Instead of points, species ranges represent a portion of any representation of the geographical surface of the planet. The delineation is often done based the information on species point data + climate models + expert knowledge. Digital information of species ranges usually come stored as vector polygons or rasters. For instance, a common format is the convex hull, which is the smallest set in space that includes all the species points. More sophisticated methods, such as species distribution modelling uses climatic variation to infer species environmental niches by modelling probabilities of species occurrences in space. Conservation agencies such as the IUCN use this species distribution models along taxonomic and local experts to delimit a more continuous surface to represent the spatial range of a species. Species ranges hold more information than geocoded points at the species level, however biodiversity information at this level of detail is only available for a limited number of taxa.

x	y	sp
36	-15	Sp 5
35	-22	Sp 4
23	-6	Sp 2
32	-10	Sp 2
5	-36	Sp 2
7	-32	Sp 5
31	-14	Sp 2
14	-23	Sp 4
39	-7	Sp 4
29	-13	Sp 2

https://www.iucnredlist.org/species/172220/6852079

https://www.iucnredlist.org/species/173248/6979817

4 Quantifying biodiversity gradients

4.1 The region of interest (R.O.I)

The region of interest corresponds to the space encompassing the sum of the biodiversity total variance. The ROI is the space which we are interested to classify into definable regions based on its biodiversity.

A region of the space V is composed by the set of points in the neighborhood around a point p located at the space V

Depending of the research question or because information limitation, researchers often limit the region of interest to an extent. Boundaries for a region of interest can be placed in geographical space, climatic space, across a temporal axis, and to particular taxonomic groups.

4.2 The species-site matrix.

This matrix holds information on the presence/absence or relative abundances of a set of species throughout a series of localities (sites). In our tutorial, the sites correspond to each of the 1x1 degree grids. For this tutorial we will focus only on the aspect of biodiversity change that comes from the variance of species presences and absences in the ROI. Therefore, entries in the species-site matrix are coded as 1 if a species \(j\) is present at a grid \(n\), and 0 otherwise. Row vectors of this matrix represent the spatial variation of each species across the ROI. Column vectors represent the composition of a site in terms of species.

\[\begin{bmatrix} & site1 & site2 & ... & site_n \\Sp1 & 1 & 0 & ...& 1 \\Sp2 & 0 & 1 & ...& 1 \\Sp3 & 1 & 1 & ...& 0 \\Sp4 & 1 & 0 & ...& 1 \\\vdots & \vdots & \vdots & \vdots & \vdots \\Sp_j & 1 & 1 & 1 & 0 \end{bmatrix}\]

4.2.1 Resolution

Resolution and scale is important to consider before we start to examine biodiversity variation in space. Importantly, because of data limitation at larger scales often is necessary to aggregate point pattern data. The chosen grid resolution both at the area and limits can influence the overall patterns we may be able to identify.

Discussion point:

How different choices of grid resolution affect our observations of spatial patterns in biodiversity?

What is the relationship between resolution and scale?

Resolution

4.3 Community dissimilarities

This differences in species composition is commonly known as species turnover and it is a separate component (ß) of biodiversity. We will compute pairwise dissimilarities in grid species composition. Pairwise dissimilarities in species composition among communities are in function of the number of shared species and the sum of the species diversity of a pairwise set of communities. There are several metrics of community dissimilarity. I encourage you to explore that further in your own. In this tutorial we will use the Simpson dissimilarity index as a measure of species turnover.

\[ S_i = \frac{a}{a + min(b,c)} \]

being \(a\) the number of shared species and \(min(b,c)\) the minimum number of species between two communities.

There are several advantages of Simpson dissimilarity index relative to other indexes. For instance Simpsons uses only the poorest community in the denominator to account for large differences in richness between communities. This means that the index is not biased by strong richness differences between sites, therefore removing the contribution of nestedness to the pairwise dissimilarities

High nestedness

\[\begin{bmatrix} & site1 & site2 & site_3 \\Sp1 & 1 & 1 & 1 \\Sp2 & 1 & 1 & 1 \\Sp3 & 1 & 1 & 0 \\Sp4 & 1 & 1 & 0 \\Sp5 & 1 & 1 & 0 \\Sp6 & 1 & 0 & 0 \\Sp7 & 1 & 0 & 0 \\Sp8 & 1 & 0 & 0 \\Sp9 & 1 & 0 & 0 \end{bmatrix}\]

High turnover

\[\begin{bmatrix} & site1 & site2 & site_3 \\Sp1 & 1 & 1 & 1 \\Sp2 & 1 & 1 & 1 \\Sp3 & 1 & 1 & 0 \\Sp4 & 1 & 1 & 0 \\Sp5 & 0 & 1 & 0 \\Sp6 & 0 & 1 & 0 \\Sp7 & 0 & 1 & 0 \\Sp8 & 0 & 0 & 1 \\Sp9 & 0 & 0 & 1 \end{bmatrix}\]

5 Defining biogeographic regions

Biogeographic regions are defined by the variance of the species dissimilarities among sites. There are two main ways how we can approach this problem of linearizing multidimensional variation in community dissimilarities to define biogeographic regions. We can classify or communities into discrete categories or continuously ordinate them in fewer multivariate dimensions.

5.1 Discrete partitioning

5.1.1 K-means clustering

K-means is a fast clustering algorithm. The k-means algorithm searches to partition the variance of a matrix into groups of k clusters centered around k means.

This is common methodology for un-supervised classification tasks. That is, we let the data to tell us the best partition into separate classes.

steps for the K-means algorithm (from wikipedia)

Standardizing distances

K-means algorithm assumes euclidean distances to compute k clusters. Simpson dissimilarity index index does not project linearly into euclidean space. The Hellinger standardization is a workaround which let’s us use euclidean metrics with community dissimilarity metrics.

\[y_{i,j} = \sqrt{\frac{y_{i,j}}{y_i}}\]

5.1.2 selecting an optimal k

The main limitation of the k-means algorithm is that the cluster association vector is always dependent on the values of k. How do we know the best k then?. This therefore becomes an optimization problem for the value of k.

A solution to find the best k is to iterate the k-means algorithm for a series of k within a range of reasonable estimate for the number of clusters given our community matrix. (e.g k << S or k < S not k = S or k > S)

We optimize the ratio between the variance within k given clusters in relation with the total amount of variation in the matrix. We select the k parameter which the sum of squares of the distances between k clusters (i.e. the distances of the k-centroids to its mean centroid) is closest to the sum of squares of the whole matrix.

5.2 Continous regionalization

5.2.1 Dimensionality reduction

Dimensionality reduction comprise various statistical techniques which looks to fit as much variance possible into the first few of a set of orthogonal linear multivariate vectors. There are various parametric techniques for dimensionality reduction, being Principal component analysis (PCA) among the most widely use. Non-parametric techniques also exist such as the Non-Metric Multidimensional (NMDS), which finds the most parsimonious arrangement into planar space that preserve the community wide pattern of pairwise distances among entries in a matrix.

Principal component analysis

http://ordination.okstate.edu/overview.htm

https://wiki.qcbs.ca/r_workshop9

Non-metric multidimensional scaling NMDS:

http://www.flutterbys.com.au/stats/tut/tut15.1.html

http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf

5.2.2 Spectral partitioning

The species-site matrix can also be conceptualized as a bipartite graph. In this graph, Vertices (nodes) are either species or sites. Edges (links) connect vertices based on the occurrence of a given species in a site. This is called a bipartite graph because edges connect nodes that belong to a different set. That is, edges in this graph exist between species and sites but not among species, nor sites. A graph representation opens up various possibilities to examine the structure of species turnover across space.

Incidency matrix

\[ G = \begin{bmatrix} & site1 & site2 & site3 & site_3 \\Sp1 & 1 & 0 & 1& 1 \\Sp2 & 0 & 1 & 0& 1 \\Sp3 & 1 & 1 & 1& 0 \\Sp4 & 1 & 0 & 1 & 1 \\Sp_5 & 1 & 1 & 1 & 0 \end{bmatrix}\]

Bipartite graph

## Warning in igraph::layout.bipartite(grah): vertex types converted to logical

Adjacency matrix = A

	Species1	Species3	Species4	Species5	Species2	Site1	Site2	Site3	Site4
Species1	0	0	0	0	0	1	0	1	1
Species3	0	0	0	0	0	1	1	1	0
Species4	0	0	0	0	0	1	0	1	1
Species5	0	0	0	0	0	1	1	1	0
Species2	0	0	0	0	0	0	1	0	1
Site1	1	1	1	1	0	0	0	0	0
Site2	0	1	0	1	1	0	0	0	0
Site3	1	1	1	1	0	0	0	0	0
Site4	1	0	1	0	1	0	0	0	0

Degree matrix = D

	Species1	Species2	Species3	Species4	Site1	Site2	Site3	Site4	Site5
Site1	3	0	0	0	0	0	0	0	0
Site2	0	3	0	0	0	0	0	0	0
Site3	0	0	3	0	0	0	0	0	0
Site4	0	0	0	3	0	0	0	0	0
Site5	0	0	0	0	2	0	0	0	0
Species1	0	0	0	0	0	4	0	0	0
Species2	0	0	0	0	0	0	3	0	0
Species3	0	0	0	0	0	0	0	4	0
Species4	0	0	0	0	0	0	0	0	3

Laplacian matrix = L

\[ L = D-A \]

	Species1	Species3	Species4	Species5	Species2	Site1	Site2	Site3	Site4
Species1	3	0	0	0	0	-1	0	-1	-1
Species3	0	3	0	0	0	-1	-1	-1	0
Species4	0	0	3	0	0	-1	0	-1	-1
Species5	0	0	0	3	0	-1	-1	-1	0
Species2	0	0	0	0	2	0	-1	0	-1
Site1	-1	-1	-1	-1	0	4	0	0	0
Site2	0	-1	0	-1	-1	0	3	0	0
Site3	-1	-1	-1	-1	0	0	0	4	0
Site4	-1	0	-1	0	-1	0	0	0	3

Spectral profile of a graph

\[ Av= \lambda v\]

6 Case study

Bryophytes are non-vascular plants that occur in many habitats across the globe. Bryophytes include liverworts, mosses, and hornworts. Bryophytes have been sampled extensively in various regions of the Atlantic and Pacific zones of Canada and the United States. For this tutorial we will quantify the spatial turnover of bryophyte communities across the US and Canada at 1ª resolution. Recently, the Beaty Museum of the University of British Columbia made available the information on their entire collection of bryophyte occurrences. The dataset is freely available through GBIF.

This dataset comes from the bryophyte collections of the University of British Columbia.

https://beatymuseum.ubc.ca/research-2/collections/herbarium/herbarium-bryophytes/

6.1 Loading the dataset

Let’s load the dataset, which is currently in the data folder stored as a .csv file.

briofCan <- read.csv("data/BriophitesCAN.csv", 
                     header = T,
                     sep = "\t") # note we are specifing a tab-delimited file. 

head(briofCan) # first rows

##       gbifID                           datasetKey
## 1 1987794314 4edd9396-59df-4b01-9e29-dc21a59f9963
## 2 1987792636 4edd9396-59df-4b01-9e29-dc21a59f9963
## 3 1987792129 4edd9396-59df-4b01-9e29-dc21a59f9963
## 4 1987793965 4edd9396-59df-4b01-9e29-dc21a59f9963
## 5 1987793938 4edd9396-59df-4b01-9e29-dc21a59f9963
## 6 1987792195 4edd9396-59df-4b01-9e29-dc21a59f9963
##                           occurrenceID kingdom          phylum
## 1 A77D8408-E621-B14B-ACC8-022FA5867C3B Plantae       Bryophyta
## 2 A7856BA9-045B-D940-9EF8-86F982F61919 Plantae       Bryophyta
## 3 A78F39BC-D010-F747-B245-24B9E067AFD3 Plantae       Bryophyta
## 4 A796CD2C-24D6-2C46-89DD-E4A5E9B51738 Plantae       Bryophyta
## 5 A798A3E3-65C6-1948-A621-B5ACA58AC3D7 Plantae Marchantiophyta
## 6 A79A61B8-6810-4243-9CA6-A7EC4872D999 Plantae       Bryophyta
##               class           order          family         genus
## 1         Bryopsida      Grimmiales     Grimmiaceae     Dryptodon
## 2     Andreaeopsida     Andreaeales    Andreaeaceae      Andreaea
## 3         Bryopsida      Grimmiales     Grimmiaceae     Dryptodon
## 4         Bryopsida      Dicranales     Dicranaceae     Pilopogon
## 5 Jungermanniopsida Jungermanniales Plagiochilaceae   Plagiochila
## 6         Bryopsida       Pottiales      Pottiaceae Hymenostylium
##                        species infraspecificEpithet  taxonRank
## 1             Dryptodon patens                         SPECIES
## 2           Andreaea rupestris            rupestris SUBSPECIES
## 3             Dryptodon patens                         SPECIES
## 4      Pilopogon guadalupensis                         SPECIES
## 5    Plagiochila schofieldiana                         SPECIES
## 6 Hymenostylium recurvirostrum                         SPECIES
##                                       scientificName
## 1                      Dryptodon patens Bridel, 1826
## 2                Andreaea rupestris subsp. rupestris
## 3                      Dryptodon patens Bridel, 1826
## 4                   Pilopogon gracilis (Hook.) Brid.
## 5                    Plagiochila schofieldiana Inoue
## 6 Bryoerythrophyllum recurvirostrum (Hedw.) P.C.Chen
##                          verbatimScientificName
## 1      Dryptodon patens (Dicks. ex Hedw.) Brid.
## 2     Andreaea rupestris Hedw. subsp. rupestris
## 3      Dryptodon patens (Dicks. ex Hedw.) Brid.
## 4              Pilopogon gracilis (Hook.) Brid.
## 5            Plagiochila schofieldiana H. Inoue
## 6 Bryoerythrophyllum recurvirostre (Hedw.) Chen
##   verbatimScientificNameAuthorship countryCode
## 1          (Dicks. ex Hedw.) Brid.          CA
## 2                                           CA
## 3          (Dicks. ex Hedw.) Brid.          US
## 4                    (Hook.) Brid.          PA
## 5                         H. Inoue          US
## 6                     (Hedw.) Chen          CA
##                                                                      locality
## 1                Paulson Bridge on Rte 3 between Castlegar and Christina Lake
## 2      Princess Royal Island, Chapele Inlet; stream canyon near mouth, E side
## 3                  Adak Quadrangle, Adak Is., Aleutian Islands, Mt. Reed; Co:
## 4 cerro colorado, 4.3 i. above chami camp. roadside along the ridge, ca 3 mi.
## 5     NW of Alexai Point, S slope of Gilbert Ridge, Attu Island, Aleutian Is.
## 6                           Canoe Lake, S slope of ridge E of Canoe Lake; Co:
##           stateProvince occurrenceStatus individualCount
## 1      British Columbia          PRESENT              NA
## 2      British Columbia          PRESENT              NA
## 3                Alaska          PRESENT              NA
## 4        Bocas Del Toro          PRESENT              NA
## 5                Alaska          PRESENT              NA
## 6 Northwest Territories          PRESENT              NA
##                       publishingOrgKey decimalLatitude decimalLongitude
## 1 b542788f-0dc2-4a2b-b652-fceced449591            49.2           -118.1
## 2 b542788f-0dc2-4a2b-b652-fceced449591            52.9           -129.1
## 3 b542788f-0dc2-4a2b-b652-fceced449591            51.8           -176.7
## 4 b542788f-0dc2-4a2b-b652-fceced449591             8.6            -81.8
## 5 b542788f-0dc2-4a2b-b652-fceced449591            52.9            173.2
## 6 b542788f-0dc2-4a2b-b652-fceced449591            68.2           -135.9
##   coordinateUncertaintyInMeters coordinatePrecision elevation elevationAccuracy
## 1                          7303                  NA        NA                  
## 2                          6854                  NA        NA                  
## 3                          7006                  NA        NA                  
## 4                         11024                  NA        NA                  
## 5                          6854                  NA        NA                  
## 6                          4170                  NA        NA                  
##   depth depthAccuracy           eventDate day month year taxonKey speciesKey
## 1    NA            NA 1977-08-05T00:00:00   5     8 1977  5281814    5281814
## 2    NA            NA 1986-08-09T00:00:00   9     8 1986  7347162    5283962
## 3    NA            NA 1986-08-06T00:00:00   6     8 1986  5281814    5281814
## 4    NA            NA 1986-06-21T00:00:00  21     6 1986  8232977    2675529
## 5    NA            NA 2000-08-12T00:00:00  12     8 2000  8003542    8003542
## 6    NA            NA 1965-06-20T00:00:00  20     6 1965  2670498    7347186
##        basisOfRecord institutionCode collectionCode catalogNumber  recordNumber
## 1 PRESERVED_SPECIMEN          ca.ubc  UBCBryophytes        B18100         67210
## 2 PRESERVED_SPECIMEN          ca.ubc  UBCBryophytes       B107037         86589
## 3 PRESERVED_SPECIMEN          ca.ubc  UBCBryophytes       B207029      1986-187
## 4 PRESERVED_SPECIMEN          ca.ubc  UBCBryophytes       B110795          5341
## 5 PRESERVED_SPECIMEN          ca.ubc  UBCBryophytes       B185157        115694
## 6 PRESERVED_SPECIMEN          ca.ubc  UBCBryophytes        B81228 plot no. 31-6
##       identifiedBy dateIdentified license                   rightsHolder
## 1                                 CC0_1_0 University of British Columbia
## 2 B.M. Murray,1988                CC0_1_0 University of British Columbia
## 3   W.B. Schofield                CC0_1_0 University of British Columbia
## 4                                 CC0_1_0 University of British Columbia
## 5   W.B. Schofield                CC0_1_0 University of British Columbia
## 6     O. Lee, 1985                CC0_1_0 University of British Columbia
##                    recordedBy typeStatus establishmentMeans
## 1   W.B. Schofield, G.F. Otto                              
## 2              W.B. Schofield                              
## 3           Stephen S. Talbot                              
## 4                  B.H. Allen                              
## 5 W.B. Schofield, S.S. Talbot                              
## 6     J. Lambert, D. Morrison                              
##            lastInterpreted mediaType
## 1 2020-12-15T22:17:01.559Z          
## 2 2020-12-15T22:17:01.559Z          
## 3 2020-12-15T22:17:01.559Z          
## 4 2020-12-15T22:17:01.560Z          
## 5 2020-12-15T22:17:01.560Z          
## 6 2020-12-15T22:17:01.560Z          
##                                                                   issue
## 1                   GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 2                   GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 3                   GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 4                   GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 5                   GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 6 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID;TAXON_MATCH_FUZZY

length(unique(briofCan$speciesKey)) # how many unique species names

## [1] 4517

length(unique(briofCan$scientificName)) # how many unique species codes

## [1] 6012

Let’s observe the geographic information contained in the dataset

plot(briofCan$decimalLatitude~ briofCan$decimalLongitude, 
     pch = ".", 
     frame = F,
     main = "Bryophytes recorded in dataset",
     xlab = "Long",
     ylab = "Lat")

6.2 Dataset housekeeping

The raw dataset is extensive, and there are some data which is obviously misplaced.

We must have a clean version of the dataset before we start doing any analyses on it.

Let’s start by removing all fields that have no associated geographic information.

coords <-  data.frame("x"=briofCan$decimalLongitude, 
                      "y"=briofCan$decimalLatitude)
# only georeferenced records 
briofCan <- briofCan[complete.cases(coords),]

Second, let’s set our ROI within the continental political boundaries of Canada and the United States of America to do some more housekeeping.

# subset coords

# only records coded to CA and US 
US_CAN <- countriesLow[countriesLow$SOVEREIGNT %in% c("Canada", "United States of America"),]
# remove empty records
coords <- coords[complete.cases(coords),]
# remove areas with no geographic information
code <- over(SpatialPoints(coords, 
                           proj4string = crs(US_CAN)
                           ),US_CAN)$POSTAL
# add points to the dataset
coords$code <- code 
# Subset the dataset based on coords
briofCanClean <- briofCan[coords$code %in% c("CA", "US"),]
# delete unusued levels and subset only based on phylum Bryophyta
briofCanClean <- droplevels(briofCanClean[briofCanClean$phylum == "Bryophyta",])

# clean hawai (to keep records only at the continental shelf)
briofCanClean <- briofCanClean[briofCanClean$decimalLatitude > 30 & briofCanClean$decimalLongitude <0,]

# How many observations are we left with?
length(unique(briofCanClean$speciesKey))

## [1] 1205

length(unique(briofCanClean$scientificName))

## [1] 1681

# let's plot the geographic distribution of the clean data
plot(decimalLatitude~ decimalLongitude, 
     xlab = "Longitude",
     ylab = "Latitude",
     pch =  15,
     cex = 0.1,
     col = "red",
     data = briofCanClean,
     frame = F)
maps::map(regions = c("USA", "Canada"),fill = F,
          col = "black",
          add = T)

6.2.1 Where can I get data to limit the R.O.I. ?

These are some sources of information on planetary scale geographical limits.

6.3 Aggregating species occurrence to grids in geographical space

For this tutorial we will define biogeographic regions using a spatial resolution of 1x1 degree. This means that we will consider all species in a grid as separate biological “communities”. It is important to discuss the ecological implications of this assumption. For instance, we can question whether the variation in grids represents accurate representations of the ecological forces defining the coexistence of species into communities. Similarly, we can question whether the biodiversity variance is homogeneous across all grids. For example, within grid variance is expected to be high in biodiversity rich areas such as the tropical rain forest in contrast to biodiversity poor regions such as desserts. For continental scale studies 1x1 degree is often a standard.

range(briofCanClean$decimalLatitude)

## [1] 30.18 82.50

# lets round to 1 degree... very simple way. 

briofCanClean$latRound <- round(briofCanClean$decimalLatitude)
briofCanClean$lonRound <- round(briofCanClean$decimalLongitude)
# get a unique id to identify each grid
briofCanClean$unID <- paste(briofCanClean$latRound,
                            briofCanClean$lonRound, 
                            sep = "_")

# Plot results 
plot(briofCanClean$lonRound,
     briofCanClean$latRound, 
     xlab = "Longitude",
     ylab = "Latitude",
     pch =  15,
     cex = 0.4,
     col = "firebrick", 
     frame = F)
maps::map(regions = c("USA", "Canada"),
          fill = F,
          col = "black",
          add = T)

6.3.1 Species-site matrix

# Build the species site matrix
gridTab <- table(briofCanClean$unID,
                 briofCanClean$speciesKey )
# separate effort vs presence absence. 
gridTab2 <- gridTab # sampling effort
gridTab[gridTab > 1 ] <- 1 # p/a

6.3.2 Richness distribution and sampling biases

# obtain the residuals from a linear model of the richness as a function of effort
res <- residuals(lm(c(gridTab)~c(gridTab2)))
# reconstitude the  sqared matrix with the vector of residuals
resMat <- matrix(res,
                 nrow = nrow(gridTab),
                 ncol =  ncol(gridTab))
# add names the columns and rows
colnames(resMat) <- colnames(gridTab)
rownames(resMat) <- rownames(gridTab)

plotAmap <- function(gridTab, main ){
  
  dat <- data.frame(
    "div"= rowSums(gridTab),
    data.frame(
      apply(stringr::str_split(rownames(gridTab),
                               "_", simplify= T),
            2,
            as.numeric)
    )
  )
   
   
   plot(dat$X2, dat$X1, 
        main = main,
        pch = ".",
        yaxt = "n",
        xaxt = "n",
        frame= F,
        xlab = "Long", 
        ylab= "Lat",
        col = f(dat$div, 7, "Reds", T),
        cex = log(abs(dat$div)))
   
   maps::map(regions = c("USA", "Canada"),
          fill = F,
          col = "black",
          add = T)
    
}

par(mfrow = c(1,2), mar = c(0,0,4,0))
plotAmap(gridTab, main = "Richness distribution")
plotAmap(resMat, "Richness corrected by effort")

6.4 Community dissimilarities

Let’s define a function to compute this Simpson similarity index and apply it to our data to obtain a dissimilarity matrix .

# Function to compute the simpson dissimilarity index among sites in our matrix and return a dissimilarity matrix 
# param: SpTab = species-site matrix 

getDisMat <- function(SpTab){
  # How to measure similarity? 
  # presence-absence 
  SpTabPA <-SpTab
  SpTabPA[SpTabPA>1] <- 1
  
  # Simpson dissimilarity index 
  
  # Calcualate "simpson dissimilarity index" (Bsim part of sorensen dissimilarity)
  part <- betapart::beta.pair(SpTabPA,
                              "sorensen")
  # summary(part$beta.sim) ##  access the portion of the object that correspond to the Simpson index of dissmilarity 
  
  distMat <- part$beta.sim
  
  return(distMat)
  
  
}

# apply the function and get thde dissimilarity of the species site matrix
distMat <- getDisMat(gridTab)

Lets observe the variance of the dissimilarity matrix, how many clusters can you observe by eyeballing a heatmap?

# plot a heatmap of the distance matrix
heatmap(as.matrix(distMat))

6.5 Dissimilarity decay fields

# visualize single dimensions in space. 


# 1) compute the variance of each grid (i.e matrix rows)

varVec <- sapply(1:nrow(as.matrix(distMat)),
                 function(x) var(as.matrix(distMat)[,x]))

# 2) define a function that selects the vector of variances as a sampling probability vector to sample rows from the distance matrix and plot the results as a map   

plotFocalPoint <- function(distMat, varVec){
  # Sample a random focal point from the distance matrix
  focalPoint <- data.frame("FP"= 
                             as.matrix(distMat)[,which(varVec == sample(varVec,1))])
  
  # make a dataframe with the distance row-vector and the spatial coordiantes of the grids
  distPlot <- data.frame(stringr::str_split(
    rownames(focalPoint), 
    "_",
    simplify = T),
    "dist" =  focalPoint[,1])
  # add names to the dataframe
  names(distPlot) <- c("lat", "lon", "val")
  # make a dataframe with the values of distPlot transformed into numeric
  distPlot <- data.frame(
    apply(distPlot,2,as.numeric)
  )

  # plot results
  plot(distPlot$lat~distPlot$lon,
       xlim = c(-170,-50),
       ylim  = c(20,90), 
       xaxt = "n", 
       yaxt = "n", 
       xlab = "",
       ylab  = "",
       pch = 15, cex = 0.5)
  maps::map(regions = c("USA","Canada"),
            fill = T,
            col = "white",
            add = T)
  
  points(distPlot$lat~distPlot$lon,
         pch = 16, 
         cex = 0.7,
         col = scales::alpha(
           f(log(distPlot$val+1), 
             9,
             "YlOrRd", T), 
           1-log(distPlot$val+1)))
  maps::map(regions = c("USA", "Canada"),
            fill = F,
            col = "black",
            add = T)
  
  
}

# 3) Apply the function iteratively to select i=8 random focal points and plot the results  

par(mfrow = c(4,2), mar = c(0,0,0,0), bg = "gray")

for(i in 1:8){
  plotFocalPoint(distMat,varVec)
}

6.5.1 removing potential sources of noise

We must remember that we deal with incomplete records so there is the potential for non-ecologically relevant sources of variation. For instance, poorly sampled localities may inflate the total sum of pairwise dissimilarities, reducing our ability to disentangle patterns. There is not a magic number of grids or species to preserve but it rather depends to the study objectives, dataset characteristics, and study scale and resolution.

# remove species and grids with little information 

SpTab2  <- gridTab[rowSums(gridTab) > 10,
                 -which(colSums(gridTab) 
                        %in% c(1:2))]
# compute the distance matrix of the new prunned matrix
distMat2 <- getDisMat(SpTab2)

6.6 Discrete partitioning

# standardize distances and apply k-means for a range of k. 
cascadeKM <- vegan::cascadeKM(vegan::decostand(distMat2, 
                                               "hellinger"), # standardization of distances
                              inf.gr = 2, # lowest k tested
                              sup.gr = 10) # highest k tested

Lets observe the change on the sum of squares and the similar Calinski criterion after applying the k-means for a range of \(k={2,3,4,5,6,7,8,9,10}\)

KM2 <- cascadeKM$partition[,1]
KM3 <- cascadeKM$partition[,2]
KM4 <- cascadeKM$partition[,3]
KM <- data.frame(KM2,
                 KM3,
                 KM4,
                 apply(stringr::str_split(names(KM2),
                                          pattern = "_", 
                                          simplify = T),
                       2, as.numeric)
                 )
head(KM)

##         KM2 KM3 KM4 X1   X2
## 31_-90    1   3   3 31  -90
## 32_-106   1   3   3 32 -106
## 33_-106   1   3   3 33 -106
## 33_-107   1   3   3 33 -107
## 33_-108   1   2   2 33 -108
## 35_-83    1   3   3 35  -83

par(mfrow = c(2,2), mar = c(1,1,1,1))
plot(KM$X2, KM$X1,
     col = f(KM$KM2, 2, "Accent"),
     pch = 1,
     xaxt = "n",
     yaxt = "n",
     main = "k = 2",
     frame = F,
     ylab = "Lat",
     xlab = "Long",
     cex = 0.35)

## Warning in RColorBrewer::brewer.pal(n, pal): minimal value for n is 3, returning requested palette with 3 different levels

maps::map(regions = c("USA", "Canada"),
          fill = F,
          col = "black",
          add = T )


plot(KM$X2, KM$X1,
     col = f(KM$KM3, 3, "Accent"),
     pch = 1,
     xaxt = "n",
     yaxt = "n",
     frame = F,
     main = "k = 3",
     ylab = "Lat",
     xlab = "Long",
     cex = 0.35)
maps::map(regions = c("USA", "Canada"),
          fill = F,
          col = "black",
          add = T )


plot(KM$X2, KM$X1,
     col = f(KM$KM4, 4, "Accent"),
     pch = 1,
     frame = F,
     xaxt = "n",
     yaxt = "n",
     main = "k = 4",
     ylab = "Lat",
     xlab = "Long",
     cex = 0.35)
maps::map(regions = c("USA", "Canada"),
          fill = F,
          col = "black",
          add = T )

6.7 Dimensionality reduction

# PCA
vrda <- vegan::rda(SpTab2)
# NMDS 
mds <- vegan::metaMDS(distMat2)

## Run 0 stress 0.3016674 
## Run 1 stress 0.3023848 
## Run 2 stress 0.3021204 
## ... Procrustes: rmse 0.01913395  max resid 0.1064816 
## Run 3 stress 0.302536 
## Run 4 stress 0.3018606 
## ... Procrustes: rmse 0.01759757  max resid 0.1179091 
## Run 5 stress 0.3029357 
## Run 6 stress 0.3052974 
## Run 7 stress 0.3023293 
## Run 8 stress 0.3021601 
## ... Procrustes: rmse 0.01865829  max resid 0.113893 
## Run 9 stress 0.3025779 
## Run 10 stress 0.3017371 
## ... Procrustes: rmse 0.01062475  max resid 0.09998755 
## Run 11 stress 0.3021437 
## ... Procrustes: rmse 0.01951051  max resid 0.1241724 
## Run 12 stress 0.3024356 
## Run 13 stress 0.3023524 
## Run 14 stress 0.3021527 
## ... Procrustes: rmse 0.02008749  max resid 0.09915881 
## Run 15 stress 0.3028218 
## Run 16 stress 0.3025918 
## Run 17 stress 0.3022332 
## Run 18 stress 0.3024596 
## Run 19 stress 0.3018232 
## ... Procrustes: rmse 0.01452104  max resid 0.1199656 
## Run 20 stress 0.302813 
## *** No convergence -- monoMDS stopping criteria:
##      4: no. of iterations >= maxit
##     16: stress ratio > sratmax

par(mfrow = c(1,2))
plot(mds)
plot(vrda)

# Extracting scores 
## Scores from PCA
Scor <- vegan::scores(vrda)$sites
# recreate coordinate names
nam <- apply(stringr::str_split(rownames(scores(vrda)$sites),
                                "_", simplify =T),2,as.numeric)
# build the dataframe
mdsScor <- data.frame(nam, Scor)

par(mfrow = c(1,2), mar = c(0,0,3,0))
cex = 0.7
plot(mdsScor$X1~ mdsScor$X2,
     cex = cex,
     pch = 15,
     xaxt = "n",
     yaxt = "n",
     xlab = "",
     main = "PCA1",
     ylab  = "",
     col = f(log(mdsScor$PC1+2), 5, "Set2"))
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
          add = T)
plot(mdsScor$X1~ mdsScor$X2,
     cex = cex,
     xaxt = "n",
     yaxt = "n",
     xlab = "",
     ylab  = "",
     main = "PCA2",
     pch = 15, col = f(log(mdsScor$PC2+7), 5, "Set2"))
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
          add = T)

6.8 Spectral partitioning

# Create a graph 
SpGraph <- igraph::graph_from_incidence_matrix(SpTab2)


# Spectrum of adjacency matrix

eiVal <- igraph::spectrum(
  SpGraph,
  which = list(pos = "LM", 
               howmany = 3)
  )

eiVal <- eiVal$vectors

par(mfrow = c(2,2), mar = c(4,4,2,2))
plot(eiVal[,1], eiVal[,3],
     pch = ".", cex = 2,
     col = c(rep("red", 466), rep("green", 903)))
plot(eiVal[,1], eiVal[,2], 
          pch = ".", cex = 2,
     col = c(rep("red", 466), rep("green", 903)))
plot(eiVal[,2], eiVal[,3], 
          pch = ".", cex = 2,
     col = c(rep("red", 466), rep("green", 903)))

# Graph laplacian 

LapMat <- igraph::laplacian_matrix(SpGraph)

# Laplacian spectra

LapEig <- eigen(LapMat)

dfV <- data.frame(mdsScor, LapEig$vectors[1:dim(SpTab2)[1],][,c(1,2,dim(SpTab2)[1]-1)])
names(dfV) <- c("X1", "X2", "PC1", "PC2", "Lap1", "Lap2", "Lap-2")

par(mfrow = c(2,2), mar = c(1,1,1,1))

cex = 0.35
col <- sign(dfV$`Lap-2`)+2

plot(mdsScor$X1~ mdsScor$X2,
     xaxt = "n",
     yaxt = "n",
     xlab = "",
     ylab  = "",
     frame = F,
     main  = "Symmetry",
     col = col, 
     pch = 1, cex = cex)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
          add = T)
plot(mdsScor$X1~ mdsScor$X2,
     xaxt = "n",
     yaxt = "n",
     xlab = "",
     ylab  = "",
      frame = F,
     main  = "Connectivity",
     col = f(dfV$`Lap-2`, 7, "Dark2"),
     pch = 1, cex = cex)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
          add = T)


plot(mdsScor$X1~ mdsScor$X2,
     xaxt = "n",
     yaxt = "n",
     xlab = "",
     ylab  = "",
      frame = F,
     main  = "Generality",
     col = f(log(abs(dfV$Lap1)), 8, "PuOr"),
     pch = 1, cex = cex)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
          add = T)

plot(mdsScor$X1~ mdsScor$X2,
     xaxt = "n",
     yaxt = "n",
     xlab = "",
     ylab  = "",
      frame = F,
     main  = "Centrality",
     col = f(log(abs(dfV$Lap2)), 8, "PuOr"),
     pch = 1, cex = cex)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
          add = T)

7 Biogeographic regionalization in climatic space

WORLDCLIM  <-raster::getData("worldclim",
                             var="bio",
                             res=10 )
BIO1 <- (raster::intersect(WORLDCLIM$bio1,
                           SpatialPoints(data.frame(mdsScor$X2,
                                                    mdsScor$X1), 
                                         proj4string = WORLDCLIM$bio1@crs)))

BIO12 <- (raster::intersect(WORLDCLIM$bio12,
                            SpatialPoints(data.frame(mdsScor$X2,
                                                     mdsScor$X1), 
                                          proj4string = WORLDCLIM$bio1@crs)))

climSpace = data.frame(coordinates(BIO1),
                       "BIO1" = getValues(BIO1))
climSpace2 = data.frame(coordinates(BIO12),
                        "BIO12" = getValues(BIO12))
climSpace$unID <- paste(round(climSpace$x),
                        round(climSpace$y),
                        sep = "_")
climSpace2$unID <- paste(round(climSpace2$x),
                         round(climSpace2$y), 
                         sep = "_")

climSpace$Bio12 <- climSpace2$BIO12[match(climSpace$unID, climSpace2$unID)]
climSpace <- droplevels(climSpace[complete.cases(climSpace),])

colKM  = (KM$KM4[match(climSpace$unID, 
                       paste(KM$X2,
                             KM$X1,
                             sep = "_"))])
colCen  = mdsScor$E3[match(climSpace$unID, 
                       paste(KM$X2,
                             KM$X1,
                             sep = "_"))]

par(mfrow = c(1,2))
plot(scale(climSpace$BIO1),scale(climSpace$Bio12),
   col = scales::alpha(f((colKM), 4,"Dark2"),0.5),
   cex = 0.7,
   pch = "+",
   main = "K-groups",
     xlab = "Mean annual temperature",
     ylab = "Mean annual precipitation",
     frame = F)

plot(scale(climSpace$BIO1),scale(climSpace$Bio12),
   col = scales::alpha(f((colCen), 7,"PuOr"),0.5),
   cex = 0.7,
   pch = "+",
   main = "3 Eigenvector",
     xlab = "Mean annual temperature",
     ylab = "Mean annual precipitation",
     frame = F)

Carter et al., 2016. Endemic regions of mosses in North America

Biogeographic regions, practical tutorial

Gabriel Muñoz

26/01/2021