Mapping Species Richness: Integrating Occurrence Data, Climatic Variables, and Phylogenetic Insights in a Global Grid Analysis - A minimal example

Author
Affiliation

Gabriel Munoz-Acevedo

Published

March 9, 2025

Abstract

In this notebook, there is a minimal tutorial to a spatial analysis pipeline that maps species occurrence points to WorldClim variables and overlays them on a global grid to quantify species richness per cell. There is also examples about using GLM, GLMM, and phylogenetic regression, to examine how climate variation to estimate an ordinal measure of ant polymorphism.

Keywords

Ant polymorphism, Phylogenetic regression, Global biodiversity, Ecological complexity, Quantitative ecology

1 Getting started

Before you start:

  • Make sure you have the latest version of R installed.

  • Open R in any IDE of your choosing (Rstudio, VScode, Jupyter, etc… )

1.1 Clone the repository

  • Clone the GitHub repository
How to clone a repository from gitHub
  1. Go to the GitHub repository page: https://github.com/lessardlab/Mapping-Ant-Species-Richness

  2. Copy the https link of the repository: https://github.com/lessardlab/Mapping-Ant-Species-Richness.git

  3. Open the terminal in your computer

  4. Set the directory where you want to clone the repository

  5. Type git clone https://github.com/lessardlab/Mapping-Ant-Species-Richness.git

  6. Hit enter

  7. Done! You have cloned the repository in your computer.

Important

Make sure to have Quarto installed if you go this route. For Rstudio users, Quarto comes preinstalled, for VScode and others, you need to download the Quarto extension.

2 Dependencies

To replicate this tutorial, make sure you have the following packages. To install a package, use install.packages('package_name') (Note you need to do it only once)

# install.packages('duckplyr')
# install.packages('geodata')

library(tidyverse) # for data manipulation
library(duckplyr) # for fast data processing 
library(phytools) # for phylogenetic regression
library(lme4) # for linear models 
library(rnaturalearth)
library(sf)
library(raster)

3 Sourcing data

For this tutorial we will use the ant polymorphism database publised as part of the article: LaRichelliere et al., 2023. Warm regions of the world are hotspots of superorganism complexity

The dataset is open and public. You can download your own copy of the data by cloning the paper GitHub repository: https://github.com/lessardlab/GlobalPolyMorp

# Source data on global ant polymorphism.

my_ant_data <- duckplyr_df_from_csv("Lat-Long_Data_GABI.csv")

summary(my_ant_data)
duckplyr: materializing
 gabi_acc_number    valid_species_name   country             dec_lat       
 Length:743211      Length:743211      Length:743211      Min.   :-55.083  
 Class :character   Class :character   Class :character   1st Qu.: -8.367  
 Mode  :character   Mode  :character   Mode  :character   Median : 14.481  
                                                          Mean   : 14.816  
                                                          3rd Qu.: 37.845  
                                                          Max.   : 88.416  
                                                                           
    dec_long         elevation      bentity2_name     
 Min.   :-180.00   Min.   : -80.0   Length:743211     
 1st Qu.: -85.02   1st Qu.: 130.0   Class :character  
 Median : -59.64   Median : 500.0   Mode  :character  
 Mean   : -19.46   Mean   : 669.8                     
 3rd Qu.:  34.87   3rd Qu.:1090.0                     
 Max.   : 179.97   Max.   :5300.0                     
 NA's   :151       NA's   :340610                     
head(my_ant_data)
duckplyr: materializing
# A tibble: 6 × 7
  gabi_acc_number valid_species_name        country dec_lat dec_long elevation
  <chr>           <chr>                     <chr>     <dbl>    <dbl>     <dbl>
1 GABI_01146836   Myrmecocystus.semirufus   USA        36.5    -117.       -80
2 GABI_01026649   Myrmecocystus.semirufus   USA        36.5    -117.       -80
3 GABI_01033314   Myrmecocystus.semirufus   USA        36.5    -117.       -80
4 GABI_01174725   Myrmecocystus.semirufus   USA        36.5    -117.       -80
5 GABI_01130976   Pogonomyrmex.californicus USA        36.5    -117.       -80
6 GABI_01032969   Pogonomyrmex.californicus USA        36.5    -117.       -80
# ℹ 1 more variable: bentity2_name <chr>

Let’s start with tidying the dataset. For instance, we can separate the Genus, species, and species name

my_ant_data <- 
  my_ant_data |> 
  mutate(Genus = str_extract(valid_species_name, "^([^.]+)"),
         species = str_extract(valid_species_name, "([^.]+)$"), 
         species_name_no_dot = str_replace(valid_species_name, "\\.", " "))

my_ant_data |>
  duckplyr::select(Genus, species, species_name_no_dot ) |> 
  DT::datatable()
duckplyr: materializing

For this tutorial, we will focus on the genus Cataglyphis. Let’s filter the dataset

my_catagliphis_data <- my_ant_data |>
  filter(Genus == 'Cataglyphis')


dim(my_catagliphis_data) # 1765 observations 
duckplyr: materializing
[1] 1765   10

Let us now create unique point identifiers

my_catagliphis_data <- 
  my_catagliphis_data |> 
  mutate(unique_id = paste(dec_long, dec_lat))

Let us now do some basic crosstabulations to get an impression of our data subset

How many unique species?

my_catagliphis_data |>
  distinct(valid_species_name) |> 
  count()
duckplyr: materializing
# A tibble: 1 × 1
      n
  <int>
1    68
## 68 species

Which species has the most records?

my_species_record_count <- 
  my_catagliphis_data |> 
  count(valid_species_name) |>
  arrange(desc(n)) 

my_species_record_count |>
  DT::datatable()
duckplyr: materializing
# the most recorded species seems to be Cataglyphis.albicans

Let us visualize the sampling effort

my_species_record_count |> pull(n) |> hist()
duckplyr: materializing

The figure shows a highly skewed dataset, with a very dominant species.

How many species per country?

my_catagliphis_data |> 
  count(valid_species_name, country) |>
  arrange(desc(n)) |>
  DT::datatable()
duckplyr: materializing

Seems that there is relatively a lot of data for Cataglyphis.velox in Spain.

What is the elevation range of the genus?

my_catagliphis_data |> 
  count(valid_species_name, elevation) |>
  arrange(desc(n))|>
  DT::datatable()
duckplyr: materializing

It seems there is quite a lot or NA elevation records for this species, lets examine the missing data

my_catagliphis_data |> 
  count(is.na(elevation))|>
  DT::datatable()

There are 1121 records with no elevation data. > 50%. What is the distribution of the available elevational records?

my_catagliphis_data |> 
  filter(!is.na(elevation)) |> 
  count(valid_species_name, elevation) |>
  ggplot() +
  geom_col(aes(elevation, n)) + 
  theme_minimal()

duckplyr: materializing

From the observed records, seems that this genus is primarily concentrated in lowlands.

4 Project occurrence points on a world map

# Retrieve world map data as an sf (simple features) object
world <- ne_countries(scale = "medium", returnclass = "sf")

# Plot the world map and overlay the latitude and longitude points
ggplot(data = world) +
  geom_sf(fill = "antiquewhite") +  # Plot the world map
  geom_point(data = my_catagliphis_data, aes(x = dec_long, y = dec_lat), color = "red", size = 1) +  # Plot points
  coord_sf(expand = FALSE) +
  theme_minimal() +
  labs(title = "Global distribution of Cataglyphis ants",
       x = "Longitude", y = "Latitude")

Looks like its an old world genera! Let us restric the view.

# Plot the world map and overlay the latitude and longitude points
ggplot(data = world) +
  geom_sf(fill = "antiquewhite") +  # Plot the world map
  geom_point(data = my_catagliphis_data, aes(x = dec_long, y = dec_lat), color = "red", size = 1) +  # Plot points
  coord_sf(expand = FALSE) +
  theme_minimal() +
  labs(title = "Old-world distribution of Cataglyphis ants",
       x = "Longitude", y = "Latitude") + 
  coord_sf(xlim = c(-30, 180), ylim = c(-35, 70), expand = FALSE) 

Italy seems underrepresented (maybe the genera do not like mountains / or romans)

5 Project a grid system to quantify species in a grid cell

Define a Lambert Azimuthal Equal Area projection centered over Europe.

locations_sf <- st_as_sf(my_catagliphis_data, coords = c("dec_long", "dec_lat"), crs = 4326)


# (You can adjust +lat_0 and +lon_0 to suit your region of interest.)
equal_area_crs <- st_crs("+proj=laea +lat_0=52 +lon_0=10 +datum=WGS84 +units=m +no_defs")

# Transform the world map and locations to the equal-area projection
world_ea    <- st_transform(world, crs = equal_area_crs)
locations_ea <- st_transform(locations_sf, crs = equal_area_crs)

Generate an Equal-Area Grid, adjusting the cellsize to your needs

Note this is a computationally expensive operation, specially if you choose very high resolution cells.

To keep things simple, we will create a relatively large grid for this tutorial. Investigate which is the adequate cellsize for your projects. Note that cellsize will determine the spatial bias.

# 500000 m each side 
grid_squared <- st_make_grid(world_ea, cellsize = c(500000, 500000), square = TRUE)

# alternatively, you can generate hexagons. 
grid_hex <- st_make_grid(world_ea, cellsize = c(500000, 500000), square = FALSE)

# make the grids a sf objects
grid_squared <- st_sf(geometry = grid_squared)
grid_hex <- st_sf(geometry = grid_hex)

Let us now visualize the grid

ggplot() +
  geom_sf(data = grid_squared, fill = NA, color = "blue", size = 0.5) +  # Plot grid
  geom_sf(data = world_ea, fill = "antiquewhite", color = "gray60") +  # Plot world map
  geom_sf(data = locations_ea$geometry, color = "red", size = 1) +  # Plot ant points
  theme_minimal() +
  labs(
    title = "Old World Map with an Equal-Area Grid (EPSG:3035)",
    x = "Easting (m)",
    y = "Northing (m)"
  )

or the hexagonal grid

ggplot() +
  geom_sf(data = grid_hex, fill = NA, color = "blue", size = 0.5) +  # Plot grid
  geom_sf(data = world_ea, fill = "antiquewhite", color = "gray60") +  # Plot world map
  geom_sf(data = locations_ea$geometry, color = "red", size = 1) +  # Plot ant points
  theme_minimal() +
  labs(
    title = "Old World Map with an Equal-Area Grid (EPSG:3035)",
    x = "Easting (m)",
    y = "Northing (m)"
  )

Tip

If you are working with a highly resoluted grid, generate it one time, and hard save the grid object to be loaded in below the pipeline. You will avoid re-running the computation to generate the grid each time if done so. You can save a grid as an RDS object with saveRDS('object_name', 'path')

6 Calculate species richness per grid cell by overlapping polygons

Let us now extract the subset of the grid that matches the occurence points.

grid_intersections <- st_intersects(grid_hex, locations_ea)

# Create a logical vector: TRUE if the grid cell has one or more occurrence points.
has_points <- sapply(grid_intersections, function(x) length(x) > 0)

# Subset the grid cells that have occurrence points.
grid_with_occurrence <- grid_hex[has_points, ]

# Inspect the result
plot(grid_hex)
plot(grid_with_occurrence, col = 'red', add = TRUE)

Let us now joint the datasets to count the species richness per grid

grid_joined <- st_join(grid_with_occurrence, locations_ea, join = st_intersects)

head(grid_joined)
Simple feature collection with 6 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -2713277 ymin: -4432507 xmax: -1963277 ymax: -824067.8
Projected CRS: +proj=laea +lat_0=52 +lon_0=10 +datum=WGS84 +units=m +no_defs
  gabi_acc_number           valid_species_name        country elevation
1   GABI_00831746   Cataglyphis.bicolor pubens Canary Islands        NA
2   GABI_00942833            Cataglyphis.velox          Spain      1340
3   GABI_00108752 Cataglyphis.bicolor sudanica           Mali        NA
4   GABI_00825429         Cataglyphis.albicans        Morocco        NA
5   GABI_00825420         Cataglyphis.albicans        Morocco        NA
6   GABI_00825405         Cataglyphis.albicans        Morocco        NA
   bentity2_name       Genus          species          species_name_no_dot
1 Canary Islands Cataglyphis   bicolor pubens   Cataglyphis bicolor pubens
2          Spain Cataglyphis            velox            Cataglyphis velox
3           Mali Cataglyphis bicolor sudanica Cataglyphis bicolor sudanica
4        Morocco Cataglyphis         albicans         Cataglyphis albicans
5        Morocco Cataglyphis         albicans         Cataglyphis albicans
6        Morocco Cataglyphis         albicans         Cataglyphis albicans
             unique_id                       geometry
1  -15.88623 28.285033 POLYGON ((-2463277 -2267444...
2       -20.867 37.933 POLYGON ((-2463277 -1401418...
3    -8.41667 11.93333 POLYGON ((-2213277 -4432507...
4 -12.824167 27.709722 POLYGON ((-2213277 -2700456...
5 -12.916389 27.938056 POLYGON ((-2213277 -2700456...
6   -12.292778 28.0325 POLYGON ((-2213277 -2700456...

Since now geometry refers to the grid, we can crosstabulate

grid_joined |>
count(valid_species_name, geometry) |>
arrange(desc(n))
Simple feature collection with 375 features and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -2713277 ymin: -4432507 xmax: 6286723 ymax: 4372085
Projected CRS: +proj=laea +lat_0=52 +lon_0=10 +datum=WGS84 +units=m +no_defs
First 10 features:
      valid_species_name   n                       geometry
1      Cataglyphis.velox 104 POLYGON ((-1213277 -1834431...
2  Cataglyphis.hispanica  55 POLYGON ((-1463277 -1401418...
3       Cataglyphis.noda  50 POLYGON ((1536723 -1401418,...
4   Cataglyphis.albicans  46 POLYGON ((-1213277 -1834431...
5     Cataglyphis.humeya  42 POLYGON ((-1213277 -1834431...
6    Cataglyphis.bicolor  38 POLYGON ((-713277.1 -183443...
7       Cataglyphis.noda  38 POLYGON ((1036723 -1401418,...
8    Cataglyphis.viatica  37 POLYGON ((-1213277 -1834431...
9     Cataglyphis.cursor  36 POLYGON ((-463277.1 -140141...
10   Cataglyphis.iberica  32 POLYGON ((-963277.1 -140141...

Seems the species Cataglyphis.velox has the greatest sampling effort

Sp_rich <- 
grid_joined |>
group_by(geometry) |>
summarise(spRich = n_distinct(valid_species_name, na.rm = TRUE)) |> 
arrange(desc(spRich))

Sp_rich |> head()
Simple feature collection with 6 features and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -1713277 ymin: -3133469 xmax: 3786723 ymax: -1257081
Projected CRS: +proj=laea +lat_0=52 +lon_0=10 +datum=WGS84 +units=m +no_defs
# A tibble: 6 × 2
                                                                 geometry spRich
                                                            <POLYGON [m]>  <int>
1 ((-1213277 -1834431, -1463277 -1690093, -1463277 -1401418, -1213277 -1…     17
2 ((-1463277 -2267444, -1713277 -2123106, -1713277 -1834431, -1463277 -1…     14
3 ((3536723 -2267444, 3286723 -2123106, 3286723 -1834431, 3536723 -16900…     14
4 ((-463277.1 -2267444, -713277.1 -2123106, -713277.1 -1834431, -463277.…     13
5 ((36722.94 -2267444, -213277.1 -2123106, -213277.1 -1834431, 36722.94 …     11
6 ((3536723 -3133469, 3286723 -2989131, 3286723 -2700456, 3536723 -25561…     11

It seems the highest richness of ‘Cataglyphis’ ants for a given grid of this size is 17 species.

Let us know visualize the distribution of richness across all grids

Sp_rich |> 
ggplot() + 
geom_histogram(aes(spRich))

It seems the majority of grids, have only few species.

Let us visualize the trend

# Plot the world map geometry and overlay the species richness layer.
ggplot() +
# Plot the world map geometry
geom_sf(data = world_ea, fill = "antiquewhite", color = "gray50") +
# Overlay the species richness data (assumed to be an sf object)
geom_sf(data = Sp_rich,aes(fill = spRich), size = 1, alpha = 0.5) +
scale_fill_gradient(low = "darkblue", high = "firebrick", name = "Species Richness") +
labs(title = "World Map with Species Richness Overlay") +
theme_minimal()

Lighter colors indicate greater richness. Of course there are way prettier ways to plot this map for publication. Try and create a publication ready plot!

7 Matching gridcells to climatic variables

Now we have the gridcells that are of interest to us, we can intersect them with climatic data (or any other environmental data)

# Download WorldClim data using the geodata package
wc_geodata <- geodata::worldclim_global(var = "bio", res = 10, path = getwd())

## I download the data to my working directory. Make sure you know where you are working or specify a desired path.

Note that we need to to this only once, the result of this will be a new directory climate in your working directory or in the specified path

We can inspect the resulting variables

wc_geodata |> names()
 [1] "wc2.1_10m_bio_1"  "wc2.1_10m_bio_2"  "wc2.1_10m_bio_3"  "wc2.1_10m_bio_4" 
 [5] "wc2.1_10m_bio_5"  "wc2.1_10m_bio_6"  "wc2.1_10m_bio_7"  "wc2.1_10m_bio_8" 
 [9] "wc2.1_10m_bio_9"  "wc2.1_10m_bio_10" "wc2.1_10m_bio_11" "wc2.1_10m_bio_12"
[13] "wc2.1_10m_bio_13" "wc2.1_10m_bio_14" "wc2.1_10m_bio_15" "wc2.1_10m_bio_16"
[17] "wc2.1_10m_bio_17" "wc2.1_10m_bio_18" "wc2.1_10m_bio_19"
par(mfrow = c(1,2))
plot(wc_geodata[[1]]) # bio01 (mean annual temperature)
plot(wc_geodata[[12]]) # bio12  (total annual precipitation)

Let us now get the gridded climatic variables

# Convert the sf grid to a terra SpatVector for extraction.
grid_vect <- terra::vect(Sp_rich$geometry)

### 4. Extract Climate Averages per Grid Cell

# Use terra::extract() with a function (fun) that computes the mean value in each grid cell.
# This will compute the mean for each bioclimatic layer across the pixels that fall in each grid cell.
# na.rm = TRUE ensures that missing values are ignored in the averaging.
# note that you can summarize with other functions depending on what it is what you are interested (sd, median, etc). 
climate_avgs <- terra::extract(wc_geodata, grid_vect, fun = mean, na.rm = TRUE)

# The result is a data.frame with an ID column linking to the grid cell and one column per climate variable.
head(climate_avgs)
  ID wc2.1_10m_bio_1 wc2.1_10m_bio_2 wc2.1_10m_bio_3 wc2.1_10m_bio_4
1  1        15.56953        12.15766        40.68768        627.4409
2  2        16.34168        12.69942        40.56823        664.7341
3  3        25.38520        14.35940        39.11337        858.8762
4  4        19.18090        12.70084        36.20170        815.5910
5  5        19.89300        12.13941        38.80515        699.1474
6  6        26.27580        13.99721        50.60714        525.5792
  wc2.1_10m_bio_5 wc2.1_10m_bio_6 wc2.1_10m_bio_7 wc2.1_10m_bio_8
1        32.71224        2.822638        29.88960        10.49078
2        33.59604        2.020723        31.57532        11.70426
3        43.66976        6.953035        36.71673        19.96061
4        38.22764        3.133729        35.09391        14.88502
5        36.48500        5.252325        31.23267        14.55210
6        39.43810       11.553940        27.88416        25.95294
  wc2.1_10m_bio_9 wc2.1_10m_bio_10 wc2.1_10m_bio_11 wc2.1_10m_bio_12
1        23.58425         23.77909         8.484152         496.7821
2        24.49578         24.87396         8.547563         347.5828
3        35.18457         35.18457        14.320150         163.2493
4        29.50690         29.56810         9.543561         168.0685
5        28.42264         28.55130        11.439091         198.6770
6        28.27270         32.33851        19.587259         154.4172
  wc2.1_10m_bio_13 wc2.1_10m_bio_14 wc2.1_10m_bio_15 wc2.1_10m_bio_16
1         75.94200         4.325132         56.21411        207.86819
2         53.51599         2.681502         56.36299        143.84284
3         40.77195         0.000000         94.79718         87.79037
4         21.49671         3.728590         39.08538         57.55599
5         29.88686         2.076642         55.96664         77.52007
6         36.36898         3.902108         81.05704         80.73042
  wc2.1_10m_bio_17 wc2.1_10m_bio_18 wc2.1_10m_bio_19
1        28.485062        32.520211        191.60105
2        17.324061        18.139082        133.29068
3         1.035411         1.035411         58.34703
4        19.060606        19.102767         48.52437
5        13.427007        19.439781         68.00547
6        17.920181        24.694277         28.57681

We can bind back the climatic data to keep it all tidy

grid_with_climate <- cbind(Sp_rich, climate_avgs[,-1])

grid_with_climate |> head()
Simple feature collection with 6 features and 20 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -1713277 ymin: -3133469 xmax: 3786723 ymax: -1257081
Projected CRS: +proj=laea +lat_0=52 +lon_0=10 +datum=WGS84 +units=m +no_defs
  spRich wc2.1_10m_bio_1 wc2.1_10m_bio_2 wc2.1_10m_bio_3 wc2.1_10m_bio_4
1     17        15.56953        12.15766        40.68768        627.4409
2     14        16.34168        12.69942        40.56823        664.7341
3     14        25.38520        14.35940        39.11337        858.8762
4     13        19.18090        12.70084        36.20170        815.5910
5     11        19.89300        12.13941        38.80515        699.1474
6     11        26.27580        13.99721        50.60714        525.5792
  wc2.1_10m_bio_5 wc2.1_10m_bio_6 wc2.1_10m_bio_7 wc2.1_10m_bio_8
1        32.71224        2.822638        29.88960        10.49078
2        33.59604        2.020723        31.57532        11.70426
3        43.66976        6.953035        36.71673        19.96061
4        38.22764        3.133729        35.09391        14.88502
5        36.48500        5.252325        31.23267        14.55210
6        39.43810       11.553940        27.88416        25.95294
  wc2.1_10m_bio_9 wc2.1_10m_bio_10 wc2.1_10m_bio_11 wc2.1_10m_bio_12
1        23.58425         23.77909         8.484152         496.7821
2        24.49578         24.87396         8.547563         347.5828
3        35.18457         35.18457        14.320150         163.2493
4        29.50690         29.56810         9.543561         168.0685
5        28.42264         28.55130        11.439091         198.6770
6        28.27270         32.33851        19.587259         154.4172
  wc2.1_10m_bio_13 wc2.1_10m_bio_14 wc2.1_10m_bio_15 wc2.1_10m_bio_16
1         75.94200         4.325132         56.21411        207.86819
2         53.51599         2.681502         56.36299        143.84284
3         40.77195         0.000000         94.79718         87.79037
4         21.49671         3.728590         39.08538         57.55599
5         29.88686         2.076642         55.96664         77.52007
6         36.36898         3.902108         81.05704         80.73042
  wc2.1_10m_bio_17 wc2.1_10m_bio_18 wc2.1_10m_bio_19
1        28.485062        32.520211        191.60105
2        17.324061        18.139082        133.29068
3         1.035411         1.035411         58.34703
4        19.060606        19.102767         48.52437
5        13.427007        19.439781         68.00547
6        17.920181        24.694277         28.57681
                        geometry
1 POLYGON ((-1213277 -1834431...
2 POLYGON ((-1463277 -2267444...
3 POLYGON ((3536723 -2267444,...
4 POLYGON ((-463277.1 -226744...
5 POLYGON ((36722.94 -2267444...
6 POLYGON ((3536723 -3133469,...

8 Run basic GLM and GLMM relating variation in richness to climate

The data is ready now to fit a model, lets fit a simple linear model.

We will evaluate how species richness is explained with the variation in temperature and precipitation

\[ S \sim MAT + TAP + \epsilon \]

my_simple_lm <- lm(spRich ~ wc2.1_10m_bio_1  + wc2.1_10m_bio_12, data = grid_with_climate)

Let’s visualize the results of the fitted model

my_simple_lm |> summary()

Call:
lm(formula = spRich ~ wc2.1_10m_bio_1 + wc2.1_10m_bio_12, data = grid_with_climate)

Residuals:
   Min     1Q Median     3Q    Max 
-3.871 -2.283 -1.060  1.044 13.693 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)       2.491042   0.897204   2.776  0.00653 **
wc2.1_10m_bio_1   0.087882   0.041797   2.103  0.03794 * 
wc2.1_10m_bio_12 -0.001111   0.001002  -1.109  0.26997   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.455 on 103 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.06015,   Adjusted R-squared:  0.0419 
F-statistic: 3.296 on 2 and 103 DF,  p-value: 0.04098

It looks like temperature can explain the patterns in richness, but precipitation not quite. Also the R squared is low, meaning the model has low predictive power. You can investigate a better fit by transforming variables (e.g. square-root transformation).

We can also use the glm() function that fits a generalized linear model. In here, we have more control of the model family and other hyperparameters.

# Fit a generalized linear model with Gaussian family,
# which is analogous to a linear model.
my_simple_glm <- glm(
spRich ~ wc2.1_10m_bio_1 + wc2.1_10m_bio_12,
data = grid_with_climate,
family = gaussian(),
control = glm.control(epsilon = 1e-8, maxit = 50, trace = FALSE)
)

summary(my_simple_glm)

Call:
glm(formula = spRich ~ wc2.1_10m_bio_1 + wc2.1_10m_bio_12, family = gaussian(), 
    data = grid_with_climate, control = glm.control(epsilon = 1e-08, 
        maxit = 50, trace = FALSE))

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)       2.491042   0.897204   2.776  0.00653 **
wc2.1_10m_bio_1   0.087882   0.041797   2.103  0.03794 * 
wc2.1_10m_bio_12 -0.001111   0.001002  -1.109  0.26997   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 11.939)

    Null deviance: 1308.4  on 105  degrees of freedom
Residual deviance: 1229.7  on 103  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 568.63

Number of Fisher Scoring iterations: 2

However, the results are comparable.

The marginaleffects package makes it easy to plot results

library(marginaleffects)


marginaleffects::plot_predictions(my_simple_glm, condition = 'wc2.1_10m_bio_1')

marginaleffects::plot_predictions(my_simple_glm, condition = 'wc2.1_10m_bio_12')

The resulting objects are ggplot objects, so they can be styled

partial_plot_temperature <- marginaleffects::plot_predictions(my_simple_glm, condition = 'wc2.1_10m_bio_1')


partial_plot_temperature +
theme_minimal() + 
xlab('Mean annual temperature') + 
ylab('Species richness of Cataglyphis')

9 Run a phylogenetic regression using occurence data an an estimate of polymorphism

9.1 Denpendencies

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ggtree")
# Load dependencies

library(ape)
library(ggtree)

To run a phylogenetic regression, we need a phylogenetic tree. We will use the tree from:

Nelsen, Matthew P.1; Ree, Richard H.1; Moreau, Corrie S.1

Ant-plant interactions evolved through increasing interdependence

link_to_data

The tree is in the file ML_TREE_treepl_185_outsdroppeds.ladderized.increasing.tre

# Load the phylogenetic tree
ant_tree <- read.tree("ML_TREE_treepl_185_outsdroppeds.ladderized.increasing.tre")

Let us now plot the tree

# Create a circular phylogenetic tree plot
phylo_ant_plot <- ggtree(ant_tree, layout="circular") +
  theme_minimal() +  # Clean theme
  theme(
    text = element_text(size=14, family="Arial"), # Adjust font
    plot.margin = margin(10, 10, 10, 10), # Add margin for clarity
    legend.position = "right"
  )
phylo_ant_plot

Now we need to match the species in the tree to the species in the occurrence data

## get the names of the Catagliphys species in the phylogenetic tree
phylo_ant_plot$data$label %in% c(my_catagliphis_data$species_name_no_dot |> str_replace(" ", "_") |> unique()) |> table()

FALSE  TRUE 
 3458     3 
## how many are catagliphys? 

str_detect(phylo_ant_plot$data$label, 'Cataglyphis') |> table()

FALSE  TRUE 
 1728     3 

Not enough data for phylogenetic analysis.

Reuse

Citation

BibTeX citation:
@online{munoz-acevedo2025,
  author = {Munoz-Acevedo, Gabriel},
  title = {Mapping {Species} {Richness:} {Integrating} {Occurrence}
    {Data,} {Climatic} {Variables,} and {Phylogenetic} {Insights} in a
    {Global} {Grid} {Analysis} - {A} Minimal Example},
  volume = {1},
  number = {1},
  date = {2025-03-09},
  doi = {xxxxx/xxxx},
  langid = {en},
  abstract = {In this notebook, there is a minimal tutorial to a spatial
    analysis pipeline that maps species occurrence points to WorldClim
    variables and overlays them on a global grid to quantify species
    richness per cell. There is also examples about using GLM, GLMM, and
    phylogenetic regression, to examine how climate variation to
    estimate an ordinal measure of ant polymorphism.}
}
For attribution, please cite this work as:
Munoz-Acevedo, Gabriel. 2025. “Mapping Species Richness: Integrating Occurrence Data, Climatic Variables, and Phylogenetic Insights in a Global Grid Analysis - A Minimal Example.” Community Assembly and Biogeography Lab: R Tutorial Series. March 9, 2025. https://doi.org/xxxxx/xxxx.