BIRDIE DST: Data preparation
v08-birdie-dst-data-prep.Rmd
Introduction
The species distributions module (DST) of the BIRDIE pipeline has four main steps: data preparation, model fitting, model diagnostics and model summary. See the BIRDIE: basics and BIRDIE: species distributions vignettes for general details about BIRDIE and about the DST module, respectively. In this vignette, we will go through the different tasks that are performed during the first step of the DST module: data preparation.
The main function used for data preparation is ppl_create_site_visit(). This is a ppl_ function, and therefore it doesn’t do much processing itself (see BIRDIE: basics if this is confusing), but it does call the right functions to do the work.
Its name refers to site and visit because it creates two separate datasets from ABAP data: one with an entry for each site (pentad) visited, and one with an entry for each visit conducted (pentads can be visited more than once). In fact, the function also creates a third dataset: for any particular year, sites and visits are common to all species, so a separate dataset is needed to hold the detection information for each species.
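To make the three datasets concrete, here is a minimal toy sketch in base R. The column names, pentad codes and values are made up for illustration; they are not the actual output of ppl_create_site_visit().

```r
# Toy illustration only: names and values are hypothetical.

# One row per pentad (site) visited
sites <- data.frame(
  pentad = c("3355_1825", "3400_1830"),
  elev   = c(120, 45)
)

# One row per visit; a pentad can appear more than once
visits <- data.frame(
  pentad      = c("3355_1825", "3355_1825", "3400_1830"),
  visit_id    = 1:3,
  total_hours = c(2, 5, 3)
)

# One row per visit for a given species: detected (1) or not (0)
detections <- data.frame(
  visit_id = 1:3,
  sp_code  = 6,
  detected = c(0, 1, 1)
)

# Sites and visits are shared by all species; only detections change
visit_det <- merge(visits, detections, by = "visit_id")
n_visits_per_site <- table(visits$pentad)
```

Keeping detections separate means the site and visit tables only need to be built once per year, whatever the number of species.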
Data preparation has three main tasks:
- Download ABAP data
- Annotate site and visit data with environmental covariates
- Format site, visit and detection data
Running ppl_create_site_visit() performs all of these tasks and prepares our data. However, to understand what goes on “under the hood”, we will explain each of these tasks, and how they are conducted, below.
Note that data preparation can be time-consuming and it is more efficient to prepare data for multiple years at once. Therefore, while occupancy models are run for one year at a time, data preparation is done for several years at once. The number of years prepared is given by the dur argument of the configPipeline() function. This means that we don’t need to prepare data every time the pipeline runs. For example, the piece of code below would configure the pipeline to prepare data for the years 2008, 2009 and 2010, because we set year = 2010 and dur = 3.
config <- configPipeline(year = 2010,
                         dur = 3,
                         occ_mod = c("log_dist_coast", "elev"),
                         det_mod = c("log_hours"),
                         fixed_vars = c("dist_coast", "elev"),
                         package = "spOccupancy",
                         data_dir = "analysis/data",
                         out_dir = "analysis/output",
                         server = FALSE)
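The relationship between year and dur in this example can be sketched in base R. This is only an illustration inferred from the example above (the last dur years ending in year); how configPipeline() computes the years internally is not shown here, and the variable names are hypothetical.

```r
year <- 2010
dur  <- 3

# Assumed rule: the last `dur` years, ending in `year`
years_prepared <- seq(year - dur + 1, year)

# Any of these years can later be selected for model fitting
can_fit_2008 <- 2008 %in% years_prepared
```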
Then, the main pipeline function ppl_run_pipe_dst1() gives us the option to select which year we want to run models for with the argument year, and we may skip data preparation by setting force_gee_dwld = FALSE and force_site_visit = FALSE, like so:
ppl_run_pipe_dst1(sp_code = sp_code,
                  year = 2008, # this is the year models will run for
                  config = config,
                  steps = c("data"),
                  force_gee_dwld = FALSE,   # skip GEE annotation
                  monitor_gee = TRUE,
                  force_site_visit = FALSE, # skip re-creating site and visit data
                  force_abap_dwld = FALSE,
                  spatial = FALSE,
                  print_fitting = TRUE)
See more explanations about what gee and site_visit mean below.
Download ABAP data
This process is facilitated by the ABAP R package.
There isn’t much to say that is not explained on the GitHub repository of the ABAP package (check it out).
We will need to call the functions ABAP::getAbapData() and ABAP::getRegionPentads() several times during the data preparation process.
Annotate with environmental covariates
We use environmental covariates to model occupancy and detection probabilities, which requires detection/non-detection data to be annotated with this information. To facilitate the automation of this process and periodic updates when new data becomes available, we use the datasets and functionality offered by Google Earth Engine (GEE).
The functionality to connect and transfer data to and from GEE is provided by the ABDtools R package. This package essentially wraps functions from rgee, another package on which it depends heavily. Therefore, rgee must be properly installed and configured before data-annotation tasks can be performed. Check the GitHub repos for rgee and ABDtools.
Once these two packages are installed and configured, we can use their functionality in the pipeline. There are two functions in BIRDIE that are used to annotate ABAP data: prepGEESiteData() and prepGEEVisitData(), which annotate site and visit data, respectively. See ?prepGEESiteData() and ?prepGEEVisitData() for details. All GEE-related functions are kept in the file R/utils-gee.R.
Annotating with different variables using GEE requires different procedures. Therefore, there is no way to flexibly communicate to these functions which variables we want to annotate our data with. Instead, we have hard-coded the variables we use for the BIRDIE pipeline. If we wanted to change these variables, we would have to modify the prepGEESiteData() and prepGEEVisitData() functions. This is not ideal, but it is how things are set up currently. We also have to keep this in mind when we pass the covariates we want to use in the models to the configPipeline() function in the control script (see BIRDIE-spp-distributions). These covariates must be among those provided by prepGEESiteData() and prepGEEVisitData(), and must have the same names.
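Because covariate names must match exactly, it can help to check them before running the pipeline. The sketch below uses a hypothetical helper, missing_covariates(), and made-up column names; neither is part of BIRDIE.

```r
# Hypothetical helper: returns requested covariates absent from the data
missing_covariates <- function(requested, available) {
  setdiff(requested, available)
}

# Hypothetical column names standing in for annotated site data
site_cols <- c("pentad", "dist_coast", "elev")

# All requested covariates are available: nothing is returned
ok <- missing_covariates(c("dist_coast", "elev"), site_cols)

# A misspelt name is reported as missing
bad <- missing_covariates(c("dist_coast", "elevation"), site_cols)
```

A non-empty result points to a name that the annotation functions do not provide, either because of a typo or because the variable is not among those that are hard-coded.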
Another important thing to keep in mind is that some of the environmental layers used don’t have information past a certain date. We have set up the functions so that data collected after the last date of a layer are annotated with the latest available information (the last date of the layer). Whenever the pipeline is run, it is advisable to review the environmental layers used and the last date for which each has information, and to update the functions if necessary.
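The rule of falling back to the last available date of a layer can be illustrated with a small base R sketch. The helper layer_year_for() and the year values are hypothetical; the actual logic lives inside the GEE annotation functions and may differ in detail.

```r
# Hypothetical helper: choose the layer year used to annotate data from `data_year`.
# Years past the end of the layer fall back to the last year available.
layer_year_for <- function(data_year, last_layer_year) {
  pmin(data_year, last_layer_year)
}

# A layer that ends in 2010, used to annotate data from 2008 to 2012
layer_years <- layer_year_for(2008:2012, last_layer_year = 2010)
```

Here the data from 2011 and 2012 would be annotated with the 2010 values of the layer.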
This now brings us to the next part of the data preparation, which is the formatting of site, visit and detection data.
Format site, visit and detection data
In the last step of data preparation, we need to take the data coming out of GEE and reformat them for occupancy modelling. Note that at this stage the data are not formatted for any particular package; they just provide a good starting point for occupancy modelling. The function we use for this is createOccuData().
Here, we create site and visit data frames that have the covariates specified in configPipeline(). Data coming from GEE will be in a wide format, meaning that each combination of variable and year is in a separate column. In general, we would like variables to be in one column and years in another. This is one of the important tasks createOccuData() does for us. We also use this function to create transformations of the variables coming from GEE (e.g. we sometimes use log transformations) and interactions.
createOccuData() is also hard-coded, so if we decide to use a new variable transformation, we need to modify this function to create it explicitly. Interactions should be handled correctly as long as the variables involved in the interaction are present in the data.
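The kind of reformatting described in this section can be sketched with base R on toy data. This is not the code inside createOccuData(); the column names (ndvi_2008, elev, etc.) are hypothetical, and stats::reshape() and model.matrix() are used here only to illustrate the wide-to-long step, an explicit transformation and an interaction.

```r
# Toy wide-format data: one column per variable and year (names are hypothetical)
wide <- data.frame(
  pentad    = c("A", "B"),
  ndvi_2008 = c(0.2, 0.4),
  ndvi_2009 = c(0.3, 0.5),
  elev      = c(120, 45)
)

# Reshape so that the variable is in one column and the year in another
long <- reshape(
  wide,
  direction = "long",
  varying   = c("ndvi_2008", "ndvi_2009"),
  v.names   = "ndvi",
  timevar   = "year",
  times     = c(2008, 2009),
  idvar     = "pentad"
)
rownames(long) <- NULL

# Transformations are created explicitly as new columns
long$log_elev <- log(long$elev)

# An interaction can be built as long as both variables are in the data
X <- model.matrix(~ ndvi * log_elev, data = long)
```

After reshaping, each pentad has one row per year, which is the layout generally needed to select the data for a single model year.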