Data synthesis
Combination of heterogeneous data in baseline analyses
Statistical approaches to modeling biodiversity change are often hampered by limitations inherent in biodiversity data, especially when historical and contemporary data are combined from multiple sources.
- Data are commonly unstructured, or heterogeneous, lacking consistent attributes or metadata.
- Data are often biased toward positive detections (lacking absence information).
- Data may be collected in ways that violate the assumptions of statistical models.
- Georeferencing is often generalized based on vague locality information.
- Data may be sparse, with few or perhaps only singleton records to vouch for historical populations, especially at local scales.
Baseline analyses offer a practical approach
A practical approach to this difficult situation is to forgo advanced modeling at the outset and begin instead with a simple baseline analysis that compares historical and contemporary sources of biodiversity data. This approach generally depends on normalizing species occurrence data across multiple datasets, so that historical and contemporary species records can be compared directly.
Normalization of biodiversity data is a thorny problem in its own right and is not treated here. For this tutorial we assume that names are aligned and that any ambiguity (synonymy, orthographic variants, etc.) has been resolved between the datasets being compared.
- Best practice is to adhere to Darwin Core standards for species occurrence data.
- Different communities may prefer different taxonomic databases, but any standard is acceptable as long as it enables consistent mapping of taxa across datasets.
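Once names are aligned, the simplest form of a baseline analysis can be sketched as a set comparison of the taxa recorded in each period. The example below is a minimal illustration, not a prescribed method: the records, the placeholder taxon names, and the `baseline_comparison` function are all hypothetical, and the only assumption about the input is that each record carries the Darwin Core term `scientificName`.

```python
from collections import Counter

# Hypothetical, already-normalized occurrence records (placeholder taxa).
# Field names follow Darwin Core terms (scientificName, year).
historical = [
    {"scientificName": "Taxon A", "year": 1952},
    {"scientificName": "Taxon B", "year": 1961},
    {"scientificName": "Taxon A", "year": 1970},
]
contemporary = [
    {"scientificName": "Taxon B", "year": 2019},
    {"scientificName": "Taxon C", "year": 2021},
]


def baseline_comparison(historical, contemporary, name_field="scientificName"):
    """Compare the taxa recorded in two sets of occurrence records.

    Illustrative helper only; it assumes names are already aligned
    between the two datasets.
    """
    hist_counts = Counter(record[name_field] for record in historical)
    cont_counts = Counter(record[name_field] for record in contemporary)
    hist_taxa, cont_taxa = set(hist_counts), set(cont_counts)
    return {
        # Recorded historically but not in the contemporary data.
        "historical_only": sorted(hist_taxa - cont_taxa),
        # Recorded in the contemporary data but not historically.
        "contemporary_only": sorted(cont_taxa - hist_taxa),
        # Recorded in both periods.
        "shared": sorted(hist_taxa & cont_taxa),
        # Historical taxa vouched for by a single record.
        "historical_singletons": sorted(
            name for name, count in hist_counts.items() if count == 1
        ),
    }


result = baseline_comparison(historical, contemporary)
for category, taxa in result.items():
    print(f"{category}: {taxa}")
```

Because occurrence data are often presence-only, a taxon listed under `historical_only` has not been detected in the contemporary records; that alone does not show that it has been lost. Flagging singleton records separately makes it easier to see which historical taxa rest on the thinnest evidence.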