Clean and Flag Fossil Occurrence Records Based on Multiple Criteria
This function processes and standardizes fossil occurrence records. It calculates a representative age for each record (midpoint, upper, or lower bound), computes age uncertainties, flags records with large age ranges, optionally computes species temporal ranges (origination and extinction times), and flags subspecies-level identifications. Together these steps form a data cleaning pipeline for paleobiological analyses.
Usage
clean_occ_fossil(
df.occ.fossil,
method.ages = c("midpoint", "upper", "lower"),
thresh.age.range = 10,
species = "species",
Max.age = "Maximum_Age",
Min.age = "Minimum_Age",
remove.sub.species = TRUE,
comp.TS.TE = TRUE,
lat = NULL,
lng = NULL,
site = NULL,
group = NULL,
trait = NULL
)
Arguments
- df.occ.fossil
A data frame containing fossil occurrence records with at least three columns: species names, maximum (oldest) age estimates, and minimum (youngest) age estimates. Additional columns may include spatial coordinates, site information, group assignments, and trait values.
- method.ages
Character. The method used to estimate representative ages from occurrence records. Options include:
"midpoint" (default): Calculate the midpoint between max and min ages
"upper": Use the maximum (oldest) age
"lower": Use the minimum (youngest) age
Note: The function currently always calculates the midpoint, regardless of the value passed to this argument.
- thresh.age.range
Numeric. The threshold for flagging occurrence records with large age uncertainties (in millions of years or appropriate time units). Records with age ranges (Max.age - Min.age) greater than or equal to this value are flagged as flag.age.range = "TRUE". Default is 10.
- species
Character. The name of the column in df.occ.fossil containing species identifiers. Default is "species".
- Max.age
Character. The name of the column in df.occ.fossil containing the maximum (oldest) age estimate for each occurrence record. Default is "Maximum_Age".
- Min.age
Character. The name of the column in df.occ.fossil containing the minimum (youngest) age estimate for each occurrence record. Default is "Minimum_Age".
- remove.sub.species
Logical. Should subspecies-level identifications be flagged? Default is TRUE. When TRUE, adds a subspecies column identifying records with three or more words in the species name (e.g., "Genus species subspecies").
- comp.TS.TE
Logical. Should species-level temporal ranges (origination and extinction times) be computed? Default is TRUE. When TRUE, calculates TS (maximum of Max.age) and TE (minimum of Min.age) for each species.
- lat
Character. The name of the column containing latitude coordinates. Default is NULL. If provided, latitude information is retained in output.
- lng
Character. The name of the column containing longitude coordinates. Default is NULL. If provided, longitude information is retained in output.
- site
Character. The name of the column containing site location identifiers. Default is NULL. If provided, site information is retained in output.
- group
Character. The name of the column containing group assignments for species (e.g., clade, family). Default is NULL. If provided, group information is retained in output.
- trait
Character. The name of the column containing trait values for species. Default is NULL. If provided, trait information is retained in output.
Value
A data frame containing all original columns plus the following added columns:
- midpoint
Numeric. The midpoint age calculated as (Max.age + Min.age) / 2.
- age.range
Numeric. The age uncertainty calculated as Max.age - Min.age.
- flag.age.range
Character. "TRUE" if age.range >= thresh.age.range, "FALSE" otherwise. Flags occurrences with potentially problematic age uncertainties.
- TS
Numeric (optional, if comp.TS.TE = TRUE). Species origination time, calculated as the maximum Max.age across all occurrences of that species.
- TE
Numeric (optional, if comp.TS.TE = TRUE). Species extinction time, calculated as the minimum Min.age across all occurrences of that species.
- subspecies
Character (optional, if remove.sub.species = TRUE). "subspecies" if the species name has three or more words, "species" otherwise.
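The added columns can be reproduced by hand on a toy data frame. The following is a minimal base R sketch of the formulas listed above, not the function's actual implementation; the data, the object names occ and thresh, and the use of ave() for the per-species summaries are assumptions made for illustration.

```r
# Hypothetical input; column names follow the documented defaults.
occ <- data.frame(
  species     = c("Genus species", "Genus species", "Genus other"),
  Maximum_Age = c(100, 95, 90),
  Minimum_Age = c(85, 80, 88)
)
thresh <- 10  # plays the role of thresh.age.range

# Record-level columns.
occ$midpoint  <- (occ$Maximum_Age + occ$Minimum_Age) / 2
occ$age.range <- occ$Maximum_Age - occ$Minimum_Age

# The flag is documented as character, so the logical result is converted.
occ$flag.age.range <- as.character(occ$age.range >= thresh)

# Species-level ranges: oldest Max.age (TS) and youngest Min.age (TE),
# repeated on every record of the species.
occ$TS <- ave(occ$Maximum_Age, occ$species, FUN = max)
occ$TE <- ave(occ$Minimum_Age, occ$species, FUN = min)

occ
```

Note that a record whose age range equals the threshold exactly is flagged, because the comparison is >= rather than >.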
Details
The function performs the following data cleaning and flagging steps:
Subsets and standardizes column names
Converts Max.age and Min.age to numeric (non-numeric values become NA)
Calculates midpoint ages: (Max.age + Min.age) / 2
Calculates age uncertainties: Max.age - Min.age
Flags records with age.range >= thresh.age.range
Optionally computes species temporal ranges (TS and TE)
Optionally identifies subspecies-level records
Removes records with NA values in TS or TE, with a warning (when comp.TS.TE = TRUE)
Age uncertainty considerations:
Purpose: Large age ranges indicate temporal imprecision
Threshold: User-defined via thresh.age.range
Action: Records are flagged but not removed, allowing users to decide how to handle them in downstream analyses
Typical thresholds: 5-10 Ma for Cenozoic studies, 10-20 Ma for Mesozoic studies
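Before settling on a threshold, it can help to see how many records each candidate value would flag. The sketch below uses made-up age ranges and plain base R; the vector names are hypothetical and nothing here calls clean_occ_fossil() itself.

```r
# Hypothetical age ranges (Max.age - Min.age), in Ma.
age.range <- c(2, 4, 5, 8, 10, 15, 22)

# Candidate thresholds spanning the typical values mentioned above.
thresholds <- c(5, 10, 20)

# Number of records with age.range >= threshold, one entry per threshold.
n_flagged <- vapply(
  thresholds,
  function(th) sum(age.range >= th),
  integer(1)
)
names(n_flagged) <- thresholds
n_flagged
```

With these values, a threshold of 5 flags five records, 10 flags three, and 20 flags one.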
Subspecies identification:
Uses word count in species names (binomial = 2 words, trinomial >= 3 words)
May require manual verification as some valid species names have multiple words (e.g., "Genus species complex")
Useful for standardizing taxonomic resolution in analyses
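The word-count rule can be illustrated in a few lines of base R. This is a sketch of the rule as described, not the function's code; trimming and splitting on runs of whitespace is an assumption added here so that stray spaces do not inflate the count, and the names taxa, n_words, and rank are hypothetical.

```r
# Hypothetical names, including an informal "complex" name and stray spaces.
taxa <- c(
  "Genus species",
  "Genus species subspecies",
  "Genus species complex",
  "Genus  other "
)

# Count words after trimming leading/trailing whitespace.
n_words <- lengths(strsplit(trimws(taxa), "\\s+"))

# Two words are treated as a binomial, three or more as a trinomial.
rank <- ifelse(n_words >= 3, "subspecies", "species")

data.frame(taxa, n_words, rank)
```

The third name shows why manual verification may be needed: "Genus species complex" has three words and is labeled "subspecies" by this rule even though it is not a trinomial.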
Missing value handling:
Non-numeric age values are converted to NA
Records with NA in TS or TE are removed (when comp.TS.TE = TRUE)
A warning is issued when NAs are found and removed
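The missing-value behavior described above can be approximated as follows. This is a hedged sketch: drop_unresolved() is a hypothetical helper, the warning text is invented, and whether the real function computes TS and TE without na.rm is an assumption. In the toy data each affected species has a single record, so the result does not depend on that choice.

```r
# Hypothetical records; "unknown" and "" stand in for unparseable age entries.
occ <- data.frame(
  species     = c("sp1", "sp1", "sp2", "sp3"),
  Maximum_Age = c("100", "95", "unknown", "90"),
  Minimum_Age = c("85", "80", "70", ""),
  stringsAsFactors = FALSE
)

drop_unresolved <- function(df) {  # hypothetical helper name
  # as.numeric() yields NA for entries it cannot parse; its own coercion
  # warning is silenced because missing values are reported explicitly below.
  df$Maximum_Age <- suppressWarnings(as.numeric(df$Maximum_Age))
  df$Minimum_Age <- suppressWarnings(as.numeric(df$Minimum_Age))

  # Per-species ranges; max() and min() return NA if any input is NA.
  df$TS <- ave(df$Maximum_Age, df$species, FUN = max)
  df$TE <- ave(df$Minimum_Age, df$species, FUN = min)

  unresolved <- is.na(df$TS) | is.na(df$TE)
  if (any(unresolved)) {
    warning(sum(unresolved), " record(s) with NA in TS or TE were removed",
            call. = FALSE)
  }
  df[!unresolved, , drop = FALSE]
}

cleaned <- drop_unresolved(occ)
cleaned
```

Checking the input for non-numeric age entries before cleaning avoids losing records silently to a warning that is easy to overlook in long scripts.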
Examples
if (FALSE) { # \dontrun{
# Create example fossil occurrence data
df_fossils <- data.frame(
  species = c("Genus species", "Genus species subspecies", "Genus other"),
  Maximum_Age = c(100, 95, 90),
  Minimum_Age = c(85, 80, 88),
  lat = c(10, 15, 20),
  lng = c(-50, -55, -60),
  site = c("A", "B", "C"),
  group = c("G1", "G1", "G2"),
  trait = c(1.2, 2.5, 3.1)
)

# Basic cleaning with default settings
cleaned_data <- clean_occ_fossil(df.occ.fossil = df_fossils)
head(cleaned_data)

# View flagged records with large age uncertainties
flagged <- subset(cleaned_data, flag.age.range == "TRUE")
flagged

# Identify subspecies-level records
subspecies_records <- subset(cleaned_data, subspecies == "subspecies")
subspecies_records

# Use a more stringent age uncertainty threshold
cleaned_strict <- clean_occ_fossil(
  df.occ.fossil = df_fossils,
  thresh.age.range = 5
)

# Clean without computing TS and TE
cleaned_no_ranges <- clean_occ_fossil(
  df.occ.fossil = df_fossils,
  comp.TS.TE = FALSE
)

# Clean without flagging subspecies
cleaned_all_taxa <- clean_occ_fossil(
  df.occ.fossil = df_fossils,
  remove.sub.species = FALSE
)

# Use custom column names
df_custom <- data.frame(
  taxon = c("sp1", "sp2", "sp3"),
  oldest = c(100, 95, 90),
  youngest = c(85, 80, 75)
)
cleaned_custom <- clean_occ_fossil(
  df.occ.fossil = df_custom,
  species = "taxon",
  Max.age = "oldest",
  Min.age = "youngest"
)

# Filter out flagged records for downstream analysis
library(dplyr)
reliable_records <- cleaned_data %>%
  filter(flag.age.range == "FALSE")

# Summary of data quality
quality_summary <- cleaned_data %>%
  group_by(flag.age.range) %>%
  summarise(
    n_records = n(),
    mean_age_range = mean(age.range),
    n_species = n_distinct(species)
  )
quality_summary
} # }
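Because the note on method.ages says the function currently always computes the midpoint, a user who needs the upper or lower bound as the representative age could derive it directly from the age columns. The helper below is a hypothetical sketch (representative_age() is not part of the package) that mirrors the three documented options.

```r
# Hypothetical helper; not part of the package.
representative_age <- function(max.age, min.age,
                               method = c("midpoint", "upper", "lower")) {
  # match.arg() picks the first option by default and rejects unknown values.
  method <- match.arg(method)
  switch(method,
    midpoint = (max.age + min.age) / 2,
    upper    = max.age,
    lower    = min.age
  )
}

max.age <- c(100, 95, 90)
min.age <- c(85, 80, 88)

representative_age(max.age, min.age)            # midpoint (default)
representative_age(max.age, min.age, "upper")   # oldest bound
representative_age(max.age, min.age, "lower")   # youngest bound
```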