TempEst2¶
TempEst 2/SCHEMA is “TEMPerature ESTimation, version 2, using Seasonal Conditions Historical Estimation with Modeled daily Anomaly”. It estimates stream water temperature at a point, for streams of any size, using a data-driven model based on satellite remote sensing data.
The GitHub repository
contains the model source code (model.R), Google Earth Engine data
retrieval script (eeretrieval.py) and example data collection points
(datapts.py), and R Notebooks demonstrating model usage
(demo.Rmd) and a large set of model performance tests
(validation.Rmd), used to make the performance claims reproducible.
In addition, the HydroShare data (see below) includes knitted output of
both Notebooks (demo.pdf, validation.pdf) and the full
training/testing dataset (AllData.csv, ExtData.csv including
maximum temperatures).
In addition to the GEE data retrieval script, there is a utility script
in Python, points_above.py, for generating lists of coordinates in
the correct format, evenly-spaced above given target site(s) - this is
useful for quickly generating study points over rivers of interest.
Quick Guide¶
Dependencies in R: tidyverse, fields. The validation notebook
also requires terra, sf,ggspatial, andhydroGOF`.
Additional dependencies for Google Earth Engine data retrieval: Python
with the GEE API configured.
Note that it is advisable to run at least 5 years and preferably 10, even if you are interested in a shorter period, so that TempEst 2 has adequate data to characterize the point of interest (see period-of-record tests in the paper).
Using a Trained Model¶
A pre-trained model is available in the GitHub
Releases as
model.rda. load("model.rda") will load a function called
model. The argument to model is a data frame with columns:
id (site ID; character)
lon (longitude; decimal degrees East)
lat (latitude; decimal degrees North)
day (Julian day; integer)
date (as an R Date)
elevation (mean in a 500-m radius about point of interest; meters)
water, shrubland, grassland, barren (land cover, abundance in a 500-m radius about point of interest; unitless, m2/m2)
lst (land surface temperature, daily daytime mean in a 500-m radius about point of interest; Celsius)
humidity (specific humidity, daily mean in a 500-m radius about point of interest; unitless, kg/kg)
Using this data frame as the argument to model, the function will
return the same data frame with three or five new columns:
temp.mod (modeled daily mean temperature; Celsius)
temp.doy (modeled day-of-year mean temperature; Celsius)
temp.anom (modeled temperature anomaly relative to day-of-year mean; Celsius)
temp.plus (modeled maximum temperature relative to day-of-year mean; Celsius)
temp.max (modeled maximum daily temperature; Celsius)
temp.mod = temp.doy + temp.anom.
These three (five) columns are the final output, and can be used to
assess actual estimated temperature (temp.mod); typical seasonal
conditions (temp.doy), such as to assess general long-term behavior;
and departure from typical seasonal conditions (temp.anom), such as
to assess heat waves or other extremes.
Full Steps (Pre-Trained Model)¶
Install
tidyverseandfields.Import the model code (
source("model.R")).Prepare prediction data (see below for default setup). If you prepare it yourself, the data frame must have the columns
id,lon,lat,day,date,elevation,water,shrubland,grassland,barren,lst,humidity(see above for units and meaning).load()the file containing the pre-trained model, e.g.,load("model.rda"), or otherwise have a model function in memory.Call the model function with the input dataframe as the argument, e.g.
result <- model(data). The output dataframe will add the above three or five columns to the input dataframe. Done.
Complete working example, assuming Earth Engine data outputs in
GEEData/ directory (see demo.Rmd):
source("model.R")
load("model.rda")
data <- map_dfr(
list.files("GEEData/", full.names = T),
\(fn) read_csv(fn, col_types="cDddddddddd") %>%
mutate(day = as.integer(format(date, "%j"))) %>%
filter(lst > -273) # filters errors
)
pred <- model(data)
Retrieving Prediction Data¶
Suitable prediction data can be retrieved in any way, but a script for
retrieving the data from Google Earth Engine is included. In
eeretrieval.py, the function getAllTimeseries will retrieve data
for specified points and dates to Google Drive. Points are specified
according to the example format in datapts.py: a list of
[[[longitude, latitude], "point id"], ...].
To run eeretrieval.py:
Specify your GEE cloud project name in the line
ee.Initialize(project='YOUR_CLOUD_PROJECT_HERE').Provide a list of points in the format above, e.g.
datapts = [[[-87.3989084,34.28537757], '02450250'], [[-87.1702791,33.75760677], '02453500'], [[-86.8911032,33.5239948], '02458502']]for the first few USGS gages used for development. Site IDs are used only to identify locations and have no internal meaning. If you are interested in multiple points along certain rivers, you may find the script below for generating evenly-spaced points to be useful.Specify your dates of interest in the format
[("year", "time", "start", "end"), ...]. The script contains an example for automatically doing so, which you can likely just modify to specify your years of interest.Run
getAllTimeseries(points, times, basename, folder, prt=True, wait=...). Basename is the base file name and folder is the Google Drive output directory.waitspecifies pause in seconds between jobs to avoid exceeding the maximum of 3,000 simultaneous jobs; each daily job usually takes about 1 minute per 1,000 sites all told, so for a 20-year run (~7,000 jobs), a wait time of about 20 seconds per 1,000 sites will work. If you are running fewer than 3,000 jobs (about 8 years), you do not need a wait time.After the jobs run, download the output data and use for prediction as above.
Complete working example (see eeretrieval.py):
import ee
from eeretrieval import getAllTimeseries
ee.Initialize(project='YOUR_CLOUD_PROJECT_HERE')
datapts = [[[-87.3989084,34.28537757], '02450250'], [[-87.1702791,33.75760677], '02453500'], [[-86.8911032,33.5239948], '02458502']]
times = [(str(x), str(y+1),
str(dt.date(x, 1, 1) + dt.timedelta(y)),
str(dt.date(x, 1, 1) + dt.timedelta(y+1)))
for x in range(2001, 2024)
for y in range(365)]
getAllTimeseries(datapts, times, "AllData", "AFolder", prt=True, wait=20)
Generating Evenly-Spaced Points¶
You’ll need a list of points of interest (USGS gage IDs, COMIDs, or coordinates - all in the same format), your desired maximum number of points per river, and a corresponding list of base IDs (and can optionally specify the resolution; the default is 1,000).
ID format:
USGS gage:
"USGS-<ID>"withsite_type="usgs"COMID: just the COMID string with
site_type="comid"Coordinates: (lon, lat) in decimal degrees with
site_type="coordinates"
Then, just run
points_above.points_above_all([<site ids>], <site type>, <max points per river>, <resolution>, [<base ids>])
to get a list in the suitable format.
Example for Clear Creek and Sagehen Creek:
points_above_all(["USGS-10343500", "USGS-06719505"], "usgs", 50, 1000, ["Sagehen", "ClearCreek"])
Training a Model¶
Using all default options, full.schema() returns a function for
training a model. This function’s argument is a data frame similar to
above, but with one additional column: temperature, the observed
temperature (daily mean, Celsius) for training (also temperature.max
if you wish to train a model for daily maximum temperature as well).
full.schema()(training data) then returns a function like model
above. A typical approach to model training is for id to be set to
training ga(u)ge IDs, such as the USGS stream temperature gage network,
so that spatial observations can readily be linked to observed
temperatures. Prediction data and observations are retrieved
independently of the training observations, but an example for USGS
gages is included in demo.Rmd.
full.schema provides some additional options.
First, by default the model is trained with the two kriging models
included, but the full.schema function itself is a generic framework
for linking a seasonality (sche) and anomaly (ma) component, and
these can be specified separately, e.g. to test a different approach
without changing any of the infrastructure.
The third argument to full.schema is rtn.model, which is
FALSE by default. If it is set to TRUE, then, instead of
returning a prediction function,
full.schema(rtn.model=TRUE)(training data) will return a list of
model components. This is useful to examine model behaviors directly,
such as extracting coefficients from the kriging models. This is shown
in demo.Rmd.
If the fourth argument to full.schema, use.max, is TRUE
(default FALSE), then the model will be trained to predict both mean
and maximum temperature, requiring a temperature.max column in the
training data.
Citation and More Information¶
If you use this model in your research, please cite Philippus, Corona, Schneider, Rust, and Hogue, (2025), “Satellite-Based Spatial-Statistical Modeling of Daily Stream Water Temperatures at the CONUS Scale”, Journal of Hydrology, https://doi.org/10.1016/j.jhydrol.2025.133321. This paper also contains a detailed description of model design and performance characteristics.
The seasonality component is based on the “three-sine” stream annual temperature cycle function described in: Philippus, Corona, and Hogue, (2024), “Improved annual temperature cycle function for stream seasonal thermal regimes”, JAWRA, https://doi.org/10.1111/1752-1688.13228.
Full training datasets, pre-trained models, and a knitted model validation notebook PDF are available in CUAHSI HydroShare: https://doi.org/10.4211/hs.a8b243957f7946e388d10ab206990675.
Source code is on GitHub: https://github.com/mines-ciroh/tempest2