Using the MLOps tool to run GeoWATCH¶
NOTE: THE PATH FORMAT IS IN HIGH FLUX. DOCS MAY BE OUTDATED
MLOps is a wrapper around the cmd_queue library that provides a single entrypoint for all steps of the GeoWATCH TA-2 pipeline, starting with a model checkpoint and the program data:
Predict BAS on low-res data (S2/L8)
Pixel evaluation metrics on BAS
Convert to polygons in IARPA site model geojson format
Score on IARPA metrics [using a heuristic for SC phases]
Visualize BAS
Create cropped SC dataset from BAS predictions with high-res data (S2/WV/PD)
Predict SC on high-res data
Pixel evaluation metrics on SC
Convert to polygons in IARPA site model geojson format (again)
Score on IARPA metrics (again)
Visualize SC
You can invoke any one of these steps on its own through python -m geowatch... without MLOps, but MLOps handles all the plumbing of feeding jobs' inputs and outputs to each other and spawning them to use all your available compute resources.
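Each stage maps to its own entrypoint. As a rough sketch, these are the underlying modules MLOps wires together (module paths taken from the generated run scripts shown later on this page; any of them can be invoked standalone):

```shell
# The per-stage entrypoints that MLOps chains together. Module paths are
# copied from the generated run scripts; each can be run by hand.
stages=(
    "geowatch.tasks.fusion.predict"        # BAS/SC pixel prediction
    "geowatch.cli.run_tracker"             # heatmaps -> site model polygons
    "geowatch.tasks.fusion.evaluate"       # pixel evaluation metrics
    "geowatch.cli.run_metrics_framework"   # IARPA metrics
)
for stage in "${stages[@]}"; do
    echo "python -m ${stage} ..."
done
```

MLOps runs the BAS (trk) half of this list first, crops a new dataset from its output, then runs the SC (act) half on the crop.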
using DVC and geowatch_dvc¶
(Note: geowatch_dvc will be replaced by sdvc from the simple_dvc package.)
geowatch_dvc is an alias for python -m geowatch.cli.find_dvc. MLOps uses it to manage paths to your DVC repos.
To get started:
geowatch_dvc list
Do you see a data and an expt repo with exists=True? If not, either:
a) you have an existing local DVC checkout, but it's not listed. Register it:
geowatch_dvc add --name=my_data_repo --path=path/to/smart_data_dvc --hardware=hdd --priority=100 --tags=phase2_data
geowatch_dvc add --name=my_expt_repo --path=path/to/smart_expt_dvc --hardware=hdd --priority=100 --tags=phase2_expt
b) you don't have an existing local DVC checkout. Clone the source repos: https://gitlab.kitware.com/smart/smart_data_dvc https://gitlab.kitware.com/smart/smart_expt_dvc
and put them in the recommended places.
These paths can also be overridden by environment variables; do so at your own risk.
If this worked, you can:
cd $(geowatch_dvc --tags="phase2_expt")
Note that this cd fails when the terminal is too narrow to fit the path on one line, because the output gets pretty-printed with linebreaks.
using mlops to check out a model¶
geowatch mlops is an alias for python -m geowatch.cli.mlops_cli or python -m geowatch.mlops.expt_manager. The entrypoint in code is watch/mlops/expt_manager.py.
Let’s choose a model and do stuff with it!
geowatch mlops --help
MODEL="Drop4_BAS_Retrain_V002"
geowatch mlops "pull packages" --model_pattern="${MODEL}*"
geowatch mlops "pull evals" --model_pattern="${MODEL}*"
geowatch mlops "status"
python -m geowatch.mlops.expt_manager "list" --model_pattern="${MODEL}*"  # for more info
python -m geowatch.mlops.expt_manager "report" --model_pattern="${MODEL}*"  # for more info
The “volatile” table shows model predictions, which are not versioned for space reasons, but are an intermediate product of running pixel and polygon evaluations. The “versioned” table shows models and evaluations which should now exist in your local DVC repo.
the "versioned" table in more detail:
type - "pkg_fpath" (model) or "eval_*"
has_raw - the file exists
has_dvc - the .dvc file exists
needs_pull - the .dvc file is not backed by data on this machine
needs_push - the .dvc file is not backed by data on the remote
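As a rough sketch of what these flags mean on disk (assumed semantics; a DVC-versioned file FOO is tracked by a small FOO.dvc pointer file, and "raw" means FOO itself is present locally):

```shell
# Hypothetical illustration of has_raw / has_dvc / needs_pull using a
# scratch directory and a made-up checkpoint name.
tmpdir=$(mktemp -d)
pkg="$tmpdir/my_model.pt"
touch "$pkg.dvc"   # the .dvc pointer is checked in, but the data is not pulled
has_raw=$([ -e "$pkg" ] && echo 1 || echo 0)
has_dvc=$([ -e "$pkg.dvc" ] && echo 1 || echo 0)
# needs_pull: the pointer exists but the data it references is missing here
needs_pull=$([ "$has_dvc" = 1 ] && [ "$has_raw" = 0 ] && echo 1 || echo 0)
echo "has_raw=$has_raw has_dvc=$has_dvc needs_pull=$needs_pull"
rm -r "$tmpdir"
```

In this state, `geowatch mlops "pull evals"` (or a plain `dvc pull`) is what turns has_raw from 0 to 1.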
TODO does “staging” still exist?
models on disk live in $(geowatch_dvc --tags="phase2_expt")/models/fusion/${DATASET}/packages.
predictions on disk live in $(geowatch_dvc --tags="phase2_expt")/models/fusion/${DATASET}/pred.
evals on disk live in $(geowatch_dvc --tags="phase2_expt")/models/fusion/${DATASET}/eval.
using mlops to run an evaluation¶
(geowatch mlops "evaluate" should work as an alias, but doesn't right now.)
If you haven't used this TEST_DATASET yet, remember to unzip its splits.zip first.
DATASET_CODE=Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC
DATA_DVC_DPATH=$(geowatch_dvc --tags="phase2_data")
DVC_EXPT_DPATH=$(geowatch_dvc --tags="phase2_expt")
TEST_DATASET=$DATA_DVC_DPATH/$DATASET_CODE/data.kwcoco.json
python -m geowatch.mlops.schedule_evaluation \
--params="
matrix:
trk.pxl.model:
- $DVC_EXPT_DPATH/models/fusion/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC/packages/Drop4_BAS_Retrain_V002/Drop4_BAS_Retrain_V002_epoch=31-step=16384.pt.pt
trk.pxl.data.test_dataset:
- $TEST_DATASET
trk.pxl.data.window_scale_space: 15GSD
trk.pxl.data.time_sampling:
- "contiguous"
trk.pxl.data.input_scale_space:
- "15GSD"
trk.poly.thresh:
- 0.07
- 0.1
- 0.12
- 0.175
crop.src:
- /home/joncrall/remote/toothbrush/data/dvc-repos/smart_data_dvc/online_v1/kwcoco_for_sc_fielded.json
crop.regions:
- trk.poly.output
act.pxl.data.test_dataset:
- /home/joncrall/remote/toothbrush/data/dvc-repos/smart_expt_dvc/models/fusion/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC/crop/online_v1_kwcoco_for_sc_fielded/trk_poly_id_0408400f/crop_f64d5b9a/crop_id_59ed6e1b/crop.kwcoco.json
act.pxl.data.input_scale_space:
- 3GSD
act.pxl.data.time_steps:
- 3
act.pxl.data.chip_overlap:
- 0.35
act.poly.thresh:
- 0.01
act.poly.use_viterbi:
- 0
act.pxl.model:
- $DVC_EXPT_DPATH/models/fusion/Aligned-Drop4-2022-08-08-TA1-S2-WV-PD-ACC/packages/Drop4_SC_RGB_scratch_V002/Drop4_SC_RGB_scratch_V002_epoch=99-step=50300-v1.pt.pt
include:
- act.pxl.data.chip_dims: 256,256
act.pxl.data.window_scale_space: 3GSD
act.pxl.data.input_scale_space: 3GSD
act.pxl.data.output_scale_space: 3GSD
" \
--enable_pred_trk_pxl=1 \
--enable_pred_trk_poly=1 \
--enable_eval_trk_pxl=1 \
--enable_eval_trk_poly=1 \
--enable_crop=0 \
--enable_pred_act_pxl=0 \
--enable_pred_act_poly=0 \
--enable_eval_act_pxl=0 \
--enable_eval_act_poly=0 \
--enable_viz_pred_trk_poly=1 \
--enable_viz_pred_act_poly=0 \
--enable_links=1 \
--devices="0,1" --queue_size=2 \
--queue_name='secondary-eval' \
--backend=serial --skip_existing=0 \
--run=0
The enable pred, eval, crop, and viz flags toggle the various stages of the GeoWATCH TA2 system. They are all enabled by default. Any params for a step with --enable_*=0 will be ignored, and params with multiple values will be grid-searched over.
The params are namespaced for convenience
trk.pxl - BAS predict and pixel evaluate
trk.poly - BAS tracking and IARPA poly evaluate
crop - create SC dataset
act.pxl - SC predict and pixel evaluate
act.poly - SC tracking and IARPA poly evaluate
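For instance, the matrix above lists four values for trk.poly.thresh and a single value for everything else, so (assuming simple cross-product semantics for the grid search) the trk.poly stage expands into four jobs:

```shell
# Sketch of grid expansion: lists under "matrix:" are crossed into jobs.
threshs=(0.07 0.1 0.12 0.175)
njobs=0
for thresh in "${threshs[@]}"; do
    echo "trk.poly job with thresh=$thresh"
    njobs=$((njobs + 1))
done
echo "total trk.poly jobs: $njobs"
```

Adding a second model under trk.pxl.model would double that count again, so the grid can grow quickly.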
First, dry-run with --run=0, and if it looks OK, set --run=1. --backend=tmux is also highly recommended instead of serial; it makes the running queue much easier to debug (--backend=slurm is also available if your machine supports it).
--virtualenv_cmd="conda activate watch" (or something to that effect) is also needed if your bashrc does not automatically drop you into the watch virtualenv.
debugging/examining a running mlops invocation¶
This assumes you're using --backend=tmux.
check on a running queue with
tmux a
<C-b> s # switch between sessions
<C-b> d # quit back to main window
You'll see each session start with the command source /long/path/_cmd_queue_schedule/${QUEUE_NAME}_etc/etc/etc.sh. The paths will also be printed to the main window for convenience, along with the commands to kill any leftover tmux sessions or drop into them.
These sourced scripts define what is being run in each tmux session (“job”). In serial mode, they will all run in the main window.
When each job finishes, the result will be cached and its dependencies will start running.
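The caching behavior can be pictured roughly like this (assumed semantics; the real scheduling lives in cmd_queue): a job whose output already exists is skipped, and only then do its dependents run.

```shell
# Rough sketch of job caching: skip work whose output already exists.
tmpdir=$(mktemp -d)
run_job() {
    out="$tmpdir/$1"
    if [ -e "$out" ]; then
        echo "cached: $1"
    else
        echo "running: $1"
        touch "$out"
    fi
}
first=$(run_job pred.kwcoco.json)    # first call does the work
second=$(run_job pred.kwcoco.json)   # second call hits the cache
echo "$first"
echo "$second"
rm -r "$tmpdir"
```

This is also why --skip_existing=0 appears in the invocation above: it forces jobs to rerun even when their outputs are present.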
What does an mlops invocation run?¶
You can dig through the sourced scripts to piece together what a full GeoWATCH TA2 run looks like, minus the boilerplate and error handling. (There are plenty of examples in the mlops source code, but here’s how to tell exactly what you were running.) For example, the pipeline above turns into:
# prediction (omitted)
# python -m geowatch.tasks.fusion.predict ...
# tracking
ROOT="/home/local/KHQ/matthew.bernstein/data/dvc-repos/smart_expt_dvc-ssd/models/fusion/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC/pred/trk/Drop4_BAS_Retrain_V002_epoch=31-step=16384.pt.pt/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC_data.kwcoco"
TRK_ROOT="${ROOT}/trk_pxl_b788335d"  # the pixel-prediction directory
TRK_EXPT="trk_poly_ca4372e1"
python -m geowatch.cli.run_tracker \
"${TRK_ROOT}/pred.kwcoco.json" \
--default_track_fn saliency_heatmaps \
--track_kwargs '{"thresh": 0.12, "moving_window_size": null, "polygon_fn": "heatmaps_to_polys"}' \
--clear_annots \
--out_sites_dir "${TRK_ROOT}/${TRK_EXPT}/sites" \
--out_site_summaries_dir "${TRK_ROOT}/${TRK_EXPT}/site-summaries" \
--out_sites_fpath "${TRK_ROOT}/${TRK_EXPT}/site_tracks_manifest.json" \
--out_site_summaries_fpath "${TRK_ROOT}/${TRK_EXPT}/site_summary_tracks_manifest.json" \
--out_kwcoco "${TRK_ROOT}/${TRK_EXPT}/tracks.kwcoco.json"
# pxl eval
python -m geowatch.tasks.fusion.evaluate \
--true_dataset=/home/local/KHQ/matthew.bernstein/data/dvc-repos/smart_data_dvc-ssd/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC/data.kwcoco.json \
--pred_dataset="${TRK_ROOT}/pred.kwcoco.json" \
    --eval_dpath="${TRK_ROOT}" \
--score_space=video \
--draw_curves=True \
--draw_heatmaps=True \
--viz_thresh=0.2 \
--workers=2
# iarpa poly eval
python -m geowatch.cli.run_metrics_framework \
--merge=True \
--name "Drop4_BAS_Retrain_V002_epoch=31-step=16384.pt.pt-trk_pxl_b788335d-${TRK_EXPT}" \
--true_site_dpath "/home/local/KHQ/matthew.bernstein/data/dvc-repos/smart_data_dvc-ssd/annotations/site_models" \
--true_region_dpath "/home/local/KHQ/matthew.bernstein/data/dvc-repos/smart_data_dvc-ssd/annotations/region_models" \
--pred_sites "${TRK_ROOT}/${TRK_EXPT}/site_tracks_manifest.json" \
--tmp_dir "${TRK_ROOT}/${TRK_EXPT}/_tmp" \
--out_dir "${TRK_ROOT}/${TRK_EXPT}" \
    --merge_fpath "${TRK_ROOT}/${TRK_EXPT}/merged/summary2.json"
# viz
geowatch visualize \
"${TRK_ROOT}/${TRK_EXPT}/tracks.kwcoco.json" \
--channels="red|green|blue,salient" \
--stack=only \
    --workers=avail/2 \
--extra_header="\ntrk_pxl_b788335d-${TRK_EXPT}" \
--animate=True