Accessing GEOWATCH DVC Repos

This document outlines how internal collaborators can access the GEOWATCH DVC repos.

DVC stands for Data Version Control. It is a layer on top of git that helps manage larger data files. For more information on DVC, see getting started with dvc <getting_started_dvc.rst>_.

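As of 2022-09-29 there are two primary internal DVC repos: the data DVC repo (smart_data_dvc) and the experiment DVC repo (smart_expt_dvc), both hosted under gitlab.kitware.com/smart.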

Note: There is an additional repo for Drop7 Cropped AC/SC data:

This document will outline how to clone the DVC repos, and then how to pull relevant data from them.

Prerequisites

To clone the DVC repos you must have access to gitlab.kitware.com/smart <https://gitlab.kitware.com/smart>_. If you do not have permission, please contact someone at Kitware and ask them to add you to the smart group.

Once you have access to gitlab.kitware.com/smart <https://gitlab.kitware.com/smart>_, ensure that you have ssh keys set up and registered with gitlab. More details on generating ssh keys and registering them with gitlab can be found in the ssh setup instructions.

To access the internal DVC remotes you must have AWS credentials. For details see the aws getting started docs.

You should also have DVC installed. See getting started with dvc if you are unfamiliar with the concepts of DVC.
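The prerequisites above can be checked quickly from a terminal. The following is a minimal sketch, not part of the GEOWATCH tooling: the helper name check_tools is hypothetical, and the inclusion of the aws command-line client is an assumption (this document only states that AWS credentials are required).

```shell
# Hypothetical helper: report which of the given tools are missing from PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools this guide relies on. Checking for the aws CLI is an assumption;
# only the credentials themselves are stated as a requirement.
check_tools git ssh dvc aws || echo "Install the missing tools before continuing"
```

This only confirms that the programs are on your PATH. It does not verify that your ssh key is registered with gitlab or that your AWS credentials are valid.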

Clone the Repos

Assuming you have your ssh keys registered with gitlab.kitware.com, and you are a member of the smart group, you should be able to clone the repos with ssh credentials.

We recommend using $HOME/data/dvc-repos as the location for storing the DVC repos. The commands below abstract this with an environment variable, DVC_REPOS_DIR, which you can change to the location where you want to store the data. (Note that some geowatch tools can auto-detect DVC repos if they are in the recommended locations.)

# Ensure you have git on your system (the install step assumes a Debian-based system)
command -v git > /dev/null || sudo apt install git -y

# This is the recommended location to checkout DVC repos. Change as needed
DVC_REPOS_DIR=$HOME/data/dvc-repos
mkdir -p "$DVC_REPOS_DIR"

# Clone the Data DVC Repo
git clone git@gitlab.kitware.com:smart/smart_data_dvc.git "$DVC_REPOS_DIR/smart_data_dvc"

# Clone the Experiment DVC Repo
git clone git@gitlab.kitware.com:smart/smart_expt_dvc.git "$DVC_REPOS_DIR/smart_expt_dvc"

The clone should be very fast. A DVC repo is just a git repo that contains pointers to data that lives elsewhere. The next section provides instructions on how to access that data.
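To confirm that both clones succeeded, you can check that each directory looks like a DVC repo. This is a rough sketch under the assumption that a DVC repo has a .dvc directory next to its .git entry at the repo root; the helper name is_dvc_repo is hypothetical.

```shell
# Hypothetical helper: succeed when DIR looks like a DVC repo,
# i.e. a git checkout that also contains a .dvc directory.
is_dvc_repo() {
    [ -e "$1/.git" ] && [ -d "$1/.dvc" ]
}

# Fall back to the recommended location if DVC_REPOS_DIR is not already set
DVC_REPOS_DIR="${DVC_REPOS_DIR:-$HOME/data/dvc-repos}"

for repo in smart_data_dvc smart_expt_dvc; do
    if is_dvc_repo "$DVC_REPOS_DIR/$repo"; then
        echo "ok: $repo"
    else
        echo "not found or not a DVC repo: $repo"
    fi
done
```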

Access data in the Data DVC repo

Assuming you have cloned the data DVC repo, the next step is to access data in it.

This requires that your AWS credentials are set up. By default the DVC repos are configured to access a remote called “aws” via the iarpa aws profile.
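One way to check for the iarpa profile is to look for its section header in the shared credentials file. This is only a sketch: the helper name has_aws_profile is hypothetical, and it assumes the profile is stored in the default location ~/.aws/credentials. A profile defined only in ~/.aws/config would not be detected by this check.

```shell
# Hypothetical helper: succeed when the credentials file contains a
# "[NAME]" section header. The default path is an assumption about
# where your credentials are stored.
has_aws_profile() {
    profile_name=$1
    credentials_file=${2:-$HOME/.aws/credentials}
    [ -f "$credentials_file" ] && grep -q -x -F "[$profile_name]" "$credentials_file"
}

if has_aws_profile iarpa; then
    echo "iarpa profile found"
else
    echo "iarpa profile not found in the default credentials file"
fi
```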

First ensure DVC is installed with the S3 backend:

# Ensure dvc is installed
pip install "dvc[s3]"

To start, let's pull the data associated with one of the BAS “Drop4” datasets. This is part of the “Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC” kwcoco bundle. We will refer to this with an environment variable DATASET_CODE.

# Navigate to the kwcoco bundle
DVC_REPOS_DIR=$HOME/data/dvc-repos
DATASET_CODE=Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC

cd "$DVC_REPOS_DIR/smart_data_dvc/$DATASET_CODE"

# List the files that exist
ls

You will notice that there are several folders and some “.dvc” files. We need to use these to access the data they are pointing to.
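To see only the DVC pointer files, you can filter the listing with find. This is a small sketch; the helper name list_dvc_files is hypothetical, and the search depth of two levels is an assumption based on the bundle layout described in this document (top-level pointers plus one pointer file per region folder).

```shell
# Hypothetical helper: list the .dvc pointer files in DIR and in its
# immediate subfolders, sorted by path.
list_dvc_files() {
    find "${1:-.}" -maxdepth 2 -type f -name '*.dvc' | sort
}

# Run from inside the kwcoco bundle directory
list_dvc_files .
```

Each path printed is a valid target for dvc pull.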

Currently (as of 2022-09-29) the annotations are pointed to by the “splits.zip.dvc” file and the images for each region are pointed to by their own DVC file.

Let's start by grabbing the kwcoco annotation files. The following command will pull the data pointed to by the splits.zip.dvc file from the aws DVC remote.

dvc pull -r aws splits.zip.dvc

This should download in a few seconds. If you now run ls, you should see splits.zip. Unzip the kwcoco files from this archive.

unzip splits.zip

If you now run ls, you should see data_train.kwcoco.json, data.kwcoco.json, and data_vali.kwcoco.json.
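If you prefer an explicit check over reading the ls output, the three expected files can be tested for directly. This is a minimal sketch; the helper name check_kwcoco_files is hypothetical, and the file names are the ones listed above.

```shell
# Hypothetical helper: report whether each expected kwcoco file exists
# in the bundle directory. Returns non-zero if any file is absent.
check_kwcoco_files() {
    bundle_dir=${1:-.}
    status=0
    for name in data.kwcoco.json data_train.kwcoco.json data_vali.kwcoco.json; do
        if [ -f "$bundle_dir/$name" ]; then
            echo "present: $name"
        else
            echo "absent: $name"
            status=1
        fi
    done
    return "$status"
}

# Run from inside the kwcoco bundle directory
check_kwcoco_files . || echo "Run the dvc pull and unzip steps first"
```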

Note that we only have the kwcoco files; we still have not pulled any of the images that they point to.

To inspect these files, kwcoco must be installed; run pip install kwcoco if needed.

Now, if you run:

kwcoco validate data_vali.kwcoco.json

you will see that there are 17714 missing images.

To get started more quickly, let's work with only a subset of the data. We can make a new kwcoco file that only points to Landsat 8 data in “KR_R001” via the kwcoco subset command:

kwcoco subset \
    --src data_vali.kwcoco.json \
    --dst data_KR_R001.kwcoco.json \
    --select_videos '.name == "KR_R001"' \
    --select_images '.sensor_coarse == "L8"'

Running kwcoco validate data_KR_R001.kwcoco.json will now report only 1705 missing images, which correspond to the data pointed to by the KR_R001/L8.dvc file. To obtain this data, run:

dvc pull -r aws KR_R001/L8.dvc

This will take a bit longer, but likely no more than a minute or two. Now running:

kwcoco validate data_KR_R001.kwcoco.json

will report no issues.

Using kwcoco stats data_KR_R001.kwcoco.json will provide some information about the dataset.

We could use kwcoco show data_KR_R001.kwcoco.json to inspect the data, but because this is MSI imagery it would be more appropriate to use geowatch visualize data_KR_R001.kwcoco.json (assuming the geowatch system has been installed). Likewise, geowatch stats data_KR_R001.kwcoco.json can provide more geowatch-relevant information.

It is now possible to use this kwcoco file for testing purposes.

Obtaining the rest of the data is similar: simply use dvc pull, and keep in mind that kwcoco subset is a useful tool for selecting a smaller part of the data.
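When pulling several region and sensor combinations, a small wrapper can build the dvc pull target for you. This is a sketch only: the helper name pull_region_sensor and the DRY_RUN variable are hypothetical, and the REGION/SENSOR.dvc path pattern is an assumption generalized from the KR_R001/L8.dvc example above. By default it only prints the command it would run.

```shell
# Hypothetical helper: pull the images for one region and sensor.
# Assumes pointer files follow the REGION/SENSOR.dvc pattern seen in
# the KR_R001/L8.dvc example. With DRY_RUN=1 (the default here) the
# command is printed instead of executed.
pull_region_sensor() {
    region=$1
    sensor=$2
    target="$region/$sensor.dvc"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "dvc pull -r aws $target"
    else
        dvc pull -r aws "$target"
    fi
}

# Print the command for the example region used in this document
pull_region_sensor KR_R001 L8
```

Set DRY_RUN=0 in your shell to actually run the pull, and check with ls that the pointer file exists before doing so.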

To download all of the data in a directory, run dvc pull with the -R (recursive) flag:

dvc pull -r aws -R .

After this downloads, any of the kwcoco files in the directory can be used.

We recommend using the geowatch_dvc tool to register the path you cloned these repos to, as illustrated in ../environment/getting_started_dvc.rst.