Copying Large Files to EFS

Prerequisites

Install kubectl, the AWS CLI (aws), and rsync.

AWS Configuration

Run aws configure. This command will prompt you for your access key ID, secret access key, and default region. The values for this project are:

  • aws account id: 023300502152

  • region: us-west-2

  • user name, access key id, secret key: Coordinate with Yoni to get these if you don’t have them already
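After configuring, it can help to confirm that the credentials resolve to the expected account. The following is a minimal sketch, not an official procedure: the function name check_aws_account is hypothetical, and the default profile name "iarpa" is an assumption borrowed from the kubeconfig script further down this page.

```shell
# Sketch: confirm the configured credentials resolve to the expected account.
# EXPECTED_ACCOUNT_ID is the account id listed above; the "iarpa" default
# profile is an assumption taken from the kubeconfig script on this page.
EXPECTED_ACCOUNT_ID=023300502152

check_aws_account() {
    profile="${1:-iarpa}"
    actual=$(aws sts get-caller-identity --profile "$profile" \
        --query "Account" --output text) || {
        echo "could not query AWS identity for profile '$profile'" >&2
        return 2
    }
    if [ "$actual" = "$EXPECTED_ACCOUNT_ID" ]; then
        echo "OK: profile '$profile' uses account $actual"
    else
        echo "MISMATCH: expected $EXPECTED_ACCOUNT_ID, got $actual" >&2
        return 1
    fi
}

# Usage: check_aws_account          # checks the "iarpa" profile
#        check_aws_account default  # checks the default profile
```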

Configuring kubectl to reach Smartflow

(From David Joy)

Once kubectl and aws are installed, configure kubectl so it can reach the cluster where Smartflow is running. Here is a short bash script that should do that for you:

################################
# Environment to connect to.
ENVIRONMENT_NAME=kitware-prod-v2
################################
# Look up the account ID from the credentials in the "iarpa" profile.
AWS_ACCOUNT_ID=$(aws sts --profile iarpa get-caller-identity --query "Account" --output text)
AWS_REGION=us-west-2

# Write a kubeconfig entry for the Smartflow EKS cluster using the admin role.
aws eks --profile iarpa --region "$AWS_REGION" update-kubeconfig \
	--name "smartflow-${ENVIRONMENT_NAME}-eks" \
	--role-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:role/smartflow-${ENVIRONMENT_NAME}-${AWS_REGION}-eks-admin"
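Once the kubeconfig has been updated, a quick check that kubectl can actually talk to the cluster may save debugging time later. This is a rough sketch: check_cluster_access is a hypothetical helper name, and the default namespace "airflow" is the one used by the commands later on this page.

```shell
# Sketch: confirm kubectl can reach the cluster after update-kubeconfig.
# The "airflow" default is the namespace used elsewhere on this page.
check_cluster_access() {
    namespace="${1:-airflow}"
    if kubectl -n "$namespace" get pods >/dev/null 2>&1; then
        echo "kubectl can list pods in namespace '$namespace'"
    else
        echo "kubectl cannot list pods in namespace '$namespace'" >&2
        # Show which context kubectl is using, to help spot a wrong cluster.
        echo "current context: $(kubectl config current-context 2>/dev/null)" >&2
        return 1
    fi
}

# Usage: check_cluster_access
```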

rsync and kubernetes

Copying files to a Kubernetes pod is tricky. The script below, which wraps rsync around kubectl exec, makes this less painful.

krsync.sh

#!/bin/bash

if [ -z "$KRSYNC_STARTED" ]; then
    export KRSYNC_STARTED=true
    # Quote "$@" so arguments containing spaces are passed through intact.
    exec rsync --blocking-io --rsh "$0" "$@"
fi

# Running as --rsh
namespace=''
pod=$1
shift

# If the user passes pod@namespace, rsync invokes us as: {us} -l pod namespace ...
if [ "X$pod" = "X-l" ]; then
    pod=$1
    shift
    namespace="-n $1"
    shift
fi

# $namespace is intentionally unquoted so that "-n <ns>" splits into two arguments.
exec kubectl $namespace exec -i "$pod" -- "$@"

Make the script executable and put it on your PATH as krsync. Then you can use it wherever you would normally use rsync:

krsync -av --progress --stats src-dir/ pod:/dest-dir

Or you can set the namespace:

krsync -av --progress --stats src-dir/ pod@namespace:/dest-dir
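To see why the pod@namespace form works, it helps to look at what rsync hands to the --rsh command. The sketch below mirrors the argument handling in krsync.sh but only prints the kubectl command instead of running it. The helper name krsync_rsh_preview, the pod name mypod, and the shortened "rsync --server . /dest-dir" arguments are illustrative assumptions; real rsync passes a longer server argument list.

```shell
# Sketch: print the kubectl command that the --rsh branch of krsync.sh would run.
krsync_rsh_preview() {
    namespace=''
    pod=$1
    shift
    # For pod@namespace, rsync treats "pod" as a login name and passes
    # "-l pod namespace ..." to the remote-shell command.
    if [ "$pod" = "-l" ]; then
        pod=$1
        shift
        namespace="-n $1"
        shift
    fi
    # $namespace is left unquoted on purpose so "-n ns" splits into two words.
    echo kubectl $namespace exec -i "$pod" -- "$@"
}

# Plain "pod:" form
krsync_rsh_preview mypod rsync --server . /dest-dir
# prints: kubectl exec -i mypod -- rsync --server . /dest-dir

# "pod@namespace:" form
krsync_rsh_preview -l mypod airflow rsync --server . /dest-dir
# prints: kubectl -n airflow exec -i mypod -- rsync --server . /dest-dir
```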

Connecting to Smartflow

You can forward the Smartflow GUI port to your local machine with the following command:

kubectl -n airflow port-forward service/airflow-webserver 8080:8080

And then reach the GUI at: http://localhost:8080
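If the page does not load, a quick check from a second terminal can tell you whether the forwarded port is answering at all. This is a minimal sketch assuming curl is installed locally; gui_is_up is a hypothetical helper name, and it only tests that the root URL responds without an HTTP error.

```shell
# Sketch: check whether the forwarded Smartflow GUI port is answering.
gui_is_up() {
    url="${1:-http://localhost:8080}"
    # --fail makes curl return non-zero on HTTP errors (status >= 400);
    # --max-time keeps the check from hanging if nothing is listening.
    if curl --silent --fail --max-time 5 --output /dev/null "$url/"; then
        echo "Smartflow GUI is reachable at $url"
    else
        echo "Smartflow GUI is not reachable at $url (is port-forward running?)" >&2
        return 1
    fi
}

# Usage: gui_is_up
```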

Launch a KIT_DEMO_WAIT job

Find the KIT_DEMO_WAIT job in the main interface. To launch, click the green “Play” button, and choose to run with params. Set the wait time to the amount of time you estimate rsyncing and then unpacking your dataset will take (plus a healthy buffer to account for error and slowdowns).
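As a rough way to pick the wait time, you can divide the dataset size by an assumed throughput and add a buffer. The sketch below is only an estimate: estimate_wait_seconds is a hypothetical helper, the default of 8 MB/s is the low end of the rate reported further down this page, the 50% buffer is an arbitrary assumption, and unpacking time is not modeled, so add that separately.

```shell
# Sketch: estimate a wait time in seconds for the transfer alone.
#   $1 = dataset size in MB
#   $2 = assumed throughput in MB/s (default 8)
#   $3 = buffer as a percentage of the transfer time (default 50)
estimate_wait_seconds() {
    size_mb=$1
    rate_mb_per_s=${2:-8}
    buffer_percent=${3:-50}
    # Round the transfer time up to a whole second.
    transfer=$(( ($size_mb + $rate_mb_per_s - 1) / $rate_mb_per_s ))
    echo $(( $transfer * (100 + $buffer_percent) / 100 ))
}

# 100000 MB (about 100 GB) at 8 MB/s with a 50% buffer:
estimate_wait_seconds 100000   # prints 18750 (a little over 5 hours)

# The dataset size in MB can be measured with: du -sm "$DATA_FPATH" | cut -f1
```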

Finally we can copy!

To find the $POD_ADDR of your waiting pod:

kubectl -n airflow get pods
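If the pod list is long, filtering it by name can help. The following is a sketch only: find_pod is a hypothetical helper, and the pattern you pass (for example "kit-demo-wait") is a guess at how the waiting pod is named, so check it against the actual output of the command above.

```shell
# Sketch: print the name of the first running pod whose name matches a pattern.
find_pod() {
    pattern=$1
    # "-o name" prints entries as "pod/<name>"; strip the prefix before matching.
    kubectl -n airflow get pods -o name \
        --field-selector=status.phase=Running |
        sed 's|^pod/||' | grep -- "$pattern" | head -n 1
}

# Usage (the pattern is a hypothetical example):
#   POD_ADDR=$(find_pod kit-demo-wait)
```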

To open a shell in it (for example, to install rsync or other packages):

kubectl -n airflow exec -it pods/$POD_ADDR -- bash

To copy files to the EFS share mounted in the pod (the @airflow suffix selects the namespace, matching the kubectl commands above):

krsync -av --progress --stats $DATA_FPATH $POD_ADDR@airflow:/efs/work/$DEST_FPATH

In my experience, copying from KHQ runs at around 8-10 MB/s.
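At that rate a large transfer can run for hours, so it may be worth guarding against dropped connections. Below is a minimal retry sketch, assuming krsync is on your PATH; krsync_retry is a hypothetical wrapper, and it adds rsync's --partial flag so that a partially transferred file is kept and can be reused by the next attempt.

```shell
# Sketch: retry a krsync transfer a limited number of times.
#   $1 = maximum number of attempts
#   $2 = seconds to wait between attempts
#   remaining arguments = source and destination, as for krsync
krsync_retry() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # --partial keeps partially transferred files for the next attempt.
        if krsync -av --partial --progress --stats "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay_seconds"
        fi
    done
    return 1
}

# Usage:
#   krsync_retry 5 30 "$DATA_FPATH" "$POD_ADDR@airflow:/efs/work/$DEST_FPATH"
```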

Common pitfalls

  1. When rsyncing a tarball, make sure to follow [[Dealing with tar error, cannot change owner]] when untarring the file.

  2. The rsync executable must be available in the pod image for this to work. On a minimal Debian- or Ubuntu-based Docker image, install it from a shell in the pod: apt update && apt install -y rsync
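To catch the second pitfall before starting a long transfer, you can ask the pod whether rsync is present. This is a sketch under the assumption that the image has a POSIX sh; pod_has_rsync is a hypothetical helper name, and the default namespace "airflow" matches the commands above.

```shell
# Sketch: return success if rsync is on the PATH inside the given pod.
pod_has_rsync() {
    pod=$1
    namespace="${2:-airflow}"
    kubectl -n "$namespace" exec "$pod" -- sh -c 'command -v rsync' >/dev/null 2>&1
}

# Usage:
#   pod_has_rsync "$POD_ADDR" || echo "rsync is missing in $POD_ADDR; install it first"
```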