README

A managed ML workflow platform for Amazon Robotics science teams. Write Python, declare your compute needs, and run—without managing any infrastructure.

What is Metaflow

The Short Version

Metaflow is an open-source Python framework for building and running ML workflows. Originally built at Netflix, it lets you define pipelines as Python classes where each step is a method decorated with @step.

This platform deploys a fully managed Metaflow environment on AWS specifically for Amazon Robotics. You get GPU compute, artifact storage, workflow orchestration, a web UI, and SSO authentication—all pre-configured and ready to use.

Who is this for? Any Amazon Robotics science team running compute-intensive workloads: RL policy training, physics simulation (Isaac Lab, MuJoCo, Drake), hyperparameter sweeps, data pipelines, or any workflow that benefits from GPU acceleration and reproducibility.
Key Concepts

Metaflow Building Blocks

Understanding four concepts unlocks the entire platform:

Flow

A Python class that defines a directed acyclic graph (DAG) of steps. Each flow is a complete pipeline.

Step

A method decorated with @step. Each step runs in its own container with the compute resources you specify.

Run

A single execution of a flow. Every run is tracked, versioned, and browsable in the UI.

Artifact

Data assigned to self.* in a step. Automatically serialized to S3 and available in downstream steps.

Compute Tiers

Three Compute Tiers

Request the right compute for your workload. Karpenter auto-provisions nodes and scales to zero when you're done.

Tier         | Instance Families        | GPUs                   | Use Case
cpu-only     | c, m, r (gen 5+)         | none                   | Data processing, preprocessing, lightweight computation
standard-gpu | g5, g6, g6e              | A10G, L4, L40S         | Most training and simulation workloads
top-tier-gpu | p4d/de, p5/e/en, p6-b200 | A100, H100, H200, B200 | Large-scale training, multi-GPU, distributed

You choose the tier via the @resources decorator. The platform handles provisioning, scheduling, and cleanup.

Architecture

Platform Architecture

The platform is deployed as six isolated CloudFormation stacks. The key design principle: data is separated from compute. You can rebuild the entire EKS cluster without losing a single artifact or metadata record.

VPC Stack

Multi-AZ VPC with secondary CIDR blocks for pod IP isolation, NAT gateways, and VPC endpoints.

Data Stack

Aurora PostgreSQL (metadata) + S3 (artifacts). Both have RETAIN policies—they survive stack deletion.

EKS Stack

EKS Auto Mode cluster with Karpenter, Argo Workflows, Argo Events, Kyverno, and the Metaflow service.

Auth Stack

Cognito User Pool with Amazon Federate as the OIDC identity provider. The CLI authenticates with the OAuth authorization-code flow with PKCE; the browser uses the standard OIDC redirect flow.

DNS Stack

Stage-specific hosted zone with wildcard ACM certificate and cross-account delegation.

Ingress Stack

NLB + Envoy Gateway + External DNS. Routes CLI, API, and UI traffic with TLS termination.

What You Get

What Metaflow Gives You

Compared to managing your own EKS cluster, GPU instances, and workflow engine, you get GPU compute that auto-provisions and scales to zero when idle, artifact storage and versioning in S3, workflow orchestration with every run tracked in the web UI, and pre-configured SSO authentication, all without operating any of the underlying infrastructure yourself.

Domain & Access

How to Access

The platform is available at:

Web UI

https://{stage}.metaflow.simulation.amazon.dev

CLI API

https://{stage}.metaflow.simulation.amazon.dev/cli

Metadata API

https://{stage}.metaflow.simulation.amazon.dev/api/metadata

Replace {stage} with your environment: alpha, sandbox, or prod.
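
One way to point a local Metaflow client at the metadata endpoint is via Metaflow's standard configuration environment variables. This sketch assumes the prod stage and leaves SSO authentication to the platform's CLI setup; see the Onboarding guide:

```python
import os

# Example stage; substitute alpha, sandbox, or prod.
stage = "prod"

# METAFLOW_SERVICE_URL and METAFLOW_DEFAULT_METADATA are
# Metaflow's standard configuration variables.
os.environ["METAFLOW_SERVICE_URL"] = (
    f"https://{stage}.metaflow.simulation.amazon.dev/api/metadata"
)
os.environ["METAFLOW_DEFAULT_METADATA"] = "service"
```

With the metadata service configured, runs started locally are recorded centrally and show up in the web UI alongside everyone else's.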

Authentication required. All endpoints require Amazon SSO authentication via Midway. See the Onboarding guide for setup instructions.