# README

A managed ML workflow platform for Amazon Robotics science teams. Write Python, declare your compute needs, and run—without managing any infrastructure.
## The Short Version

Metaflow is an open-source Python framework for building and running ML workflows. Originally built at Netflix, it lets you define pipelines as Python classes where each step is a method decorated with `@step`.

This platform deploys a fully managed Metaflow environment on AWS specifically for Amazon Robotics. You get GPU compute, artifact storage, workflow orchestration, a web UI, and SSO authentication—all pre-configured and ready to use.
## Metaflow Building Blocks

Understanding four concepts unlocks the entire platform:

- **Flow**: A Python class that defines a directed acyclic graph (DAG) of steps. Each flow is a complete pipeline.
- **Step**: A method decorated with `@step`. Each step runs in its own container with the compute resources you specify.
- **Run**: A single execution of a flow. Every run is tracked, versioned, and browsable in the UI.
- **Artifact**: Data assigned to `self.*` in a step. Automatically serialized to S3 and available in downstream steps.
## Three GPU Tiers

Request the right compute for your workload. Karpenter auto-provisions nodes and scales to zero when you're done.

| Tier | Instance Families | GPUs | Use Case |
|---|---|---|---|
| `cpu-only` | c, m, r (gen 5+) | — | Data processing, preprocessing, lightweight computation |
| `standard-gpu` | g5, g6, g6e | L4, L40S | Most training and simulation workloads |
| `top-tier-gpu` | p4d/de, p5/e/en, p6-b200 | A100, H100, B200 | Large-scale training, multi-GPU, distributed |

You choose the tier via the `@resources` decorator. The platform handles provisioning, scheduling, and cleanup.
## Platform Architecture

The platform is deployed as six isolated CloudFormation stacks. The key design principle: data is separated from compute. You can rebuild the entire EKS cluster without losing a single artifact or metadata record.

- **VPC Stack**: Multi-AZ VPC with secondary CIDR blocks for pod IP isolation, NAT gateways, and VPC endpoints.
- **Data Stack**: Aurora PostgreSQL (metadata) + S3 (artifacts). Both have RETAIN policies—they survive stack deletion.
- **EKS Stack**: EKS Auto Mode cluster with Karpenter, Argo Workflows, Argo Events, Kyverno, and the Metaflow service.
- **Auth Stack**: Cognito User Pool with Amazon Federate as the OIDC identity provider. CLI uses PKCE; browser uses OIDC.
- **DNS Stack**: Stage-specific hosted zone with wildcard ACM certificate and cross-account delegation.
- **Ingress Stack**: NLB + Envoy Gateway + External DNS. Routes CLI, API, and UI traffic with TLS termination.
## What Metaflow Gives You

Compared to managing your own EKS cluster, GPU instances, and workflow engine:

- Zero infrastructure management. No Terraform, no Kubernetes YAML, no node group tuning.
- GPU autoscaling. Nodes spin up in seconds and scale to zero when idle.
- Reproducibility. Every run is tracked. Every artifact is versioned. Resume from any point.
- Live monitoring. Training curves, loss metrics, and progress bars in real time via `@card`.
- Failure recovery. `@retry` and `@checkpoint` mean transient failures don't lose hours of training.
- Parallel execution. Fan out across thousands of parallel environments with `foreach`.
- Production scheduling. Cron, event-driven triggers, and cross-flow dependencies out of the box.
- Enterprise auth. Amazon SSO. No shared credentials. No manual key rotation.
## How to Access

The platform is available at:

- Web UI: `https://{stage}.metaflow.simulation.amazon.dev`
- CLI API: `https://{stage}.metaflow.simulation.amazon.dev/cli`
- Metadata API: `https://{stage}.metaflow.simulation.amazon.dev/api/metadata`

Replace `{stage}` with your environment: `alpha`, `sandbox`, or `prod`.
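To point a local Metaflow CLI at the platform, the standard Metaflow environment variables can be set as below. This is a sketch assuming the default Metaflow configuration mechanism; `sandbox` stands in for your actual stage, and your platform-issued config may set these for you:

```shell
# METAFLOW_DEFAULT_METADATA=service tells Metaflow to use a remote
# metadata service instead of local tracking.
export METAFLOW_DEFAULT_METADATA=service
export METAFLOW_SERVICE_URL="https://sandbox.metaflow.simulation.amazon.dev/api/metadata"

python hello.py run   # runs now appear in the web UI for that stage
```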