RDE Workflow Engine is a managed platform for Amazon Robotics science teams. Write Python, decorate with compute requirements, and run—from RL training to large-scale simulation. Infrastructure is handled for you.
Focus on science. The platform handles compute, storage, orchestration, and observability.
Request GPUs with a simple decorator. From L4s for evaluation to H100s and B200s for large-scale training—Karpenter provisions nodes in seconds.
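As a rough sketch of what a compute-requirements decorator might look like — the `resources` name and its parameters are assumptions for illustration, not the platform's actual API:

```python
def resources(gpu=0, gpu_type=None, memory_gb=None):
    """Hypothetical decorator: records a step's compute requirements
    as metadata that the scheduler could translate into a Karpenter
    node claim. Illustrative only."""
    def wrap(fn):
        fn.requirements = {
            "gpu": gpu,
            "gpu_type": gpu_type,
            "memory_gb": memory_gb,
        }
        return fn
    return wrap

@resources(gpu=8, gpu_type="H100", memory_gb=640)
def train_step():
    # Training code runs on the node provisioned to match
    # the requirements above.
    pass
```

The decorator only annotates the function; in this model, the engine reads the metadata at submission time and requests a matching node.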
Every step's output is automatically versioned and stored in S3. Resume, reproduce, and share results across runs. Metadata tracked in Aurora PostgreSQL.
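One common way to implement this kind of versioning is content addressing: hash the serialized output to derive a stable object key, so identical results map to identical S3 objects and resume becomes a cheap lookup. A toy sketch — the key layout is an assumption, not the platform's actual scheme:

```python
import hashlib
import pickle

def artifact_key(run_id, step_name, value):
    """Derive a deterministic, content-addressed key for a step output.
    The same value always hashes to the same object key, which makes
    resuming and sharing results across runs inexpensive."""
    digest = hashlib.sha256(pickle.dumps(value)).hexdigest()
    return f"artifacts/{run_id}/{step_name}/{digest}"

key = artifact_key("run-42", "train", {"loss": 0.17})
```

Metadata (run id, step name, key) would then be recorded in a catalog such as the Aurora PostgreSQL store mentioned above.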
Define DAGs with Python decorators. Argo Workflows handles scheduling, retries, and parallel execution.
Chain steps with foreach for hyperparameter sweeps.
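A minimal in-process sketch of what a decorator-defined DAG with a foreach fan-out might look like — the `@step` marker and the fan-out spelling are assumptions; the real engine would run each step as its own Argo Workflows pod:

```python
def step(fn):
    """Hypothetical marker: in the real engine each step runs in its
    own pod; here we simply call the functions in-process."""
    return fn

@step
def train(lr):
    # Placeholder objective: pretend loss shrinks with smaller lr.
    return {"lr": lr, "loss": lr * 10}

@step
def select_best(results):
    # Join step: reduce the fan-out back to a single result.
    return min(results, key=lambda r: r["loss"])

# foreach-style fan-out over a hyperparameter grid, then a join.
results = [train(lr) for lr in (1e-2, 1e-3, 1e-4)]
best = select_best(results)
```

In the managed setting, the fan-out branches would execute in parallel pods with retries handled by Argo rather than sequentially as here.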
Amazon SSO via Federate. No shared credentials: IRSA for workload identity, Cognito for user auth, and OIDC for browser access. Just Midway in.

A built-in web UI to browse runs, inspect artifacts, debug failures, and monitor training curves in real time with live-updating cards.
Save training state periodically with @checkpoint. Resume from the last good checkpoint after failures. Pair with @retry for automatic recovery.
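A self-contained sketch of the retry-plus-checkpoint pattern. The decorator name mirrors @retry above, but the implementation and the in-memory `STORE` are illustrative stand-ins, not the platform's S3-backed machinery:

```python
import functools

STORE = {}  # stand-in for S3-backed checkpoint storage

def retry(times=3):
    """Hypothetical @retry: rerun the step on failure, up to `times` attempts."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            last_err = None
            for _ in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception as err:
                    last_err = err
            raise last_err
        return inner
    return wrap

@retry(times=3)
def train(total_epochs=5):
    # Resume from the last good checkpoint instead of epoch 0.
    epoch = STORE.get("epoch", 0)
    while epoch < total_epochs:
        epoch += 1
        STORE["epoch"] = epoch  # @checkpoint would persist this state
        if epoch == 2 and not STORE.get("failed_once"):
            STORE["failed_once"] = True
            raise RuntimeError("simulated node loss")
    return epoch

# The first attempt dies at epoch 2; the retry resumes from the checkpoint.
assert train() == 5
```

The key property: because state is persisted every epoch, the retried attempt picks up at epoch 2 rather than repeating work from epoch 0.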
Six isolated CloudFormation stacks. Data separated from compute for safety. Designed so the EKS cluster can be rebuilt without losing a single artifact.
Everything you need to go from zero to running your first flow on GPU infrastructure.