Deploy Your Own

Full operator guide for deploying an isolated RDE Workflow Engine environment in your AWS account using CDK. Covers infrastructure, authentication, and day-two operations.

Estimated time: ~45 minutes

Architecture

What Gets Deployed

The @amzn/ar-metaflow-cdk-constructs package deploys a complete Metaflow platform as a stack-per-concern architecture. Each stack is independently deployable and updateable.

| Stack | CDK Class | Purpose | Time |
|---|---|---|---|
| VPC | VpcStack | VPC with public/private subnets, secondary CIDR for pods, VPC endpoints | 3–5 min |
| DNS | DnsStack | Stage subdomain, cross-account NS delegation, wildcard ACM certificate | 2–3 min |
| Data | DataStack | Aurora PostgreSQL (metadata store) + S3 bucket (artifact store) | 5–8 min |
| Auth | AuthenticationStack | Cognito User Pool with Amazon Federate OIDC integration | 2–3 min |
| EKS | EksClusterStack | EKS 1.34 Auto Mode, Argo Workflows/Events, External Secrets, Kyverno, Metaflow Helm, GPU/CPU node pools | 15–20 min |
| Ingress | IngressStack | Gateway API, Envoy Gateway, External DNS, NLB with TLS, OIDC auth | 5–8 min |

Deploy order: VPC → DNS → Data → Auth → EKS → Ingress

Separation of data and compute. The Data stack (RDS + S3) uses the RETAIN deletion policy, so your metadata and artifacts survive stack updates and accidental deletions. The compute layer (EKS, Ingress) can be destroyed and recreated without data loss.

Prerequisites

Before You Start

AWS Account Setup

CLI Tools

| Tool | Version | Purpose |
|---|---|---|
| aws | v2+ | AWS API calls, CDK deploy |
| ada | any | Amazon credential helper (Conduit) |
| kubectl | 1.28+ | Kubernetes cluster verification |
| brazil-build | any | Amazon build system |
| node | 18+ | CDK synthesis |
| python | 3.10+ | Login script execution |
| mwinit | any | Midway SSO refresh |

CDK Configuration

Ensure your cdk.json includes:

{
  "requireApproval": "never"
}

EKS AZ restrictions. us-east-1, us-west-1, and ca-central-1 have AZs where EKS cannot place control plane components. Run bb app run dev-utils:context to populate AZ context before first synthesis.

Stage Config

Define Your Stage

Add an entry to the STAGES map in lib/constants.ts:

export const STAGES: Record<string, StageConfig> = {
  "my-stage": {
    accountId: "123456789012",
    region: "us-west-2",
    isProd: false,
    oidcConfig: {
      clientId: "iss-metaflow-my-stage",
      idpProviderName: "iss-metaflow-my-stage",
      issuerUrl: "https://idp.federate.amazon.com",
    },
  },
};

| Field | Required | Description |
|---|---|---|
| accountId | Yes | 12-digit AWS account ID |
| region | Yes | AWS region (e.g., us-west-2) |
| isProd | Yes | Enables termination protection and deletion protection |
| oidcConfig.clientId | Yes | Cognito app client name; must match the Federate registration |
| oidcConfig.idpProviderName | Yes | Cognito identity provider name (≤32 characters) |
| oidcConfig.issuerUrl | Yes | Always https://idp.federate.amazon.com (production) |

Naming convention. Use iss-metaflow-<stage> for both clientId and idpProviderName. The issuerUrl is always the production Federate endpoint, even for non-prod stages.

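The field rules above can be checked before synthesis. The following is a minimal sketch, not part of the package: the StageConfig and OidcConfig shapes are inferred from the field table (the real types in lib/constants.ts may differ), and validateStageConfig is a hypothetical helper name.

```typescript
// Hypothetical shapes inferred from the field table; the real StageConfig
// type in lib/constants.ts may carry additional fields.
interface OidcConfig {
  clientId: string;
  idpProviderName: string;
  issuerUrl: string;
}

interface StageConfig {
  accountId: string;
  region: string;
  isProd: boolean;
  oidcConfig: OidcConfig;
}

const FEDERATE_ISSUER = "https://idp.federate.amazon.com";
const MAX_IDP_NAME_LENGTH = 32; // Cognito identity provider name limit

// Returns a list of human-readable problems; an empty list means the entry
// passed every check in this sketch.
function validateStageConfig(stage: string, cfg: StageConfig): string[] {
  const problems: string[] = [];
  if (!/^\d{12}$/.test(cfg.accountId)) {
    problems.push(`${stage}: accountId must be exactly 12 digits`);
  }
  if (cfg.oidcConfig.idpProviderName.length > MAX_IDP_NAME_LENGTH) {
    problems.push(`${stage}: idpProviderName exceeds ${MAX_IDP_NAME_LENGTH} characters`);
  }
  if (cfg.oidcConfig.issuerUrl !== FEDERATE_ISSUER) {
    problems.push(`${stage}: issuerUrl must be ${FEDERATE_ISSUER}`);
  }
  // Naming convention from this guide: iss-metaflow-<stage> for both names.
  const expected = `iss-metaflow-${stage}`;
  if (cfg.oidcConfig.clientId !== expected || cfg.oidcConfig.idpProviderName !== expected) {
    problems.push(`${stage}: clientId and idpProviderName should both be ${expected}`);
  }
  return problems;
}
```

Because the iss-metaflow- prefix is 13 characters long, following the naming convention leaves at most 19 characters for the stage name before the 32-character limit is hit.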
Deployment

First-Time Deployment

Authenticate

ada credentials update \
  --account <ACCOUNT_ID> \
  --provider conduit \
  --role IibsAdminAccess-DO-NOT-DELETE \
  --once \
  --profile iss-metaflow-<stage>

Build

brazil-build release

Deploy Stacks

Deploy in order. Each command is idempotent — safe to re-run on failure.

bb cdk deploy <stage>-vpc       # 3-5 min
bb cdk deploy <stage>-dns       # 2-3 min
bb cdk deploy <stage>-data      # 5-8 min
bb cdk deploy <stage>-auth      # 2-3 min  ← record CfnOutputs after this
bb cdk deploy <stage>-eks       # 15-20 min
bb cdk deploy <stage>-ingress   # 5-8 min

# Total: ~32-47 min (sum of the per-stack estimates)

Record the auth stack outputs. After the auth stack deploys, note the CloudFormation outputs; you need them for Federate registration: FederateClientId, FederateRedirectUri, FederateSecretStoreCommand, FederateCognitoUpdateCommand, FederateKillSwitchRoleArn.

EKS deploy duration. The EKS stack takes 15–20 minutes. Run ada credentials update --once immediately before this stack to avoid credential expiry mid-deploy.

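The stack order can be kept in one place so that deploy and teardown never drift apart. This is a minimal sketch, assuming the <stage>-<suffix> stack naming used in the commands above; DEPLOY_ORDER and stackCommands are hypothetical names, and the function only builds command strings rather than running them.

```typescript
// Stack suffixes in dependency order, as listed in this guide.
const DEPLOY_ORDER = ["vpc", "dns", "data", "auth", "eks", "ingress"] as const;

type Action = "deploy" | "destroy";

// Builds the `bb cdk` command lines for a stage. Deploys follow the
// dependency order; destroys walk the same list in reverse.
function stackCommands(stage: string, action: Action): string[] {
  const order = action === "deploy" ? [...DEPLOY_ORDER] : [...DEPLOY_ORDER].reverse();
  return order.map((suffix) => `bb cdk ${action} ${stage}-${suffix}`);
}

// Print the deploy sequence for a stage named "my-stage".
console.log(stackCommands("my-stage", "deploy").join("\n"));
```

Calling stackCommands("my-stage", "destroy") yields the reverse sequence described in the Teardown section, starting with the ingress stack.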
Federate Setup

Register with Amazon Federate

One-time manual step after the auth stack deploys. Connects Cognito to Amazon's corporate identity provider.

1. Create OIDC profile on Federate

Go to prod.ep.federate.a2z.com and create a new OIDC profile:

  • Type: Corporate
  • Template: Pre-Approved (Federate-Cognito)
  • Interface: Web interface
  • Device: Company laptops
  • Client ID: paste FederateClientId from CfnOutputs
  • Redirect URI: paste FederateRedirectUri from CfnOutputs
  • PKCE: Off · Client Secrets: On

2. Configure Logout & Access

  • Add Logout Configuration → set IAM Role ARN from FederateKillSwitchRoleArn
  • Add your team's POSIX/LDAP groups for access control

3. Save and copy the client secret

The Federate client secret is shown only once, when it is created, so copy it immediately. If you lose it, you must generate a new one in Federate and redo the steps below.

Store the Secret

aws secretsmanager put-secret-value \
  --secret-id <stage-client-id>/federate-client-secret \
  --secret-string '{"FEDERATE_CLIENT_SECRET": "<YOUR_SECRET>"}' \
  --region <region> --profile iss-metaflow-<stage>

Update Cognito

aws cognito-idp update-identity-provider \
  --user-pool-id <USER_POOL_ID> \
  --provider-name <IDP_PROVIDER_NAME> \
  --provider-details '{
    "client_id": "<CLIENT_ID>",
    "client_secret": "<YOUR_SECRET>",
    "authorize_scopes": "openid email profile",
    "attributes_request_method": "GET",
    "oidc_issuer": "https://idp.federate.amazon.com"
  }' \
  --region <region> --profile iss-metaflow-<stage>

Pre-filled commands. Both commands above are provided in the auth stack CfnOutputs as FederateSecretStoreCommand and FederateCognitoUpdateCommand; just substitute <YOUR_SECRET>.

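If the secret contains quotes or backslashes, hand-editing the JSON in the two commands is error-prone. The sketch below builds both payloads from a single secret value so they stay in sync and are escaped correctly. The key names are copied from the commands above; federateSecretString and cognitoProviderDetails are hypothetical helper names, not part of any CLI or package.

```typescript
// Payload for `aws secretsmanager put-secret-value --secret-string`.
function federateSecretString(secret: string): string {
  return JSON.stringify({ FEDERATE_CLIENT_SECRET: secret });
}

// Payload for `aws cognito-idp update-identity-provider --provider-details`.
// Scopes, request method, and issuer mirror the command shown in this guide.
function cognitoProviderDetails(clientId: string, secret: string): string {
  return JSON.stringify({
    client_id: clientId,
    client_secret: secret,
    authorize_scopes: "openid email profile",
    attributes_request_method: "GET",
    oidc_issuer: "https://idp.federate.amazon.com",
  });
}
```

JSON.stringify handles the JSON escaping only; the resulting string still needs to be quoted for your shell when pasted into a command.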
Verification

Login and Verify

ar-metaflow login --account <ACCOUNT_ID> --region <region>

Source the environment and add to your shell profile:

source ~/.metaflowconfig/env.sh

# Add to ~/.zshrc or ~/.bashrc (one-time)
[ -f ~/.metaflowconfig/env.sh ] && source ~/.metaflowconfig/env.sh

Run a test flow:

python hello_flow.py run                       # local — uses remote metadata
python hello_flow.py --with kubernetes run     # remote — steps run as K8s pods

Full-stack smoke test. If the Kubernetes run succeeds, your entire stack is working: DNS, TLS, Cognito/Federate auth, EKS, Karpenter, and the Metaflow metadata service.

Metaflow UI

https://<stage>.metaflow.simulation.amazon.dev

| Path | Purpose |
|---|---|
| / | Metaflow UI: run browser, DAG visualizer, artifact inspector |
| /api/ | Backend API (used internally by the UI) |
| /cli/ | Metadata service endpoint (used by the Metaflow CLI) |

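For scripts that probe a deployment, the three endpoints can be derived from the stage name. This is a small sketch that assumes the hostname pattern and paths shown above; metaflowEndpoints is a hypothetical helper name.

```typescript
// Hostname pattern and paths are taken from the URL and path table in this guide.
function metaflowEndpoints(stage: string): { ui: string; api: string; cli: string } {
  const base = `https://${stage}.metaflow.simulation.amazon.dev`;
  return {
    ui: `${base}/`,      // Metaflow UI
    api: `${base}/api/`, // backend API used by the UI
    cli: `${base}/cli/`, // metadata service used by the Metaflow CLI
  };
}
```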
Teardown

Removing a Deployment

Destroy stacks in reverse order. For isProd: true stages, disable termination protection first.

bb cdk destroy <stage>-ingress
bb cdk destroy <stage>-eks
bb cdk destroy <stage>-auth
bb cdk destroy <stage>-data
bb cdk destroy <stage>-dns
bb cdk destroy <stage>-vpc

Retained resources. These survive stack destruction and must be deleted manually:
  • Aurora RDS: disable deletion protection, delete instances, then delete the cluster
  • S3 artifact bucket: delete all object versions and delete markers, then delete the bucket
  • Route53 CNAME: delete ACM validation records before the hosted zone
  • VPC endpoints & GuardDuty SGs: delete endpoints, then wait for ENIs to release

Troubleshooting

Common Issues

invalid_client error during login

The Federate client secret is missing or incorrect. Re-run FederateSecretStoreCommand and FederateCognitoUpdateCommand from the auth stack outputs. Ensure you're using Prod Federate (not Integ).

UI shows “Looking for connection…”

The Metaflow UI backend isn't reachable. Verify the ingress stack deployed, check DNS resolution (nslookup <stage>.metaflow.simulation.amazon.dev), and inspect EKS pods (kubectl get pods -n metaflow).

DNS not resolving after deploy

NS delegation takes 2–5 minutes to propagate. If the name still does not resolve after that, flush the local DNS cache (macOS): sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder

cdk deploy hangs silently

Check cdk.json has "requireApproval": "never". Verify ada credentials haven't expired.

Credential error during EKS deploy

EKS takes 15–20 minutes. If ada tokens expire, the stack rolls back. Re-authenticate and re-run — CDK resumes from the last successful state.

401 Unauthorized running flows

JWT tokens expire after ~1 hour. Re-run:

ar-metaflow login --account <ACCOUNT_ID> --region <region>
source ~/.metaflowconfig/env.sh
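A script can check whether a token is about to expire before starting a long run, instead of waiting for the 401. The sketch below reads the standard exp claim (seconds since the epoch) from a JWT payload without verifying the signature. jwtExpiry and needsLogin are hypothetical helper names, and where ar-metaflow login stores the token is not covered here, so the token is passed in as a string.

```typescript
// Decodes the payload segment of a JWT WITHOUT verifying the signature.
// Only suitable for reading the expiry locally; never use it to trust a token.
function jwtExpiry(token: string): number | undefined {
  const parts = token.split(".");
  if (parts.length !== 3) return undefined;
  // base64url -> base64, then restore the padding that JWTs omit.
  const b64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  try {
    const claims = JSON.parse(atob(padded));
    return typeof claims.exp === "number" ? claims.exp : undefined;
  } catch {
    return undefined; // not valid base64 or not valid JSON
  }
}

// True when the token has no readable expiry or expires within `skewSeconds`.
function needsLogin(token: string, nowMs: number = Date.now(), skewSeconds = 60): boolean {
  const exp = jwtExpiry(token);
  return exp === undefined || exp * 1000 <= nowMs + skewSeconds * 1000;
}
```

When needsLogin returns true, re-run the login command shown above. The 60-second skew is an arbitrary default chosen for this sketch.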

“Export not found” during deploy

Stacks deployed out of order. Deploy in sequence: VPC → DNS → Data → Auth → EKS → Ingress.

cdk8s instanceof mismatch

Duplicate cdk8s packages in node_modules. Run npm run clean && npm install && npm run build.

Synth fails with “provider name too long”

Cognito identity provider names have a 32-character limit. Shorten your idpProviderName in the stage config.