# Deploy Your Own

Full operator guide for deploying an isolated RDE Workflow Engine environment in your AWS account using CDK. Covers infrastructure, authentication, and day-two operations.
## What Gets Deployed

The `@amzn/ar-metaflow-cdk-constructs` package deploys a complete Metaflow platform as a stack-per-concern architecture. Each stack is independently deployable and updateable.

| Stack | CDK Class | Purpose | Time |
|---|---|---|---|
| VPC | `VpcStack` | VPC with public/private subnets, secondary CIDR for pods, VPC endpoints | 3–5 min |
| DNS | `DnsStack` | Stage subdomain, cross-account NS delegation, wildcard ACM certificate | 2–3 min |
| Data | `DataStack` | Aurora PostgreSQL (metadata store) + S3 bucket (artifact store) | 5–8 min |
| Auth | `AuthenticationStack` | Cognito User Pool with Amazon Federate OIDC integration | 2–3 min |
| EKS | `EksClusterStack` | EKS 1.34 Auto Mode, Argo Workflows/Events, External Secrets, Kyverno, Metaflow Helm, GPU/CPU node pools | 15–20 min |
| Ingress | `IngressStack` | Gateway API, Envoy Gateway, External DNS, NLB with TLS, OIDC auth | 5–8 min |

Deploy order: VPC → DNS → Data → Auth → EKS → Ingress
The data stores are created with a `RETAIN` deletion policy — your metadata and artifacts survive stack updates and accidental deletions. The compute layer (EKS, Ingress) can be destroyed and recreated without data loss.
## Before You Start

### AWS Account Setup

- AWS account provisioned and CDK bootstrapped (`cdk bootstrap aws://<ACCOUNT_ID>/<REGION>`)
- IAM role `IibsAdminAccess-DO-NOT-DELETE` available in the target account
- Root DNS zone deployed via `RootDnsStack` (one-time, in the DNS account)
- Sufficient service quotas for EKS, GPU instances (g5/g6/p4d), and NLB
### CLI Tools

| Tool | Version | Purpose |
|---|---|---|
| `aws` | v2+ | AWS API calls, CDK deploy |
| `ada` | any | Amazon credential helper (Conduit) |
| `kubectl` | 1.28+ | Kubernetes cluster verification |
| `brazil-build` | any | Amazon build system |
| `node` | 18+ | CDK synthesis |
| `python` | 3.10+ | Login script execution |
| `mwinit` | any | Midway SSO refresh |
### CDK Configuration

Ensure your `cdk.json` includes:

```json
{
  "requireApproval": "never"
}
```

Note: `us-east-1`, `us-west-1`, and `ca-central-1` have AZs where EKS cannot place control plane components. Run `bb app run dev-utils:context` to populate AZ context before first synthesis.
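The populated context ends up in `cdk.context.json`. As a sketch of what to expect after running the context utility (the account ID, region, and AZ names below are placeholders — yours will differ):

```json
{
  "availability-zones:account=123456789012:region=us-west-2": [
    "us-west-2a",
    "us-west-2b",
    "us-west-2c"
  ]
}
```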
## Define Your Stage

Add an entry to the `STAGES` map in `lib/constants.ts`:

```typescript
export const STAGES: Record<string, StageConfig> = {
  "my-stage": {
    accountId: "123456789012",
    region: "us-west-2",
    isProd: false,
    oidcConfig: {
      clientId: "iss-metaflow-my-stage",
      idpProviderName: "iss-metaflow-my-stage",
      issuerUrl: "https://idp.federate.amazon.com",
    },
  },
};
```
| Field | Required | Description |
|---|---|---|
| `accountId` | Yes | 12-digit AWS account ID |
| `region` | Yes | AWS region (e.g., `us-west-2`) |
| `isProd` | Yes | Enables termination protection and deletion protection |
| `oidcConfig.clientId` | Yes | Cognito app client name — must match Federate registration |
| `oidcConfig.idpProviderName` | Yes | Cognito identity provider name (≤32 characters) |
| `oidcConfig.issuerUrl` | Yes | Always `https://idp.federate.amazon.com` (production) |
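These constraints are easy to get wrong and only surface at synth or deploy time. A minimal pre-flight check, sketched as standalone Python (the field names mirror the `StageConfig` shape above; any real validation lives in the CDK code, not here):

```python
import re

def validate_stage_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # accountId must be exactly 12 digits
    if not re.fullmatch(r"\d{12}", cfg.get("accountId", "")):
        problems.append("accountId must be a 12-digit AWS account ID")
    oidc = cfg.get("oidcConfig", {})
    # Cognito rejects identity provider names longer than 32 characters
    if not 0 < len(oidc.get("idpProviderName", "")) <= 32:
        problems.append("idpProviderName must be 1-32 characters")
    # issuerUrl is always the production Federate endpoint
    if oidc.get("issuerUrl") != "https://idp.federate.amazon.com":
        problems.append("issuerUrl must be https://idp.federate.amazon.com")
    return problems
```

Running this over your new `STAGES` entry before `brazil-build release` catches the two most common synth failures (bad account ID, over-long provider name) early.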
Use the convention `iss-metaflow-<stage>` for both `clientId` and `idpProviderName`. The `issuerUrl` is always the production Federate endpoint — even for non-prod stages.
## First-Time Deployment

### Authenticate

```shell
ada credentials update \
  --account <ACCOUNT_ID> \
  --provider conduit \
  --role IibsAdminAccess-DO-NOT-DELETE \
  --once \
  --profile iss-metaflow-<stage>
```

### Build

```shell
brazil-build release
```

### Deploy Stacks

Deploy in order. Each command is idempotent — safe to re-run on failure.

```shell
bb cdk deploy <stage>-vpc      # 3-5 min
bb cdk deploy <stage>-dns      # 2-3 min
bb cdk deploy <stage>-data     # 5-8 min
bb cdk deploy <stage>-auth     # 2-3 min  ← record CfnOutputs after this
bb cdk deploy <stage>-eks      # 15-20 min
bb cdk deploy <stage>-ingress  # 5-8 min
# Total: ~30-45 min
```
Record these auth stack CfnOutputs — the Federate registration below needs them: `FederateClientId`, `FederateRedirectUri`, `FederateSecretStoreCommand`, `FederateCognitoUpdateCommand`, `FederateKillSwitchRoleArn`.

The EKS stack is the longest deploy. Run `ada credentials update --once` immediately before this stack to avoid credential expiry mid-deploy.
## Register with Amazon Federate

One-time manual step after the auth stack deploys. Connects Cognito to Amazon's corporate identity provider.

### Create OIDC profile on Federate

Go to prod.ep.federate.a2z.com and create a new OIDC profile:

- Type: Corporate
- Template: Pre-Approved (Federate-Cognito)
- Interface: Web interface
- Device: Company laptops
- Client ID: paste `FederateClientId` from CfnOutputs
- Redirect URI: paste `FederateRedirectUri` from CfnOutputs
- PKCE: Off · Client Secrets: On
### Configure Logout & Access

- Add Logout Configuration → set IAM Role ARN from `FederateKillSwitchRoleArn`
- Add your team's POSIX/LDAP groups for access control

### Save and Copy the Client Secret

The secret is displayed exactly once. Copy it immediately.
### Store the Secret

```shell
aws secretsmanager put-secret-value \
  --secret-id <stage-client-id>/federate-client-secret \
  --secret-string '{"FEDERATE_CLIENT_SECRET": "<YOUR_SECRET>"}' \
  --region <region> --profile iss-metaflow-<stage>
```
### Update Cognito

```shell
aws cognito-idp update-identity-provider \
  --user-pool-id <USER_POOL_ID> \
  --provider-name <IDP_PROVIDER_NAME> \
  --provider-details '{
    "client_id": "<CLIENT_ID>",
    "client_secret": "<YOUR_SECRET>",
    "authorize_scopes": "openid email profile",
    "attributes_request_method": "GET",
    "oidc_issuer": "https://idp.federate.amazon.com"
  }' \
  --region <region> --profile iss-metaflow-<stage>
```
Both commands are emitted pre-filled as the CfnOutputs `FederateSecretStoreCommand` and `FederateCognitoUpdateCommand` — just substitute `<YOUR_SECRET>`.
## Login and Verify

```shell
ar-metaflow login --account <ACCOUNT_ID> --region <region>
```

Source the environment and add to your shell profile:

```shell
source ~/.metaflowconfig/env.sh

# Add to ~/.zshrc or ~/.bashrc (one-time)
[ -f ~/.metaflowconfig/env.sh ] && source ~/.metaflowconfig/env.sh
```

Run a test flow:

```shell
python hello_flow.py run                    # local — uses remote metadata
python hello_flow.py --with kubernetes run  # remote — steps run as K8s pods
```
## Metaflow UI

https://<stage>.metaflow.simulation.amazon.dev

| Path | Purpose |
|---|---|
| `/` | Metaflow UI — run browser, DAG visualizer, artifact inspector |
| `/api/` | Backend API (used internally by the UI) |
| `/cli/` | Metadata service endpoint (used by the Metaflow CLI) |
## Removing a Deployment

Destroy stacks in reverse order. For `isProd: true` stages, disable termination protection first.

```shell
bb cdk destroy <stage>-ingress
bb cdk destroy <stage>-eks
bb cdk destroy <stage>-auth
bb cdk destroy <stage>-data
bb cdk destroy <stage>-dns
bb cdk destroy <stage>-vpc
```

Retained resources require manual cleanup:

- Aurora RDS — disable deletion protection, delete instances, delete cluster
- S3 artifact bucket — delete all object versions + delete markers, then delete bucket
- Route53 CNAME — delete ACM validation records before the hosted zone
- VPC endpoints & GuardDuty SGs — delete endpoints, wait for ENIs to release
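The S3 step is the fiddly one: a versioned bucket can't be emptied with a plain `aws s3 rm`, because every version and delete marker must be deleted explicitly, and `DeleteObjects` accepts at most 1,000 keys per call. A sketch of the batching logic (pure Python; the boto3 wiring around it — paginating `list_object_versions` and calling `delete_objects` — is assumed):

```python
def delete_batches(object_versions, batch_size=1000):
    """Group (Key, VersionId) pairs into DeleteObjects-sized payloads.

    `object_versions` is any iterable of dicts shaped like the entries in
    list_object_versions()'s 'Versions' and 'DeleteMarkers' lists.
    """
    batch = []
    for v in object_versions:
        batch.append({"Key": v["Key"], "VersionId": v["VersionId"]})
        if len(batch) == batch_size:
            yield {"Objects": batch, "Quiet": True}
            batch = []
    if batch:  # final partial batch
        yield {"Objects": batch, "Quiet": True}
```

Each yielded dict can be passed as the `Delete=` argument to `s3.delete_objects(Bucket=...)`; once no versions remain, the bucket itself can be deleted.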
## Common Issues

### `invalid_client` error during login

The Federate client secret is missing or incorrect. Re-run `FederateSecretStoreCommand` + `FederateCognitoUpdateCommand` from the auth stack outputs. Ensure you're using Prod Federate (not Integ).

### UI shows “Looking for connection…”

The Metaflow UI backend isn't reachable. Verify the ingress stack deployed, check DNS resolution (`nslookup <stage>.metaflow.simulation.amazon.dev`), and inspect EKS pods (`kubectl get pods -n metaflow`).

### DNS not resolving after deploy

NS delegation takes 2–5 minutes. Flush your local DNS cache (macOS):

```shell
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder
```
cdk deploy hangs silently
Check cdk.json has "requireApproval": "never". Verify ada credentials
haven't expired.
### Credential error during EKS deploy

EKS takes 15–20 minutes. If `ada` tokens expire, the stack rolls back. Re-authenticate and re-run — CDK resumes from the last successful state.
### 401 Unauthorized running flows

JWT tokens expire after ~1 hour. Re-run:

```shell
ar-metaflow login --account <ACCOUNT_ID> --region <region>
source ~/.metaflowconfig/env.sh
```
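To confirm that token expiry is actually the problem before re-running login, you can decode the JWT's `exp` claim locally (stdlib-only sketch; where your token is cached depends on your `~/.metaflowconfig` setup, so it's taken as a string here):

```python
import base64
import json
import time


def jwt_expired(token: str, skew_seconds: int = 60) -> bool:
    """True if the JWT's exp claim is in the past (with a clock-skew margin).

    Decodes the payload only; no signature verification is performed.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] <= time.time() + skew_seconds
```

If this returns `False` but you still see 401s, the problem is more likely the Cognito/Federate configuration than token age.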
### “Export not found” during deploy

Stacks deployed out of order. Deploy in sequence: VPC → DNS → Data → Auth → EKS → Ingress.
### cdk8s `instanceof` mismatch

Duplicate cdk8s packages in `node_modules`. Run `npm run clean && npm install && npm run build`.
### Synth fails with “provider name too long”

Cognito identity provider names have a 32-character limit. Shorten your `idpProviderName` in the stage config.