Amazon Bedrock, 2026
Enabling enterprises to migrate models without losing trust, quality, or control

PROBLEM OVERVIEW
Amazon Bedrock is a managed GenAI platform used by enterprises running production AI systems in regulated, customer-facing environments. The central challenge is that foundation models evolve rapidly, but enterprise expectations for stability do not.
Model upgrades change generative AI behavior in subtle ways. Prompts migrated across models or providers often underperform and require extensive re-optimization. For large organizations, an “upgrade” becomes a major operational project, not a quick configuration change.
One customer spent 5,400+ engineering hours on a single migration.
Enterprises upgrading AI models in production had no structured process. One customer spent 5,400 engineering hours on a single migration. I designed a three-gate control plane that consolidated four fragmented engineering practices into a repeatable one-to-two week lifecycle. The system surfaces analysis at every stage. Engineers choose the model. Nothing reaches production without an explicit human decision.
This work treated model migration as a change-management problem, not an optimization exercise. The primary failure modes designed against were silent behavior change, inefficient token utilization, and unexplainable shifts in output quality.
Model migration sits between system reliability and response quality. I designed for two primary personas.
Primary focus
Building reliable AI applications in production.
Responsibilities
Needs
Primary focus
Getting the best possible responses from the model.
Responsibilities
Needs
Shared Constraints
Both personas operate within model limits, balance cost and performance, and need validation before committing to a model change.
Model migration was designed as a decision lifecycle, not a workflow. The lifecycle consisted of three stages that progressively increased confidence while making cost and risk legible.
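As a rough illustration of the lifecycle idea, the sketch below models the three stages as gates that only advance when a named engineer records a decision. The stage names follow the stages described in this case study; the `MigrationJob` class, its `approve` method, and the `skip_optimization` flag are hypothetical names invented for this example and do not reflect the actual Bedrock implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    INITIAL_EVALUATION = 1
    PROMPT_OPTIMIZATION = 2
    SHADOW_TESTING = 3
    PRODUCTION = 4


@dataclass
class MigrationJob:
    """Hypothetical job record: tracks the current stage and who approved each gate."""
    stage: Stage = Stage.INITIAL_EVALUATION
    decisions: list = field(default_factory=list)

    def approve(self, engineer: str, skip_optimization: bool = False) -> Stage:
        """Advance one gate, but only with an explicit, attributed human decision."""
        if self.stage is Stage.PRODUCTION:
            raise ValueError("migration already complete")
        if not engineer:
            raise PermissionError("an explicit human decision is required")
        next_stage = Stage(self.stage.value + 1)
        # Optimization is a lever, not a requirement: teams may go straight
        # from the initial evaluation to shadow testing.
        if next_stage is Stage.PROMPT_OPTIMIZATION and skip_optimization:
            next_stage = Stage.SHADOW_TESTING
        self.decisions.append((self.stage, engineer))
        self.stage = next_stage
        return next_stage
```

Keeping the decision log on the job itself means every stage transition can be traced back to the person who made it, which is the change-management property the lifecycle is built around.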

To start a model migration, users provide two inputs: invocation log data from their application (system logs of what the AI did and what the result was) and the models they want to target.
When they start the job, the system runs an Initial Evaluation that compares the performance of the source model against each selected target model. This stage acts as a quick decision point on whether the migration is worth pursuing.
WHAT USERS NEED HERE
The results screen surfaces performance measurements side-by-side. In this example, Claude Sonnet 4.5 shows higher accuracy and significantly lower latency and cost than Sonnet 3.5. The invocation log table below lets engineers drill into individual prompt pairs, seeing the source and optimized outputs side by side with accuracy deltas at the prompt level. Change is visible. My user decides whether the signal is strong enough to proceed.
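The side-by-side comparison can be thought of as two levels of aggregation over the same invocation records. The sketch below is a minimal illustration, assuming each record already carries an accuracy score, a latency, and a cost; the `Invocation` fields and the `compare` and `prompt_deltas` helpers are hypothetical names for this example, and how accuracy is scored is left to whatever evaluator a team uses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Invocation:
    """One evaluated prompt run (hypothetical record shape)."""
    prompt_id: str
    accuracy: float    # 0.0 to 1.0, produced by an external evaluator
    latency_ms: float
    cost_usd: float


def summarize(records):
    """Aggregate one model's runs: mean accuracy and latency, total cost."""
    return {
        "accuracy": mean(r.accuracy for r in records),
        "latency_ms": mean(r.latency_ms for r in records),
        "cost_usd": sum(r.cost_usd for r in records),
    }


def compare(source, target):
    """Model-level view: source value, target value, and delta per metric."""
    s, t = summarize(source), summarize(target)
    return {k: {"source": s[k], "target": t[k], "delta": t[k] - s[k]} for k in s}


def prompt_deltas(source, target):
    """Prompt-level view: accuracy change for each prompt present in both runs."""
    target_by_id = {r.prompt_id: r for r in target}
    return {
        r.prompt_id: target_by_id[r.prompt_id].accuracy - r.accuracy
        for r in source
        if r.prompt_id in target_by_id
    }
```

The model-level report answers "is the signal strong enough to proceed," while the prompt-level deltas support the drill-down into individual prompt pairs.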
My users have a variety of methods to provide prompt templates to the system. They can use our system to perform prompt optimization, or skip directly to shadow testing.
The goal of this stage is to improve relative quality and cost efficiency before deeper investment. Optimization is a lever, not a requirement, and should not block teams that want to evaluate with their own methods.
After providing optimized prompt templates and test data, users configure a traffic sample and duration to begin shadow testing. Shadow testing produces production-like evidence without production impact.
WHAT USERS NEED HERE
Shadow testing runs the target models against live production traffic in parallel with the current model. Results stream in real time so engineers can monitor performance as the test runs. At the end, engineers make the final migration decision based on what they see.
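One common way to get production-like evidence without production impact is to always answer from the current model and, for a sampled fraction of requests, also call the target model and record both outputs. The sketch below illustrates that pattern under stated assumptions: models are plain callables, `sample_rate` stands in for the configured traffic sample, and `in_sample` and `handle_request` are hypothetical names. It runs the shadow call inline for simplicity, whereas a real system would likely run it asynchronously so it adds no latency.

```python
import hashlib


def in_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministically place a request in or out of the shadow sample."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")      # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def handle_request(request_id, prompt, source_model, target_model,
                   sample_rate, shadow_log):
    """Serve from the source model; optionally record a shadow run of the target."""
    response = source_model(prompt)
    if in_sample(request_id, sample_rate):
        entry = {"request_id": request_id, "source": response}
        try:
            entry["target"] = target_model(prompt)
        except Exception as exc:
            # A failing target must never affect the production response.
            entry["error"] = repr(exc)
        shadow_log.append(entry)
    return response
```

Hashing the request ID rather than drawing a random number keeps sampling reproducible, so the same request always lands in the same bucket across retries.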
Enterprise customers who reviewed the system reported it would significantly reduce the chaos they had experienced with model migrations. A process that was too risky to run repeatedly became a controlled, repeatable one-to-two week lifecycle.