kevin.energy · Essay · No. 03

Does the mental model that runs A/B testing also discipline AI exploration?

Building 40 AI prototypes for a portfolio site revealed something: the rigor that separates good experiments from bad ones is the same whether you're running a conversion test or generating a hundred creative variants.

Kevin Nguyen · March 2026 · 5 min

I recently built a portfolio site by generating 40 different prototypes with Claude: fluid simulations, particle systems, magnetic field visualizations, and more. What made the process work was not a novel AI methodology. It was the same logic that governs A/B testing: hypothesis, variant, evaluation. The unsettling thing is how many people treat AI exploration as something fundamentally new, when the discipline required to do it well is already sitting in the standard PM toolkit, waiting to be applied at a larger scale.

Whether that discipline transfers is this essay's whole question. If you have ever run a rigorous A/B test (defined your metric before seeing results, resisted the pull of an early favorite, killed a variant you personally loved because the data disagreed), you have already practiced the cognitive discipline that makes AI exploration productive. You might just not have recognized the transfer yet.

§ 01 — The parallel

The same instinct, a very different scale

In A/B testing, the whole point is that you don't know which version will perform better. So you run both, measure, and let the data decide. You resist the urge to go with your gut before you have seen the results.

Working with AI on creative projects is the same problem in a different register. You don't know which direction is best until you have seen enough options. The difference is that AI lets you test at a scale that would have been absurd before. Instead of A versus B, you can run A through Z, and you should, because the best idea is rarely the first one or the second one. Forty variants isn't excess. It's the exploratory counterpart of sample size: enough coverage of the design space to trust what you find.

A/B Testing

Two variants. Hypothesis-driven. Success metric defined upfront. Winner picked by data.

Same Logic

Hypothesis → variant → evaluation. Discipline lives in criteria, not in generation speed.

AI Exploration

Forty variants. Same hypothesis discipline. Multi-criteria scoring. Winner emerges from evaluation.

FIG_001. A/B testing vs. AI exploration: the underlying logic. Scale changes but structure doesn't. Both methods collapse without upfront criteria, and both reward the person who resists rationalization after seeing the first promising result. © 2026

§ 02 — Where it maps

Four places the framework transfers directly

Hypothesis-driven exploration. In A/B testing, each variant starts with a hypothesis: "a shorter headline will convert better." In my process, each prototype started with one too: "what if scrolling changed the physics instead of the color?" or "what if the layout was asymmetric?" The hypothesis is what makes each variant worth building — not just different, but meaningfully different from what came before.

Predefined success metrics. You don't launch an A/B test without knowing what you're measuring. I didn't evaluate 40 prototypes without criteria either. I scored each one on interaction quality, technical depth, concept, performance, portfolio fit, and emotional register. Decide what "winning" means before you see the results, so you are not rationalizing after the fact. That rule does not change when the variants are AI-generated instead of hand-coded.

Statistical humility. Good A/B testers know that early results lie. A variant that looks great after 100 visitors might look average after 10,000. The prototype that impressed me most on first viewing, Magnetic Field (E26), turned out to lack the content layer that would have made it usable as a portfolio. First impressions aren't conclusions. The scoring framework is there precisely to override the impression.

Killing your darlings with data. The hardest part of A/B testing is shutting down a variant you personally like because the numbers say otherwise. Some of my favorite explorations scored poorly on portfolio fit or performance. The scoring framework gave me permission to let them go without second-guessing. That is not cruelty toward the work — it is how you prevent the best-looking option from crowding out the most effective one.
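To put rough numbers on the statistical-humility point above, here is a minimal sketch of how wide the uncertainty around a conversion rate is at 100 visitors versus 10,000. The function name `conversion_interval`, the 12% observed rate, and the use of the simple normal approximation are all assumptions made for illustration; none of them come from the portfolio project.

```python
import math


def conversion_interval(conversions: int, visitors: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a conversion rate (normal approximation)."""
    rate = conversions / visitors
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    # Clamp to the valid range for a proportion.
    return max(0.0, rate - margin), min(1.0, rate + margin)


# Hypothetical numbers: the same observed 12% rate at two sample sizes.
early = conversion_interval(12, 100)
late = conversion_interval(1200, 10_000)

print(f"after 100 visitors:    {early[0]:.3f} to {early[1]:.3f}")
print(f"after 10,000 visitors: {late[0]:.3f} to {late[1]:.3f}")
```

With these made-up inputs, the early interval runs from roughly 6% to 18%, while the later one narrows to roughly 11% to 13%. The same headline rate supports very different levels of confidence, which is why an early favorite deserves suspicion.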

The discipline isn't in generating the variants. It's in how you evaluate them — and that discipline is identical whether you're running a two-cell experiment or a forty-variant AI exploration.

§ 03 — Where it diverges

The one place the parallel breaks down

A/B tests pick a winner. My process picked a cast.

The top six explorations didn't compete for one slot. Each filled a different role, among them the ship-ready option, the memorable alternative, the safe fallback, and the conceptual peak. That's less like A/B testing and more like casting a movie. You're not looking for six versions of the same character; you're looking for a team where each member does something the others can't.

01 · Ship-ready option
High portfolio fit, clean performance, immediately usable. The one you ship if everything else is equal.
Role: default

02 · Memorable alternative
Highest concept score, strong emotional register, lower portfolio fit. The one that sticks in the interviewer's memory.
Role: contrast

03 · Safe fallback
Conservative, reliable, unlikely to read as risky. The option you reach for in a conservative context.
Role: hedge

04 · Conceptual peak
The most technically interesting. Not the most deployable. Demonstrates range, not judgment. Keep it in the drawer.
Role: proof of range

FIG_002. The top-six cast: role-based selection, not rank-based elimination. The winning variant in a role-based selection is not the one that scores highest overall; it is the one whose specific strength is most needed in context. This is a harder evaluation than a simple ranking.

This is maybe the most useful takeaway for anyone working with AI on open-ended problems. Don't just rank your outputs from best to worst. Ask what unique job each one does. You'll often find that the "winner" is obvious once you frame it that way — and it may not be the one that scored highest on the aggregate rubric.
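One way to make "ask what unique job each one does" concrete is to give each role its own fit function and fill roles one at a time, instead of sorting by total score. The sketch below is only an illustration: the `Variant` class, the `pick_cast` helper, the role definitions, the 1 to 5 scale, and every score are hypothetical, and the greedy fill order is a simplifying assumption.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    scores: dict[str, int]  # rubric dimension -> score on an assumed 1-5 scale

    @property
    def total(self) -> int:
        return sum(self.scores.values())


# Each role is defined by the strength it needs. These definitions are
# illustrative guesses, not the author's actual criteria.
ROLES: dict[str, Callable[[Variant], int]] = {
    "ship-ready": lambda v: v.scores["portfolio_fit"] + v.scores["performance"],
    "memorable": lambda v: v.scores["concept"] + v.scores["emotional_register"],
    "safe fallback": lambda v: min(v.scores.values()),
}


def pick_cast(variants: list[Variant], roles: dict[str, Callable[[Variant], int]]) -> dict[str, str]:
    """Greedily fill each role, in order, with the best-fitting variant not yet cast."""
    remaining = list(variants)
    cast: dict[str, str] = {}
    for role, fit in roles.items():
        if not remaining:
            break
        best = max(remaining, key=fit)
        cast[role] = best.name
        remaining.remove(best)
    return cast


# Made-up scores for three hypothetical variants.
variants = [
    Variant("A", {"concept": 3, "emotional_register": 3, "portfolio_fit": 5, "performance": 5}),
    Variant("B", {"concept": 5, "emotional_register": 5, "portfolio_fit": 2, "performance": 3}),
    Variant("C", {"concept": 4, "emotional_register": 4, "portfolio_fit": 4, "performance": 5}),
]

cast = pick_cast(variants, ROLES)
top_overall = max(variants, key=lambda v: v.total).name

print(cast)
print("highest total:", top_overall)
```

With these invented scores, C has the highest total yet lands in the fallback role, while A and B take the roles that match their specific strengths. A plain ranking would have hidden that.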

§ 04 — The exploration loop

How the process actually ran

The forty-prototype process was not freeform. It followed a structure that would be recognizable to anyone who has run a multi-cell experiment. Each cycle started with a hypothesis, generated one or more variants against it, scored them on the predefined rubric, and used the results to sharpen the next hypothesis. Weak scores in one dimension redirected the next round rather than causing the whole process to restart.

Hypothesize: what if scrolling changed the physics? → Generate: 1–N variants against the hypothesis → Score: 6-dimension rubric · redirect → Next cycle: sharpen · repeat
FIG_003. The exploration loop: hypothesize, generate, score, redirect. Each scoring cycle feeds the next hypothesis rather than terminating the process. Weak scores in one dimension (say, performance) redirect the next round toward variants that trade off concept score for speed. The loop tightens the space incrementally.
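The loop described above can be sketched as a small function. This is a structural illustration only: `explore`, `weakest_dimension`, and the two stub functions are hypothetical names, generation would really be a model call, and scoring would really be a human judgment against the rubric.

```python
from collections.abc import Callable, Iterable

Scores = dict[str, int]  # rubric dimension -> score for one variant


def weakest_dimension(cycle_scores: list[Scores]) -> str:
    """Return the dimension with the lowest combined score in one cycle.

    Assumes the cycle produced at least one scored variant.
    """
    totals: dict[str, int] = {}
    for scores in cycle_scores:
        for dimension, value in scores.items():
            totals[dimension] = totals.get(dimension, 0) + value
    return min(totals, key=lambda dimension: totals[dimension])


def explore(
    hypothesis: str,
    generate: Callable[[str], Iterable[str]],
    score: Callable[[str], Scores],
    cycles: int,
) -> list[tuple[str, list[str], list[Scores]]]:
    """Run hypothesize -> generate -> score -> redirect for a fixed number of cycles."""
    history = []
    for _ in range(cycles):
        variants = list(generate(hypothesis))
        cycle_scores = [score(variant) for variant in variants]
        history.append((hypothesis, variants, cycle_scores))
        # Redirect: the weakest dimension sharpens the next hypothesis
        # rather than restarting the whole process.
        hypothesis = f"{hypothesis} (next: improve {weakest_dimension(cycle_scores)})"
    return history


# Stand-ins for the real work, with fixed made-up scores.
def stub_generate(hypothesis: str) -> list[str]:
    return [f"{hypothesis} / variant {i}" for i in (1, 2)]


def stub_score(variant: str) -> Scores:
    return {"concept": 4, "performance": 2}


history = explore("scrolling changes the physics", stub_generate, stub_score, cycles=2)
for hypothesis, variants, _ in history:
    print(hypothesis, "->", len(variants), "variants")
```

The point of the shape is that scoring output is an input to the next hypothesis, so a weak round narrows the search instead of ending it.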

§ 05 — The scoring rubric

What "winning" meant before the variants were built

The rubric was defined before the first prototype ran. Six dimensions, equal weight, no retroactive adjustments once results came in. That constraint is load-bearing: the moment you let the metric shift in response to what the data produced, you have introduced the same rationalization bias that corrupts A/B tests when the experimenter redefines success after peeking at results mid-run.

01 · Interaction quality
Does the variant respond to the user in a way that feels intentional, not accidental?
Pre-defined before generation

02 · Technical depth
Does it demonstrate craft: not just output, but architecture and choice?
Pre-defined before generation

03 · Concept
Is there an idea here, or just execution? The best variants had both.
Highest variance dimension

04 · Performance
Frame rate, load time, battery impact. Non-negotiable floor: if it stutters, it's out.
Pre-defined before generation

05 · Portfolio fit
Does it communicate PM judgment, not just engineering? The filter that eliminated E26 (Magnetic Field) despite its high concept score.
The dimension that killed favorites

06 · Emotional register
What does it feel like to spend 10 seconds with it? Calm, anxious, impressed, bored?
Pre-defined before generation

FIG_004. The six-dimension scoring rubric, defined before generation began. Portfolio fit (dimension 05) did the most elimination work. Magnetic Field (E26), the highest first-impression scorer, failed here: it lacked the content layer that would have made it legible as a PM's work rather than an engineer's demo.
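A rubric like the one above can be written down as data before any variant exists, which makes it harder to adjust quietly later. The following is a minimal sketch under stated assumptions: the 1 to 5 scale, the `PERFORMANCE_FLOOR` threshold, the `evaluate` function, and the example scores are all hypothetical. Only the six dimension names, the equal weighting, and the idea of a performance floor come from the essay.

```python
from typing import Optional

# Fixed before any prototype exists; a tuple so it cannot be edited in place.
RUBRIC = (
    "interaction_quality",
    "technical_depth",
    "concept",
    "performance",
    "portfolio_fit",
    "emotional_register",
)
PERFORMANCE_FLOOR = 3  # assumed threshold on an assumed 1-5 scale


def evaluate(scores: dict[str, int]) -> Optional[float]:
    """Equal-weight mean over the rubric, or None if the performance floor is missed."""
    if set(scores) != set(RUBRIC):
        raise ValueError("score every rubric dimension, and only those")
    if scores["performance"] < PERFORMANCE_FLOOR:
        return None  # "if it stutters, it's out"
    return sum(scores[dimension] for dimension in RUBRIC) / len(RUBRIC)


# Made-up scores for two hypothetical variants.
impressive_demo = {
    "interaction_quality": 4, "technical_depth": 5, "concept": 5,
    "performance": 4, "portfolio_fit": 1, "emotional_register": 5,
}
stuttering_demo = dict(impressive_demo, performance=2)

print(evaluate(impressive_demo))
print(evaluate(stuttering_demo))
```

Rejecting score sheets that add or drop a dimension is a deliberate choice here: it is a cheap guard against the metric drifting after results are in.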

Predefined criteria aren't bureaucracy — they're the thing that lets you trust a result you don't personally like. Without them, you're just rationalizing whatever already looked best.

§ 06 — The answer

So, does the mental model transfer?

Yes — completely and immediately, with one extension. The A/B testing framework transfers to AI exploration without modification on every dimension except selection strategy. Hypothesis-driven variants, predefined criteria, statistical humility, and the willingness to kill well-liked options all apply directly. The only thing that changes is that AI scale makes role-based selection possible in a way that two-cell experiments cannot support: instead of picking the best variant, you pick the best cast.

That extension is actually an upgrade. Role-based selection is harder than rank-based selection — it requires you to articulate what specific job each finalist is doing, not just which one scored highest overall. But it produces better outcomes in open-ended creative work, because the "best overall" option is rarely the best option for every context the output will face.

The deeper answer is about where the discipline lives. Most conversation about AI tools focuses on generation: how to prompt, which model, which temperature. That focus is misplaced. Generation is easy — it happens at the speed of inference. The constraint is evaluation. Forty variants with no rubric is noise. Two variants with a clear rubric is science. The mental model that makes A/B testing rigorous is exactly the thing that makes AI exploration productive, because in both cases the hard work is deciding what you were looking for before you saw what you got.

If your team already thinks in experiments, hypotheses, and metrics, you are better prepared for creative AI work than you think. The skills transfer directly. The only shift is recognizing that AI lets you run your experiments at a scale that changes what is possible — not just faster versions of the same two options, but a genuinely broad exploration of the space. And just like with A/B testing, the discipline isn't in generating the variants. It's in how you evaluate them.

Kevin Nguyen is a product manager exploring consumer, fintech, and AI. The 40-prototype exploration described here produced the design system behind kevin.energy.
