June 12, 2026

Same Same but Different: The Anatomy of AI Design Sameness

Hand an agent your art direction and it still builds the page everyone else got. That is not a taste problem, it is a statistics problem. A look at the history of design convergence, the research that explains why prompts cannot escape it, and the levers that actually steer back toward identity.

Sascha Becker

Author

About 24 minutes

Let an agent build you a website and you can tell. Not because it is bad. Because it is familiar. The dark hero with the radial glow. The centered headline in Inter. The two buttons, one filled, one outlined. The three feature cards with the little icons. The indigo. Always the indigo.

Sharp tongues call it AI slop. The term is doing a lot of work, but the observation underneath it is real: AI-generated interfaces carry a statistical fingerprint that makes them identifiable on sight, the way a stock photo is identifiable on sight. And the usual responses to it are either resignation ("that's just what AI looks like") or the essay I refuse to write, the ten-thousandth meditation on how AI lacks artistic soul.

Both responses are wrong, because both treat the sameness as a mystery or a metaphysics. It is neither. It is a mechanism. The convergence is measurable, the causes are published, the psychology behind it is older than the web, and the research on how to steer out of it is moving fast. This post is an attempt to treat the topic the way it deserves to be treated: as a system with parts you can name.

We Have Built This Sameness Before

The current limbo feels new, but it is the third convergence wave in living memory, and the previous two teach us most of what we need.

Wave	Years	Driver	What everything looked like
Material 1 apps	2014 to 2018	One design language, limited theming	White surfaces, FAB, drawer, brand color as accent
Template web	2010 to 2019	Frameworks, CMS templates, responsive grids	Hero, three cards, testimonial band, footer columns
Agentic interfaces	2023 to now	Shared training corpus, shared scaffolds	Dark glow hero, Inter, indigo buttons, card grid

Google announced Material Design at I/O in June 2014, and it was genuinely good: a coherent physics of surfaces, motion and depth that rescued Android from its Holo-era wilderness. But the first version shipped with limited room for deviation, and the path of least resistance led everyone to the same place. Apps shed their personalities for white backgrounds and a single accent color. The criticism grew loud enough that Google's 2018 refresh, Material Theming, was explicitly framed as the answer to apps looking too similar¹. What was intended as an invitation to craft became a template for uniformity, and it took Google four years of new tooling and guidelines to partially walk it back. Apple ran the same experiment with a looser leash: the Human Interface Guidelines left more room for art direction, and iOS apps of that era did vary more. But the flat-design wave that iOS 7 kicked off in 2013 produced its own sea of interchangeable white screens. More freedom, same attractor, slower drift.

[Figure: one identical app layout shown three times, each with a different theme color transforming its whole appearance.] Google's 2018 answer to the sameness it had built: Material Theming, one layout, three identities. Source: Material Design 2 documentation.

The web did not even need a design language to converge. It needed templates, frameworks and a decade. The cleanest evidence is a CHI 2021 study by Sam Goree and colleagues at Indiana University, who ran computer vision over more than 227,000 screenshots of roughly 10,000 websites spanning 2003 to 2019. Sites actually grew more diverse until about 2007. Then the trend reversed, hard: average layout distance between websites fell by 44 percent between 2010 and 2019². The drivers their interviewees named are a familiar list: shared frameworks and libraries (the study found library adoption strongly correlated with visual similarity), responsive design collapsing layouts onto stackable columns, CMS templates, and SEO and conversion practices dictating what goes above the fold.

[Figure: line chart of average pairwise distance between websites from 2003 to 2020 for color, layout, and CNN features, with all three metrics declining sharply after 2010.] The web converging, measured: average pairwise distance between popular websites, 2003 to 2020. Lower means more alike, and all three metrics fall through the 2010s, years before generative AI. Figure from Goree, Doosti, Crandall and Su, CHI 2021.
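To make the metric concrete, here is a minimal sketch of an average pairwise distance. The two-dimensional feature vectors and the Euclidean distance are illustrative assumptions; the study's actual color, layout and CNN features are not reproduced here.

```python
from itertools import combinations
from math import dist


def average_pairwise_distance(features):
    """Mean Euclidean distance over all unordered pairs of feature vectors."""
    pairs = list(combinations(features, 2))
    if not pairs:
        raise ValueError("need at least two feature vectors")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical "layout features" for the same three sites in two different years.
sites_early = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
sites_late = [(0.0, 0.0), (1.5, 2.0), (3.0, 4.0)]

early = average_pairwise_distance(sites_early)
late = average_pairwise_distance(sites_late)

# Relative drop in distance: the kind of number behind "fell by 44 percent".
relative_drop = (early - late) / early
print(f"early={early:.2f} late={late:.2f} drop={relative_drop:.0%}")
```

In this toy data the sites move halfway toward each other, so the metric halves. A falling curve in the chart means exactly this, measured over thousands of sites instead of three.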

Boris Müller had already diagnosed the same thing from the designer's side in his 2018 essay on the visual weariness of the web: templates are content-agnostic, and content-agnostic form is the opposite of design, because the deep connection between form and content is severed by construction³.

Hold on to the dates. The web converged dramatically between 2010 and 2019. There was no generative AI in that loop. Whatever is happening now, AI did not invent it.

Average Is Beautiful, Literally

Why would design converge even without a machine pushing it? Because the human visual system rewards typicality, and it does so below the level of opinion.

Psychologists call the underlying account processing fluency: the easier a stimulus is to process, the more we like it, and familiarity, symmetry and prototypicality all make processing easier⁴. The effect shows up so reliably that it has its own name, beauty-in-averageness: prototypes are rated as more attractive, and experiments tracing the effect find that fluency is doing the mediating work. Average things are literally easy on the mind⁵.

This is why every pre-AI convergence wave happened. A designer A/B tests two layouts, and the more familiar one converts better, because visitors parse it faster. A founder picks the template that "feels professional", which is to say, the one that resembles the last hundred sites they trusted. Raymond Loewy compressed the whole dynamic into a slogan decades ago: most advanced, yet acceptable. Novelty sells only up to the edge of familiarity.

So sameness is not a failure state of design culture. It is an attractor that design culture falls into whenever the cost of deviation rises or the reward for typicality grows. Every system that optimizes for immediate human approval will drift toward the prototype. Remember that sentence, because we are about to meet a training pipeline whose entire job is to optimize for immediate human approval.

The Machine of the Mean

Now the AI part, and the reason prompting harder does not get you out.

A language model is a compressed statistical portrait of its training corpus. When it generates, it samples from a probability distribution over continuations, and the safest probability mass sits near the mode: the most typical way the internet has ever built a landing page. For base models this distribution is actually fairly wide. The narrowing comes later, and we know where, because researchers have taken the pipeline apart stage by stage.

Robert Kirk and colleagues showed at ICLR 2024 that RLHF, the alignment stage where models are tuned against human preferences, substantially reduces output diversity compared to supervised fine-tuning, across models and across diversity metrics. They describe it as a direct trade-off: alignment buys generalization and pays in variety⁶. A 2025 Stanford and Northeastern paper put a name on the root cause: typicality bias. Human annotators, asked which of two outputs they prefer, systematically favor the more familiar, more fluent, more prototypical one, exactly as the fluency literature predicts. That bias flows into the reward model, and the reward model then teaches the LLM that the mode is what humans want. The authors show this sharpening persists even with a perfect reward signal⁷.

Read that again, because it is the keystone of this whole topic. The same cognitive bias that made human-built websites converge through A/B tests and template marketplaces has been distilled into a reward function and applied at industrial scale. Mode collapse is not an alien machine pathology. It is beauty-in-averageness with a training budget.
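A toy illustration of what sharpening does to a distribution may help here. This is not the paper's derivation: the four "design options", their probabilities and the exponent `gamma` are all hypothetical, and the sketch only shows the general effect of raising probabilities to a power above one and renormalizing.

```python
from math import log2


def sharpen(probs, gamma):
    """Raise each probability to the power gamma, then renormalize to sum to 1."""
    powered = [p ** gamma for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def entropy_bits(probs):
    """Shannon entropy in bits; lower means the distribution is less diverse."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical base-model preferences over four design directions.
base = [0.4, 0.3, 0.2, 0.1]

# Hypothetical sharpening strength; any value above 1 favors the mode.
sharpened = sharpen(base, gamma=3.0)

print([round(p, 2) for p in sharpened])
print(round(entropy_bits(base), 2), round(entropy_bits(sharpened), 2))
```

The most typical option grows from 40 percent to 64 percent of the mass, and the rarest one shrinks from 10 percent to 1 percent. Nothing was deleted from the distribution. The tail just became very unlikely to be sampled.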

And the result is no longer anecdotal. A NeurIPS 2025 paper introduced Infinity-Chat, a dataset of 26,000 real open-ended user queries, and evaluated more than 70 models on them. The finding earned a Best Paper Award and a memorable name, the Artificial Hivemind effect: individual models repeat themselves, and, worse, different models produce strikingly similar outputs to one another. Ask seventy models, get one aesthetic⁸.

[Figure: heatmap showing, for around two dozen language models, the share of generated outputs falling into each similarity band, with most of the mass concentrated in the 0.8 to 1.0 bands.] The Artificial Hivemind, quantified: for nearly every model tested, the largest share of outputs lands in the 0.8 to 1.0 similarity bands. Figure from Jiang et al., NeurIPS 2025.

The homogenization then propagates into human work. In a large experiment published in Science Advances, writers given AI-generated ideas produced stories that judges rated as more creative, with the biggest gains going to the least creative writers, but the AI-assisted stories were measurably more similar to each other than the unassisted ones. The authors call it a social dilemma: individually you are better off taking the model's help, collectively the pool of novelty shrinks⁹. A study at Creativity & Cognition 2024 found the same shape in ideation: users brainstorming with ChatGPT produced less semantically distinct ideas across users than those using a non-LLM creativity tool, and felt less ownership of the result¹⁰. And an ICLR 2024 study of co-writing found the telling detail: essays written with a feedback-tuned model showed significantly reduced content diversity, while essays written with the base model did not. The narrowing force is not the language model. It is the alignment¹¹.

Two more pieces complete the machine. First, monoculture: when many decision-makers share one algorithm, outcomes correlate across the whole ecosystem, a dynamic Kleinberg and Raghavan formalized for hiring and Bommasani and colleagues confirmed empirically for shared models and datasets¹² ¹³. Essentially the entire industry now designs through three or four frontier models. The correlation is not a risk. It is the setup. Second, the loop closes: AI-generated sites get deployed, scraped, and folded into the next training corpus. The Nature paper on model collapse showed what indiscriminate recursive training does to a distribution: the tails vanish first, irreversibly¹⁴.

The tails are where the identity lives

In distribution terms, the weird personal site, the hand-lettered nav, the portfolio that breaks the grid on purpose: these are tail events. Rare by definition, under-represented in every corpus, and the first thing recursive training erodes. The sameness machine does not attack identity. It simply forgets it, one training run at a time.

Why Your design.md Cannot Save You

The standard advice is to write better art direction into the context: a design.md, a brand voice document, a stack of adjectives in the CLAUDE.md. I do this. You probably do too. It helps less than it should, and the distribution view explains exactly why.

A prompt does not extend the model's distribution. It conditions it. You are selecting a region of what the model already contains, and then the prior fills in every decision you did not explicitly make. A landing page involves thousands of micro-decisions: spacing rhythm, weight contrast, corner radii, how a section opens, what an empty state says. Your design.md pins perhaps thirty of them. The other thousands come from the mode.

Worse, the words you pin them with are themselves the mode's vocabulary. "Clean, modern, minimal, bold, with personality" is not a fingerprint, it is the statistically densest region of all design writing on the internet. Every brand document says it. Conditioning on the most common adjectives in the corpus lands you in the most common designs in the corpus, now wearing your brand color. Same same, but different. And if you asked the model to write the design.md in the first place, you have conditioned the mode on the mode. The document that was supposed to encode your identity is itself a sample from the distribution you are trying to escape.

The mechanics get comically concrete. In August 2025, Tailwind's creator Adam Wathan publicly apologized for making every button in Tailwind UI bg-indigo-500 five years earlier, "leading to every AI generated UI on earth also being indigo"¹⁵. He was joking, and he was right. One default in one popular component library, repeated across millions of tutorials and repos, became a high-probability token sequence, and now an entire generation of generated interfaces reaches for the same color the way water reaches for the lowest point. The same gravity operates at the stack level, where every coding agent converges on React, Tailwind and shadcn/ui regardless of vendor; I took that apart in a separate post, and the mechanism is identical, just one layer down.

[Figure: Adam Wathan's post apologizing for bg-indigo-500, quote-tweeted by Kevin Kern with two screenshots of purple AI-generated interfaces and the line "So, GPT-5 hasn't solved the purple problem".] One default, one apology, 1.5 million views. The quote tweet documents the purple problem surviving into the next model generation. Screenshot from X, August 2025.

This is why "heavy art direction in the CLAUDE.md" feels like it forces you down a narrow path others have traveled: it does. The path was paved before you arrived. Instructions choose among existing roads. They do not build new ones.

The Research on Steering Back

Here is the encouraging part, and the reason this post is not another lament. Mode collapse turns out to be, at least partially, an interface problem rather than a capability loss. The diversity is still in the model. The question being worked on, at four different layers, is how to get it back out.

At the decoding layer. Verbalized sampling, the technique proposed in the same 2025 paper that named typicality bias, is the most practical result of the last year. Instead of asking for one answer, you ask the model to verbalize a distribution: five candidate directions, each with its estimated probability. Because the typicality sharpening is strongest on single-answer prompts, asking for the distribution routes around it. The authors measure 1.6 to 2.1 times higher diversity in creative tasks, training-free, on models you already use⁷. Alongside it sit sampler-level methods like min-p sampling, an ICLR 2025 oral, which truncates the token distribution dynamically so you can run hotter temperatures without losing coherence, though a follow-up analysis contests how large the gains really are¹⁶ ¹⁷.

[Figure: diagram contrasting direct prompting, which collapses to the single most typical answer, with verbalized sampling, which asks for five responses with probabilities and recovers the spread of the distribution.] Mode collapse as an interface problem: a direct prompt returns the mode, asking for a verbalized distribution returns the spread. Figure from Zhang et al., 2025.
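The sampler-level idea is small enough to sketch. Below is a minimal version of min-p truncation as it is usually described: keep every token whose probability is at least `min_p` times the top token's probability, then renormalize. The token names, probabilities and the `min_p` value are hypothetical, and temperature scaling is left out, so treat it as an illustration of the cutoff rather than any library's implementation.

```python
import random


def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def sample(probs, rng):
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distributions for a color choice.
confident = {"indigo": 0.80, "teal": 0.15, "ochre": 0.04, "zzz": 0.01}
flat = {"indigo": 0.30, "teal": 0.28, "ochre": 0.27, "zzz": 0.15}

# When the model is confident the cutoff is high and the junk tail is dropped.
kept_confident = min_p_filter(confident, min_p=0.1)

# When the model is uncertain the cutoff is low and every option survives.
kept_flat = min_p_filter(flat, min_p=0.1)

rng = random.Random(0)
choice = sample(kept_flat, rng)
```

The cutoff scales with the model's own confidence, which is the "dynamic" part: the tail is pruned only where it is likely to be noise.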

At the training layer. Meta's Diverse Preference Optimization rewires the preference-tuning step itself: instead of always preferring the highest-scored response, it picks rare-but-good responses as the chosen examples and common-but-mediocre ones as rejected. The result in their experiments is a 45.6 percent increase in diversity of generated persona attributes and a 74.6 percent increase in story diversity at comparable quality¹⁸. Related work argues homogenization should be treated as task-dependent: you want one answer for a factual query and a spread for an aesthetic one, and training should learn the difference¹⁹. None of this is in your hands as a user, but it matters strategically: the labs now treat diversity loss as a defect to engineer against, not a cosmetic complaint.
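The pair-selection idea described above can be sketched in a few lines. This is a loose illustration, not Meta's implementation: the reward scores, the quality threshold and the use of exact-duplicate counts as a stand-in for "common" versus "rare" are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    text: str
    reward: float  # hypothetical quality score from a reward model


def pick_pair(pool, reward_threshold):
    """Return (chosen, rejected): the rarest good response and the most common weak one."""
    counts = Counter(r.text for r in pool)  # crude rarity proxy (assumption)
    good = [r for r in pool if r.reward >= reward_threshold]
    weak = [r for r in pool if r.reward < reward_threshold]
    if not good or not weak:
        return None  # no usable preference pair in this pool
    chosen = min(good, key=lambda r: (counts[r.text], -r.reward))
    rejected = max(weak, key=lambda r: (counts[r.text], -r.reward))
    return chosen, rejected


# Hypothetical samples drawn from a model for one prompt.
pool = [
    Response("indigo hero", 0.9),
    Response("indigo hero", 0.9),
    Response("ochre editorial", 0.8),
    Response("gray cards", 0.3),
    Response("gray cards", 0.3),
    Response("neon chaos", 0.1),
]

chosen, rejected = pick_pair(pool, reward_threshold=0.5)
```

A standard preference pair would reward "indigo hero" for having the top score. Here the less frequent but still good "ochre editorial" becomes the positive example, which is what pushes the tuned model away from its own mode.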

At the representation layer. Anthropic's persona vectors work identifies directions in a model's activation space that correspond to character traits, and shows they can be monitored and steered²⁰. Today the published traits are things like sycophancy, not visual taste. But the underlying idea, identity as a controllable direction rather than a prompt paragraph, is the most interesting long bet in this list. A future where "your studio's eye" is a vector you apply to a model, the way image people apply a style LoRA trained on their own work today, is technically continuous with what already exists.

At the workflow layer. This is where HCI research is most useful right now. Luminate, a CHI 2024 system, has the model first generate the dimensions of a design space and then populate it, so the human explores a structured field of genuinely different options instead of accepting the first sample²¹. A 2025 study showed that seeding LLMs with distinct personas measurably mitigates the homogenization effect in human-AI ideation²². The shared shape of all of it: diversity is recovered when the human stops being a consumer of one answer and becomes a curator of many.

Getting Your Identity Back, Today

The research above will reshape the tools. But you are shipping this quarter, with the models that exist. Here is what actually moves the needle, ordered by leverage, and every step is consistent with the mechanics above rather than wishful prompting.

1. Make the identity outside the model

The strongest signal you can give an agent is not prose, it is artifact. Models mirror the project they are dropped into far more faithfully than they follow abstract instructions. So decide the design language before the agent arrives: your palette as tokens, your two typefaces, your spacing scale, your border radii, your motion rules, encoded in a theme file or design tokens. Hand-build the first screen yourself, at whatever fidelity you can, and let the agent extend it. You are seeding the local distribution. From the model's perspective, your repo becomes the corpus.
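As a sketch of what "identity as artifact" can look like, here is a small token table rendered into CSS custom properties. Every token name and value is a hypothetical placeholder, and the output format is one common convention rather than a requirement; the point is only that the decisions live in a file the agent reads, not in a paragraph it interprets.

```python
# Hypothetical design tokens, decided by a human before any agent touches the repo.
TOKENS = {
    "color-ink": "#1b1a17",
    "color-paper": "#f4efe6",
    "color-accent": "#c2410c",
    "font-display": '"Fraunces", serif',
    "font-body": '"IBM Plex Sans", sans-serif',
    "space-unit": "0.5rem",
    "radius-card": "2px",
    "motion-ease": "cubic-bezier(0.2, 0, 0, 1)",
}


def to_css(tokens):
    """Render the token table as CSS custom properties on :root."""
    lines = [f"  --{name}: {value};" for name, value in tokens.items()]
    return ":root {\n" + "\n".join(lines) + "\n}"


css = to_css(TOKENS)
print(css)
```

Checked into the project as a theme file, a table like this is part of the local corpus the agent mirrors.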

2. Replace adjectives with artifacts

"Warm and editorial" selects a mode. A screenshot of the exact magazine spread you mean selects a point. Reference images, links to three sites that share the feeling you want, your own previous work: all of these condition the model on something it cannot get from the adjective cloud. If your agent supports vision, design reviews against screenshots beat design instructions in prose every time. Point at things. Words are where the mode hides.

3. Constrain by exclusion

The mode is a known place, so you can name it and fence it off. No Inter. No indigo, no violet gradients. No centered hero with two buttons. No three-card feature row. No glassmorphism. Negative constraints work better than positive ones because they do not rely on the model interpreting taste; they carve away the highest-probability region and force the sampler to land somewhere else. This is the single cheapest intervention on this list relative to its effect.
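Exclusions are also easy to enforce mechanically after generation. The following is a rough sketch of a check that scans generated markup or CSS for a few of the fenced-off defaults. The rule names and regular expressions are assumptions tuned to Tailwind-style class names, and they will miss plenty of variants.

```python
import re

# Hypothetical ban list; extend it with whatever the mode looks like in your stack.
BANNED = {
    "Inter typeface": re.compile(r"font-family:[^;]*\bInter\b", re.IGNORECASE),
    "indigo or violet utility": re.compile(
        r"\b(?:bg|text|from|via|to|border)-(?:indigo|violet)-\d{2,3}\b"
    ),
    "glassmorphism blur": re.compile(r"\bbackdrop-blur(?:-\w+)?\b"),
}


def find_violations(source):
    """Return the sorted names of all banned patterns found in the source text."""
    return sorted(name for name, pattern in BANNED.items() if pattern.search(source))


generated = '<button class="bg-indigo-500 backdrop-blur-md">Get started</button>'
clean = '<button class="bg-orange-700">Get started</button>'

print(find_violations(generated))  # ['glassmorphism blur', 'indigo or violet utility']
print(find_violations(clean))      # []
```

Run as a pre-commit or CI step, a check like this turns the negative constraint from a request into a gate.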

4. Generate wide, choose narrow

Apply the verbalized sampling result manually: never accept the first design. Ask for five directions that are explicitly forbidden from sharing a layout, a palette or a typographic idea, and ask the model to say which one is the most conventional, then discard that one first. Then act as the art director: pick, combine, and reject. The Science Advances study is clear that the model lifts the floor; the studies on homogenization are equally clear that the ceiling, the distinctive choice, comes from the human doing the curating. Taste is the input the model cannot supply, because taste is exactly what got averaged out of it.
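A minimal sketch of that manual workflow, under stated assumptions: the prompt wording, the JSON reply format and the example reply are all hypothetical, and no model API is called. The code only builds the request and applies the "discard the most conventional first" rule to a reply.

```python
import json


def build_prompt(brief, n=5):
    """Ask for n mutually different directions, each with a self-estimated probability."""
    return (
        f"Brief: {brief}\n"
        f"Propose {n} design directions that share no layout, palette or typographic idea.\n"
        "Reply with a JSON list of objects with keys 'direction' and 'probability', "
        "where 'probability' is your estimate of how typical that direction is."
    )


def drop_most_conventional(reply_json):
    """Parse the reply and remove the candidate the model rates as most typical."""
    candidates = json.loads(reply_json)
    most_typical = max(candidates, key=lambda c: c["probability"])
    return [c for c in candidates if c is not most_typical]


prompt = build_prompt("landing page for a neighborhood bakery")

# Hypothetical model reply, shortened to three candidates for readability.
reply = json.dumps([
    {"direction": "dark glow hero", "probability": 0.60},
    {"direction": "newspaper grid", "probability": 0.25},
    {"direction": "hand-drawn map nav", "probability": 0.15},
])

shortlist = drop_most_conventional(reply)
```

What remains is a shortlist for a human to pick from, combine or reject. The code narrows the field, and the choosing stays with the art director.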

5. Keep one hand-made thing per page

The tails are hand-made by definition. A drawn icon, a custom cursor, a strange 404, a chart that looks like nothing else, the navigation idea no template has. One genuinely hand-crafted element does more for perceived identity than a hundred prompt tokens, because it is the one part of the page that provably did not come from the distribution. It is also, not coincidentally, the part visitors remember.

What does not work

Stacking more adjectives into the design.md. Asking for "something less generic" (you will get the mode with a different accent color). Asking the model to invent your brand identity from scratch (you are sampling the thing you are escaping). Raising the temperature until coherence breaks. And trusting the model when it announces that the design is "unique and distinctive". Verify with your eyes, the way you would diff a PR.

The Mean Is a Starting Point, Not a Destination

The honest version of the conclusion has to grant the machine its due. The mode is the mode because, on average, people prefer it. It converts. It ships in an afternoon. For an internal tool, a prototype, a settings page, the average is not a defect, it is a gift; nobody needs an avant-garde admin panel. The Material 1 story teaches the same lesson from the other side: a strong shared baseline raised the floor of an entire ecosystem before it flattened the ceiling.

The problem is mistaking the floor for the building, and that mistake is now being made at the scale of the entire web, automatically, by default, with a feedback loop that erodes the alternatives out of future models. The sameness was always an attractor. What changed is that escaping it used to require ignoring a trend, and now it requires out-arguing a prior.

But the prior can be out-argued. The research says the diversity is still inside the models, recoverable by how we ask, trainable by better objectives, steerable as vectors, and explorable as design spaces. The history says convergence waves end when the tooling starts paying for deviation, the way Material Theming and the post-Bootstrap web eventually did. And the practice says that identity, today, costs exactly what it has always cost: making some decisions yourself, by hand, before the machinery arrives, and defending them when it pushes back.

AI-powered, not AI-squashed, is not a tooling feature you can wait for. It is a division of labor you have to decide on. The model supplies the average competently and instantly. You supply the reason anyone should remember the page. That part was never the machine's job, in 2014, in 2019, or now.

Sources
