June 12, 2026
Same Same but Different: The Anatomy of AI Design Sameness
Hand an agent your art direction and it still builds the page everyone else got. That is not a taste problem, it is a statistics problem. A look at the history of design convergence, the research that explains why prompts cannot escape it, and the levers that actually steer back toward identity.
Sascha Becker
Author · about 24 min read

Let an agent build you a website and you can tell. Not because it is bad. Because it is familiar. The dark hero with the radial glow. The centered headline in Inter. The two buttons, one filled, one outlined. The three feature cards with the little icons. The indigo. Always the indigo.
The uncharitable call it AI slop. The term is doing a lot of work, but the observation underneath it is real: AI-generated interfaces carry a statistical fingerprint that makes them identifiable on sight, the way a stock photo is identifiable on sight. And the usual responses to it are either resignation ("that's just what AI looks like") or the essay I refuse to write, the ten-thousandth meditation on how AI lacks artistic soul.
Both responses are wrong, because both treat the sameness as a mystery or a metaphysics. It is neither. It is a mechanism. The convergence is measurable, the causes are published, the psychology behind it is older than the web, and the research on how to steer out of it is moving fast. This post is an attempt to treat the topic the way it deserves to be treated: as a system with parts you can name.
We Have Built This Sameness Before
The current limbo feels new, but it is the third convergence wave in living memory, and the previous two teach us most of what we need.
| Wave | Years | Driver | What everything looked like |
|---|---|---|---|
| Material 1 apps | 2014 to 2018 | One design language, limited theming | White surfaces, FAB, drawer, brand color as accent |
| Template web | 2010 to 2019 | Frameworks, CMS templates, responsive grids | Hero, three cards, testimonial band, footer columns |
| Agentic interfaces | 2023 to now | Shared training corpus, shared scaffolds | Dark glow hero, Inter, indigo buttons, card grid |
Google announced Material Design at I/O in June 2014, and it was genuinely good: a coherent physics of surfaces, motion and depth that rescued Android from its Holo-era wilderness. But the first version shipped with limited room for deviation, and the path of least resistance led everyone to the same place. Apps shed their personalities for white backgrounds and a single accent color. The criticism grew loud enough that Google's 2018 refresh, Material Theming, was explicitly framed as the answer to apps looking too similar [1]. What was intended as an invitation to craft became a template for uniformity, and it took Google four years of new tooling and guidelines to partially walk it back. Apple ran the same experiment with a looser leash: the Human Interface Guidelines left more room for art direction, and iOS apps of that era did vary more. But the flat-design wave that iOS 7 kicked off in 2013 produced its own sea of interchangeable white screens. More freedom, same attractor, slower drift.

The web did not even need a design language to converge. It needed templates, frameworks and a decade. The cleanest evidence is a CHI 2021 study by Sam Goree and colleagues at Indiana University, who ran computer vision over more than 227,000 screenshots of roughly 10,000 websites spanning 2003 to 2019. Sites actually grew more diverse until about 2007. Then the trend reversed, hard: average layout distance between websites fell by 44 percent between 2010 and 2019 [2]. The drivers their interviewees named are a familiar list: shared frameworks and libraries (the study found library adoption strongly correlated with visual similarity), responsive design collapsing layouts onto stackable columns, CMS templates, and SEO and conversion practices dictating what goes above the fold.

Boris Müller had already diagnosed the same thing from the designer's side in his 2018 essay on the visual weariness of the web: templates are content-agnostic, and content-agnostic form is the opposite of design, because the deep connection between form and content is severed by construction [3].
Hold on to the dates. The web converged dramatically between 2010 and 2019. There was no generative AI in that loop. Whatever is happening now, AI did not invent it.
Average Is Beautiful, Literally
Why would design converge even without a machine pushing it? Because the human visual system rewards typicality, and it does so below the level of opinion.
Psychologists call the underlying account processing fluency: the easier a stimulus is to process, the more we like it, and familiarity, symmetry and prototypicality all make processing easier [4]. The effect shows up so reliably that it has its own name, beauty-in-averageness: prototypes are rated as more attractive, and experiments tracing the effect find that fluency is doing the mediating work. Average things are literally easy on the mind [5].
This is why every pre-AI convergence wave happened. A designer A/B tests two layouts, and the more familiar one converts better, because visitors parse it faster. A founder picks the template that "feels professional", which is to say, the one that resembles the last hundred sites they trusted. Raymond Loewy compressed the whole dynamic into a slogan decades ago: most advanced, yet acceptable. Novelty sells only up to the edge of familiarity.
So sameness is not a failure state of design culture. It is an attractor that design culture falls into whenever the cost of deviation rises or the reward for typicality grows. Every system that optimizes for immediate human approval will drift toward the prototype. Remember that sentence, because we are about to meet a training pipeline whose entire job is to optimize for immediate human approval.
The Machine of the Mean
Now the AI part, and the reason prompting harder does not get you out.
A language model is a compressed statistical portrait of its training corpus. When it generates, it samples from a probability distribution over continuations, and the safest probability mass sits near the mode: the most typical way the internet has ever built a landing page. For base models this distribution is actually fairly wide. The narrowing comes later, and we know where, because researchers have taken the pipeline apart stage by stage.
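To make "probability mass near the mode" concrete, here is a toy sketch. The four styles and their scores are invented for illustration, and temperature stands in for any process that sharpens a distribution; it is not a claim about how a particular model was trained.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens toward the mode."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical scores for four hero-section styles (illustrative numbers only).
styles = ["dark glow hero", "split layout", "editorial serif", "hand-drawn collage"]
logits = [3.0, 2.0, 1.0, 0.0]

for temperature in (1.5, 1.0, 0.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, dict(zip(styles, (round(p, 3) for p in probs))))
```

Each step down in temperature moves mass from the last entry to the first. The ranking never changes, only how much room is left for anything that is not the top choice.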
Robert Kirk and colleagues showed at ICLR 2024 that RLHF, the alignment stage where models are tuned against human preferences, substantially reduces output diversity compared to supervised fine-tuning, across models and across diversity metrics. They describe it as a direct trade-off: alignment buys generalisation and pays in variety [6]. A 2025 Stanford and Northeastern paper put a name on the root cause: typicality bias. Human annotators, asked which of two outputs they prefer, systematically favor the more familiar, more fluent, more prototypical one, exactly as the fluency literature predicts. That bias flows into the reward model, and the reward model then teaches the LLM that the mode is what humans want. The authors show this sharpening persists even with a perfect reward signal [7].
Read that again, because it is the keystone of this whole topic. The same cognitive bias that made human-built websites converge through A/B tests and template marketplaces has been distilled into a reward function and applied at industrial scale. Mode collapse is not an alien machine pathology. It is beauty-in-averageness with a training budget.
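The sharpening argument can be written down in a few lines. What follows is a sketch under stated assumptions, not the paper's exact formulation: the symbols $\alpha$ (strength of the annotators' typicality preference), $\beta$ (weight of the KL penalty) and $\gamma$ are notation chosen here, and the reward is assumed to split into a true-quality term plus a familiarity term.

```latex
% Assumed reward: true quality plus a typicality bonus (alpha > 0).
r(x, y) = r_{\text{true}}(x, y) + \alpha \log \pi_{\text{ref}}(y \mid x)

% Standard KL-regularized preference objective (beta > 0).
\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big)

% Its closed-form optimum, with the assumed reward substituted in.
\pi^{*}(y \mid x) \propto \pi_{\text{ref}}(y \mid x) \exp\!\big( r(x, y) / \beta \big)
  = \pi_{\text{ref}}(y \mid x)^{\gamma} \exp\!\big( r_{\text{true}}(x, y) / \beta \big),
  \qquad \gamma = 1 + \frac{\alpha}{\beta} > 1
```

When several designs are equally good, $r_{\text{true}}$ is flat and the optimum reduces to $\pi_{\text{ref}}^{\gamma}$: the base distribution raised to a power above one. The mode gains mass and the tails lose it, even though the true-quality term never favored the mode.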
And the result is no longer anecdotal. A NeurIPS 2025 paper introduced Infinity-Chat, a dataset of 26,000 real open-ended user queries, and evaluated more than 70 models on them. The finding earned a Best Paper Award and a memorable name, the Artificial Hivemind effect: individual models repeat themselves, and, worse, different models produce strikingly similar outputs to one another. Ask seventy models, get one aesthetic [8].

The homogenization then propagates into human work. In a large experiment published in Science Advances, writers given AI-generated ideas produced stories that judges rated as more creative, with the biggest gains going to the least creative writers, but the AI-assisted stories were measurably more similar to each other than the unassisted ones. The authors call it a social dilemma: individually you are better off taking the model's help, collectively the pool of novelty shrinks [9]. A study at Creativity & Cognition 2024 found the same shape in ideation: users brainstorming with ChatGPT produced less semantically distinct ideas across users than those using a non-LLM creativity tool, and felt less ownership of the result [10]. And an ICLR 2024 study of co-writing found the telling detail: essays written with a feedback-tuned model showed significantly reduced content diversity, while essays written with the base model did not. The narrowing force is not the language model. It is the alignment [11].
Two more pieces complete the machine. First, monoculture: when many decision-makers share one algorithm, outcomes correlate across the whole ecosystem, a dynamic Kleinberg and Raghavan formalized for hiring and Bommasani and colleagues confirmed empirically for shared models and datasets [12][13]. Essentially the entire industry now designs through three or four frontier models. The correlation is not a risk. It is the setup. Second, the loop closes: AI-generated sites get deployed, scraped, and folded into the next training corpus. The Nature paper on model collapse showed what indiscriminate recursive training does to a distribution: the tails vanish first, irreversibly [14].
The tails are where the identity lives
In distribution terms, the weird personal site, the hand-lettered nav, the portfolio that breaks the grid on purpose: these are tail events. Rare by definition, under-represented in every corpus, and the first thing recursive training erodes. The sameness machine does not attack identity. It simply forgets it, one training run at a time.
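The forgetting can be reproduced in miniature. The sketch below is a toy analogy, not the experiment from the model collapse paper: a "model" is just a table of style frequencies, each generation samples a finite corpus from it and refits by counting, and the style names and numbers are made up.

```python
import random
from collections import Counter

def next_generation(probs, sample_size, rng):
    """Sample a finite corpus from the current model, then refit it by counting."""
    styles = list(probs)
    weights = [probs[s] for s in styles]
    corpus = rng.choices(styles, weights=weights, k=sample_size)
    counts = Counter(corpus)
    return {s: counts[s] / sample_size for s in styles}

def simulate(probs, generations, sample_size, seed=0):
    """Return the final distribution and the number of surviving styles per generation."""
    rng = random.Random(seed)
    survivors = [sum(p > 0 for p in probs.values())]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        survivors.append(sum(p > 0 for p in probs.values()))
    return probs, survivors

# One dominant style and ten rare ones (hypothetical starting point).
start = {"dark glow hero": 0.90}
start.update({f"tail style {i}": 0.01 for i in range(10)})

final, survivors = simulate(start, generations=30, sample_size=50)
print(survivors)
```

A style that draws zero samples in one generation has probability zero in the next, so it can never come back. The survivor count can only fall, which is the irreversibility in small.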
Why Your design.md Cannot Save You
The standard advice is to write better art direction into the context: a design.md, a brand voice document, a stack of adjectives in the CLAUDE.md. I do this. You probably do too. It helps less than it should, and the distribution view explains exactly why.
A prompt does not extend the model's distribution. It conditions it. You are selecting a region of what the model already contains, and then the prior fills in every decision you did not explicitly make. A landing page involves thousands of micro-decisions: spacing rhythm, weight contrast, corner radii, how a section opens, what an empty state says. Your design.md pins perhaps thirty of them. The other thousands come from the mode.
Worse, the words you pin them with are themselves the mode's vocabulary. "Clean, modern, minimal, bold, with personality" is not a fingerprint, it is the statistically densest region of all design writing on the internet. Every brand document says it. Conditioning on the most common adjectives in the corpus lands you in the most common designs in the corpus, now wearing your brand color. Same same, but different. And if you asked the model to write the design.md in the first place, you have conditioned the mode on the mode. The document that was supposed to encode your identity is itself a sample from the distribution you are trying to escape.
The mechanics get comically concrete. In August 2025, Tailwind's creator Adam Wathan publicly apologized for making every button in Tailwind UI bg-indigo-500 five years earlier, "leading to every AI generated UI on earth also being indigo" [15]. He was joking, and he was right. One default in one popular component library, repeated across millions of tutorials and repos, became a high-probability token sequence, and now an entire generation of generated interfaces reaches for the same color the way water reaches for the lowest point. The same gravity operates at the stack level, where every coding agent converges on React, Tailwind and shadcn/ui regardless of vendor; I took that apart in a separate post, and the mechanism is identical, just one layer down.

This is why "heavy art direction in the CLAUDE.md" feels like it forces you down a narrow path others have traveled: it does. The path was paved before you arrived. Instructions choose among existing roads. They do not build new ones.
The Research on Steering Back
Here is the encouraging part, and the reason this post is not another lament. Mode collapse turns out to be, at least partially, an interface problem rather than a capability loss. The diversity is still in the model. The question being worked on, at four different layers, is how to get it back out.
At the decoding layer. Verbalized sampling, the technique proposed in the typicality-bias paper cited above, is the most practical result of the last year. Instead of asking for one answer, you ask the model to verbalize a distribution: five candidate directions, each with its estimated probability. Because the typicality sharpening is strongest on single-answer prompts, asking for the distribution routes around it. The authors measure 1.6 to 2.1 times higher diversity in creative tasks, training-free, on models you already use [7]. Alongside it sit sampler-level methods like min-p sampling, an ICLR 2025 oral, which truncates the token distribution dynamically so you can run hotter temperatures without losing coherence, though a follow-up analysis contests how large the gains really are [16][17].
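The "dynamic" part of min-p is easiest to see in code. This is a minimal sketch of the truncation rule only, with made-up token probabilities; real samplers apply it to the full vocabulary inside the decoding loop.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical next-token distributions for a color choice.
confident = {"indigo": 0.90, "teal": 0.06, "ochre": 0.03, "mauve": 0.01}
uncertain = {"indigo": 0.30, "teal": 0.28, "ochre": 0.22, "mauve": 0.20}

print(min_p_filter(confident, 0.1))  # cutoff is 0.09, so only "indigo" survives
print(min_p_filter(uncertain, 0.1))  # cutoff is 0.03, so all four survive
```

The cutoff scales with the model's own confidence: it is strict when one token dominates and permissive when the model is genuinely undecided, which is where extra variety is cheapest.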

At the training layer. Meta's Diverse Preference Optimization rewires the preference-tuning step itself: instead of always preferring the highest-scored response, it picks rare-but-good responses as the chosen examples and common-but-mediocre ones as rejected. The result in their experiments is a 45.6 percent increase in diversity of generated persona attributes and a 74.6 percent increase in story diversity at comparable quality [18]. Related work argues homogenization should be treated as task-dependent: you want one answer for a factual query and a spread for an aesthetic one, and training should learn the difference [19]. None of this is in your hands as a user, but it matters strategically: the labs now treat diversity loss as a defect to engineer against, not a cosmetic complaint.
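The selection rule described in that paragraph can be sketched as follows. The `reward` and `rarity` fields, the threshold and the sample pool are all hypothetical; the paper's actual scoring and thresholds may differ, so treat this as an illustration of the idea rather than the published algorithm.

```python
def select_preference_pair(candidates, reward_threshold):
    """Pick (chosen, rejected) from a pool of scored responses.

    Each candidate is a dict with 'text', 'reward' and 'rarity' keys,
    where a higher rarity means a less common response (assumed scores).
    """
    good = [c for c in candidates if c["reward"] >= reward_threshold]
    weak = [c for c in candidates if c["reward"] < reward_threshold]
    if not good or not weak:
        return None  # no usable contrast in this pool
    chosen = max(good, key=lambda c: c["rarity"])    # rare but good
    rejected = min(weak, key=lambda c: c["rarity"])  # common but mediocre
    return chosen, rejected

pool = [
    {"text": "dark hero, indigo button", "reward": 0.82, "rarity": 0.05},
    {"text": "newspaper grid, ochre ink", "reward": 0.80, "rarity": 0.70},
    {"text": "three cards, stock icons", "reward": 0.55, "rarity": 0.10},
    {"text": "random neon collage", "reward": 0.30, "rarity": 0.95},
]

chosen, rejected = select_preference_pair(pool, reward_threshold=0.75)
print(chosen["text"], "|", rejected["text"])
```

Plain preference tuning would pick the 0.82 response as the winner. Here the slightly lower-scored but far rarer one becomes the positive example, so the gradient points away from the mode without rewarding noise.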
At the representation layer. Anthropic's persona vectors work identifies directions in a model's activation space that correspond to character traits, and shows they can be monitored and steered [20]. Today the published traits are things like sycophancy, not visual taste. But the underlying idea, identity as a controllable direction rather than a prompt paragraph, is the most interesting long bet in this list. A future where "your studio's eye" is a vector you apply to a model, the way image people apply a style LoRA trained on their own work today, is technically continuous with what already exists.
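What "a direction you can monitor and steer" means mechanically, in two dimensions. This sketch assumes the direction is a difference of mean activations, a common construction for such vectors; the activations are invented, and real models have thousands of dimensions per layer.

```python
def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]

def trait_direction(with_trait, without_trait):
    """Assumed construction: mean activation with the trait minus mean without it."""
    a, b = mean_vector(with_trait), mean_vector(without_trait)
    return [x - y for x, y in zip(a, b)]

def projection(activation, direction):
    """Monitoring: how far an activation points along the trait direction."""
    return sum(h * d for h, d in zip(activation, direction))

def steer(activation, direction, strength):
    """Steering: shift an activation along the trait direction."""
    return [h + strength * d for h, d in zip(activation, direction)]

# Toy activations (hypothetical values).
with_trait = [[1.0, 0.0], [3.0, 0.0]]
without_trait = [[0.0, 0.0], [0.0, 2.0]]

direction = trait_direction(with_trait, without_trait)
steered = steer([0.0, 0.0], direction, strength=0.5)
print(direction, steered, projection(steered, direction))
```

Nothing in the arithmetic cares whether the trait is sycophancy or a studio's visual habits. The open question is whether something like taste is linear enough to be captured this way.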
At the workflow layer. This is where HCI research is most useful right now. Luminate, a CHI 2024 system, has the model first generate the dimensions of a design space and then populate it, so the human explores a structured field of genuinely different options instead of accepting the first sample [21]. A 2025 study showed that seeding LLMs with distinct personas measurably mitigates the homogenization effect in human-AI ideation [22]. The shared shape of all of it: diversity is recovered when the human stops being a consumer of one answer and becomes a curator of many.
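The dimensions-first idea is simple enough to sketch without any model in the loop. The dimensions below are hard-coded stand-ins for what a model would be asked to propose, and the helper names are mine, not Luminate's.

```python
from itertools import product

# Hypothetical dimensions, standing in for what the model would propose first.
DIMENSIONS = {
    "layout": ["single column", "asymmetric split", "index table"],
    "typography": ["grotesque sans", "high-contrast serif"],
    "palette": ["ink on paper", "two-tone duotone"],
}

def design_space(dimensions):
    """Enumerate every combination of dimension values as one candidate direction."""
    names = list(dimensions)
    combos = product(*(dimensions[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

def differs_on_every_axis(a, b):
    """True when two candidates share no value on any dimension."""
    return all(a[axis] != b[axis] for axis in a)

options = design_space(DIMENSIONS)
baseline = options[0]
contrasts = [o for o in options if differs_on_every_axis(o, baseline)]
print(len(options), len(contrasts))  # prints: 12 2
```

Once the space is explicit, "show me something different" stops being a vague request and becomes a filter: different on which axes, and relative to which baseline.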
Getting Your Identity Back, Today
The research above will reshape the tools. But you are shipping this quarter, with the models that exist. Here is what actually moves the needle, ordered by leverage, and every step is consistent with the mechanics above rather than wishful prompting.
1. Make the identity outside the model
The strongest signal you can give an agent is not prose, it is artifact. Models mirror the project they are dropped into far more faithfully than they follow abstract instructions. So decide the design language before the agent arrives: your palette as tokens, your two typefaces, your spacing scale, your border radii, your motion rules, encoded in a theme file or design tokens. Hand-build the first screen yourself, at whatever fidelity you can, and let the agent extend it. You are seeding the local distribution. From the model's perspective, your repo becomes the corpus.
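As a sketch of what "encoded in a theme file" can look like, the snippet below renders a token table as CSS custom properties. Every value in it is a placeholder, and the naming scheme (`--group-name`) is one convention among many, not a standard.

```python
# Placeholder tokens: swap in decisions you actually made by hand.
TOKENS = {
    "color": {"ink": "#1b1a17", "paper": "#f4efe6", "accent": "#c2410c"},
    "font": {"display": '"Fraunces", serif', "body": '"IBM Plex Sans", sans-serif'},
    "space": {"1": "0.25rem", "2": "0.5rem", "3": "1rem", "4": "2rem"},
    "radius": {"sm": "2px", "md": "6px"},
}

def to_css_variables(tokens, selector=":root"):
    """Render nested tokens as CSS custom properties named --group-name."""
    lines = [f"{selector} {{"]
    for group, values in tokens.items():
        for name, value in values.items():
            lines.append(f"  --{group}-{name}: {value};")
    lines.append("}")
    return "\n".join(lines)

css = to_css_variables(TOKENS)
print(css)
```

The format matters less than the fact that the file exists before the agent does. An agent that finds `--color-accent` already defined tends to use it rather than reach for a default.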
2. Replace adjectives with artifacts
"Warm and editorial" selects a mode. A screenshot of the exact magazine spread you mean selects a point. Reference images, links to three sites that share the feeling you want, your own previous work: all of these condition the model on something it cannot get from the adjective cloud. If your agent supports vision, design reviews against screenshots beat design instructions in prose every time. Point at things. Words are where the mode hides.
3. Constrain by exclusion
The mode is a known place, so you can name it and fence it off. No Inter. No indigo, no violet gradients. No centered hero with two buttons. No three-card feature row. No glassmorphism. Negative constraints work better than positive ones because they do not rely on the model interpreting taste; they carve away the highest-probability region and force the sampler to land somewhere else. This is the single cheapest intervention on this list relative to its effect.
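A fence is only useful if you check that it held. Here is a minimal sketch of a lint pass over generated markup and CSS; the patterns are illustrative, deliberately crude, and will need tuning for a real codebase.

```python
import re

# Illustrative patterns for the "known place" named in the text (assumptions, not a standard).
BANNED = {
    "Inter typeface": re.compile(r"font-family:[^;]*\bInter\b", re.IGNORECASE),
    "indigo or violet utility": re.compile(
        r"\b(?:bg|text|from|via|to|border)-(?:indigo|violet)-\d{2,3}\b"
    ),
    "glassmorphism blur": re.compile(r"backdrop-(?:filter|blur)"),
}

def find_violations(source):
    """Return the label of every banned pattern that occurs in the source text."""
    return [label for label, pattern in BANNED.items() if pattern.search(source)]

generated = """
body { font-family: Inter, sans-serif; }
.card { backdrop-filter: blur(12px); }
<a class="bg-indigo-500 text-white">Get started</a>
"""
clean = """
body { font-family: "Fraunces", serif; }
<a class="bg-amber-600 text-white">Get started</a>
"""

print(find_violations(generated))
print(find_violations(clean))
```

Run as a pre-commit hook or a CI step, a check like this turns the exclusion list from a request into a gate.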
4. Generate wide, choose narrow
Apply the verbalized sampling result manually: never accept the first design. Ask for five directions that are explicitly forbidden from sharing a layout, a palette or a typographic idea, and ask the model to say which one is the most conventional, then discard that one first. Then act as the art director: pick, combine, and reject. The Science Advances study is clear that the model lifts the floor; the studies on homogenization are equally clear that the ceiling, the distinctive choice, comes from the human doing the curating. Taste is the input the model cannot supply, because taste is exactly what got averaged out of it.
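The wide-then-narrow loop can be sketched in a few lines. The prompt wording is mine rather than the paper's, the reply is hand-written instead of coming from a model, and it assumes the model returns valid JSON, which real replies do not always do.

```python
import json

def build_prompt(brief, n=5):
    """Ask for several mutually distinct directions, each with a self-estimated probability."""
    return (
        f"Design brief: {brief}\n"
        f"Propose {n} design directions that share no layout, palette or typographic idea.\n"
        "Reply with a JSON list only. Each item needs the keys "
        '"name", "summary" and "probability" (your estimate, between 0 and 1, '
        "of how likely you would be to produce this direction by default)."
    )

def drop_most_conventional(reply_text):
    """Parse the reply and discard the direction the model rates as most likely."""
    directions = json.loads(reply_text)
    most_conventional = max(directions, key=lambda d: d["probability"])
    return [d for d in directions if d is not most_conventional]

# Hand-written stand-in for a model reply.
reply = json.dumps([
    {"name": "Dark glow", "summary": "Centered hero on a radial gradient", "probability": 0.55},
    {"name": "Ledger", "summary": "Dense index table, monospace numerals", "probability": 0.15},
    {"name": "Broadsheet", "summary": "Newspaper columns, oversized serif", "probability": 0.10},
])

print(build_prompt("Landing page for a bookbinding studio"))
print([d["name"] for d in drop_most_conventional(reply)])
```

The script only removes the obvious candidate. Choosing among what remains, or combining two of them, is the part that stays with you.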
5. Keep one hand-made thing per page
The tails are hand-made by definition. A drawn icon, a custom cursor, a strange 404, a chart that looks like nothing else, the navigation idea no template has. One genuinely hand-crafted element does more for perceived identity than a hundred prompt tokens, because it is the one part of the page that provably did not come from the distribution. It is also, not coincidentally, the part visitors remember.
What does not work
Stacking more adjectives into the design.md. Asking for "something less generic" (you will get the mode with a different accent color). Asking the model to invent your brand identity from scratch (you are sampling the thing you are escaping). Raising the temperature until coherence breaks. And trusting the model when it announces that the design is "unique and distinctive". Verify with your eyes, the way you would diff a PR.
The Mean Is a Starting Point, Not a Destination
The honest version of the conclusion has to grant the machine its due. The mode is the mode because, on average, people prefer it. It converts. It ships in an afternoon. For an internal tool, a prototype, a settings page, the average is not a defect, it is a gift; nobody needs an avant-garde admin panel. The Material 1 story teaches the same lesson from the other side: a strong shared baseline raised the floor of an entire ecosystem before it flattened the ceiling.
The problem is mistaking the floor for the building, and that mistake is now being made at the scale of the entire web, automatically, by default, with a feedback loop that erodes the alternatives out of future models. The sameness was always an attractor. What changed is that escaping it used to require ignoring a trend, and now it requires out-arguing a prior.
But the prior can be out-argued. The research says the diversity is still inside the models, recoverable by how we ask, trainable by better objectives, steerable as vectors, and explorable as design spaces. The history says convergence waves end when the tooling starts paying for deviation, the way Material Theming and the post-Bootstrap web eventually did. And the practice says that identity, today, costs exactly what it has always cost: making some decisions yourself, by hand, before the machinery arrives, and defending them when it pushes back.
AI-powered, not AI-squashed, is not a tooling feature you can wait for. It is a division of labor you have to decide on. The model supplies the average competently and instantly. You supply the reason anyone should remember the page. That part was never the machine's job, in 2014, in 2019, or now.
Sources
- Investigating the Homogenization of Web Design: A Mixed-Methods Approach
Goree, Doosti, Crandall, Su (CHI 2021). Computer vision over 227,000 website screenshots, 2003 to 2019: layout distance between sites fell 44 percent in the 2010s, with framework adoption strongly correlated with visual similarity.
- Why Do All Websites Look the Same?
Boris Müller's 2018 essay on the visual weariness of the web: templates are content-agnostic, and content-agnostic form severs the connection between form and content that design is supposed to be.
- Processing Fluency and Aesthetic Pleasure: Is Beauty in the Perceiver's Processing Experience?
Reber, Schwarz, Winkielman (2004). The foundational account of why easy-to-process stimuli are judged more beautiful: the psychology underneath every convergence wave in design.
- Prototypes Are Attractive Because They Are Easy on the Mind
Winkielman, Halberstadt, Fazendeiro, Catty (2006). The beauty-in-averageness effect, with fluency shown to mediate the attractiveness of prototypical stimuli.
- Understanding the Effects of RLHF on LLM Generalisation and Diversity
Kirk et al. (ICLR 2024). RLHF improves out-of-distribution generalisation but substantially reduces output diversity compared to supervised fine-tuning: alignment trades variety for reliability.
- Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
Zhang et al. (2025). Identifies typicality bias in human preference data as a root driver of mode collapse, and shows that asking for a verbalized distribution of answers recovers 1.6 to 2.1 times more diversity, training-free.
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
Jiang et al. (NeurIPS 2025 Best Paper). Across 70+ models on 26,000 real open-ended queries: models repeat themselves, and different models produce strikingly similar outputs to one another.
- Generative AI enhances individual creativity but reduces the collective diversity of novel content
Doshi, Hauser (Science Advances, 2024). AI ideas make individual stories better, especially for less creative writers, while making the collection of stories measurably more alike. The social dilemma of generative assistance.
- Homogenization Effects of Large Language Models on Human Creative Ideation
Anderson, Shah, Kreminski (Creativity & Cognition 2024). ChatGPT users produce less semantically distinct ideas across users than users of a non-LLM creativity support tool, and feel less ownership of them.
- Does Writing with Language Models Reduce Content Diversity?
Padmakumar, He (ICLR 2024). Co-writing with a feedback-tuned model reduces content diversity; co-writing with the base model does not. The narrowing force is the alignment, not the language modeling.
- Algorithmic monoculture and social welfare
Kleinberg, Raghavan (PNAS 2021). The formal argument that many decision-makers sharing one algorithm can lower overall outcome quality even when the algorithm is individually superior.
- Picking on the Same Person: Does Algorithmic Monoculture lead to Outcome Homogenization?
Bommasani, Creel, Kumar, Jurafsky, Liang (NeurIPS 2022). Empirical evidence that shared models and shared training data homogenize outcomes across deployments.
- AI models collapse when trained on recursively generated data
Shumailov et al. (Nature, 2024). Recursive training on model output irreversibly erodes the tails of the distribution: the rare, the strange, and the personal vanish first.
- Adam Wathan's indigo apology
The Tailwind creator formally apologizes for bg-indigo-500, 'leading to every AI generated UI on earth also being indigo'. One library default, distilled into a global aesthetic.
- Diverse Preference Optimization
Lanchantin et al. (Meta, 2025). Rewiring preference tuning to choose rare-but-good responses: 45.6 percent more diverse persona generation and 74.6 percent more diverse stories at comparable quality.
- Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs
Nguyen et al. (ICLR 2025 oral). Dynamic truncation sampling for more diverse output at high temperatures, alongside a 2025 critical re-analysis (arXiv:2506.13681) that contests the size of the gains.
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Chen et al. (Anthropic, 2025). Character traits as directions in activation space that can be monitored and steered: the research thread closest to 'identity as a vector'.
- Luminate: Structured Generation and Exploration of Design Space with Large Language Models
Suh, Chen, Min, Li, Xia (CHI 2024). Have the model generate the dimensions of the design space first, then populate it: exploration instead of acceptance.
- Diverse AI Personas Can Mitigate the Homogenization Effect in Human-AI Collaborative Ideation
2025 study showing that seeding LLMs with distinct personas measurably counteracts idea homogenization in collaborative settings.
- Design Against the Machine
Boris Müller on teaching design with, and against, generative tools: you cannot critically evaluate a technology from the sidelines.
