2026-02-14

Exploiting the Actual Mechanics of LLMs: A Framework Atlas

A deep dive into what LLMs mechanistically do well, and frameworks designed to exploit those strengths.

15 min read

Part 1: The Core Mechanics (What the Architecture Actually Does — and Where It Breaks)

Before building frameworks, we need to be precise about what an LLM mechanistically does — not what the marketing says, and not a simple list of strengths. Each property below is a neutral architectural mechanic: a consequence of how transformers work that produces both a reliable strength (to exploit) and a predictable weakness (to route around). The 8 frameworks in Part 2 each exploit specific mechanic strengths while compensating for specific mechanic weaknesses.

Compressed Pattern Blending · Sequential Token Generation · Context Window as Working Memory · Lossy Expert Compression · Edit > Generate Asymmetry · Parallel Exploration Capacity

Mechanic 1: Compressed Pattern Blending

An LLM has compressed billions of documents into weighted connections. It doesn't "know" things — it has structural templates for how things relate. When you ask about pricing strategy, it activates a cluster of patterns from every pricing discussion it's ever seen: consulting frameworks, academic papers, Reddit arguments, SEC filings, blog posts. Critically, these structural templates are separable from their content domains — the argumentative structure of a philosophy paper, the narrative arc of a screenplay, the logical flow of a legal brief. These structures can be mapped across domains the model has never seen combined.

Strength — Cross-Domain Structure Transfer: The model holds structural templates across thousands of domains and can map proven frameworks from one domain onto entirely new territory. Cross-domain prompts produce structurally sound novelty because the model is mapping an established framework onto unexplored terrain. Force a cross-domain mapping and the model must find structural connections it would never activate in a normal query.

Weakness — Centroid Convergence: The default output is always the statistical centroid — the most average, most common version of a pattern. Vanilla prompts produce vanilla output not because the model is dumb, but because it gives the safest answer in probability space. Every user asking the same question gets functionally identical responses. Your "custom" marketing strategy is the same one your competitor got five minutes ago — it just sounds convincing enough that neither of you notices.

Supporting evidence: A PNAS 2025 study found AI-generated stories repeat the same plot element combinations across generations — individually plausible, collectively identical [1]. A separate study testing outputs across multiple LLM families found their responses were far more similar to each other than human responses are to each other, confirming centroid convergence is a structural property of the architecture, not a quirk of any single model [2]. A study of 2,200 college essays found that each additional human essay contributed more new ideas than each additional AI essay, and the diversity gap widened with more essays [3]. An ICLR 2024 paper formalized "analogical prompting," showing LLMs can self-generate exemplars from analogous domains to solve new problems, outperforming standard CoT prompting [4]. A 2025 materials science paper demonstrated that LLMs using explicit cross-domain analogies produced novel material candidates outside established compositional spaces [5]. The "Thought Propagation" framework (2023) showed that leveraging analogous problems improves performance where Chain-of-Thought and Tree-of-Thought fail [6].

Chef's Analogy: Imagine a chef who has cooked in every single restaurant on the planet — taco trucks, sushi bars, Parisian bistros, your grandmother's kitchen, a spaceship cafeteria. Their brain is a giant mashup of every recipe ever written. The strength? You can say "make me Mexican food using Japanese techniques" and they'll hand you a miso-braised carnitas taco that no cookbook on earth contains — because they can mix-and-match building blocks across cuisines like a kid combining LEGO sets. The weakness? Walk in and just say "cook me the best dinner you can." This chef has every cuisine on earth in their head — they could make any of them. But "best" doesn't point anywhere specific. So the chef reaches for the safest dish in the room. The one with the most gravitational pull. The one that the most people, across every restaurant they've ever worked, would accept without complaint. What lands on your plate is mac and cheese. Not because anyone's favorite dinner is mac and cheese — but because nobody's least favorite dinner is mac and cheese. It's like asking a thousand people to name their favorite color and going with the one that offends nobody. You end up with beige — not anyone's actual favorite, just the least objectionable.

Origins: Dense vector representations of knowledge — Mikolov et al., Word2Vec (2013). Cross-domain attention at scale — Vaswani et al., "Attention Is All You Need" (2017). Centroid convergence empirically characterized — creative homogeneity studies [1][2][3] (2025).

Interactive visualization

Mechanic 2: Sequential Token Generation

LLMs generate output one token at a time, each conditioned on everything that came before. The attention mechanism finds token sequences that satisfy all weighted conditions simultaneously. When given a clear schema, template, or output format, the structured format acts as a "rail" — heavily constraining the output space at each step. The model has seen millions of examples of formatted content (JSON, tables, templates, forms), so structured generation is a well-trodden path.

Strength — Constraint Satisfaction and Structured Fidelity: The model is remarkably good at satisfying multiple simultaneous constraints ("Write 200 words with these 5 keywords, avoiding these 3 words, in this tone, ending with a question"). Each constraint eliminates a region of probability space, pushing the model away from the centroid toward the edges where interesting solutions live. Structured output formats are followed with high reliability because structure constrains the next-token prediction at every step.

Weakness — Greedy Local Optimization and Planning Failures: Because generation is sequential and forward-only, the model can't backtrack. It makes locally optimal token choices that may be globally suboptimal. On long-horizon tasks, this produces greedy shortcuts — early commitments that foreclose better solutions downstream. The model tries to hold entire complex plans in working memory simultaneously, and the resulting step-by-step reasoning creates cascading errors on multi-step tasks.

Supporting evidence: The Italy/ChatGPT natural experiment found that when ChatGPT was temporarily banned, restaurant marketing content showed 15% decreases in textual similarity and a 3.5% increase in customer engagement — less AI assistance produced more distinctive content [7]. The broader content homogenization literature (74% of new web pages contain AI-generated content per Ahrefs [8]) establishes the problem that constraint stacking solves. "Why Reasoning Fails to Plan" (January 2026) demonstrated that LLM step-by-step reasoning creates greedy shortcuts causing rapid degradation on complex multi-step tasks [15].

Chef's Analogy: This chef has to cook an 8-course meal, one course at a time, in order — and no protein can repeat. Once it's used, it's gone. Give them a pre-planned menu with exact recipes for all 8 courses? They'll execute it beautifully — the best proteins saved for the moments that matter most. But ask them to improvise? They use the wagyu in course 2 — a small appetizer bite — because it's the best protein available right now. Gorgeous bite. But course 6 is the main, the centerpiece of the whole meal, and the best ingredient in the kitchen is already gone. They're building the most important dish of the night around their fourth-best option because they spent the star on an opener. Every early choice was the best option in the moment and a problem for later.

Origins: Autoregressive generation — Radford et al., GPT (2018). Text degeneration from greedy decoding — Holtzman et al., "The Curious Case of Neural Text Degeneration" (ICLR 2020). Planning failures formalized — "Why Reasoning Fails to Plan" (2026) [15].

Interactive visualization

Mechanic 3: Context Window as Working Memory

The context window is the only "memory" the model has at inference time. Everything in it — the system prompt, the conversation history, any documents you paste — gets attended to during generation. The model pattern-matches against context window contents with much higher weight than its general training. This makes the context window a powerful lever: what you put in it determines which training distributions activate.

Strength — Domain Priming: Strategically loading the context window with specific domain knowledge dramatically shifts output quality. This isn't just "providing background" — it's selecting which of the model's internal knowledge clusters get highest activation. The same model produces fundamentally different (and better) output when primed with your actual data, real metrics, and specific constraints versus being asked cold.

Weakness — Finite Capacity and Positional Degradation: The context window has hard limits, and even within those limits, attention is not uniform. The "lost in the middle" problem means information in the center of long contexts gets disproportionately less attention than information at the beginning and end. There is no persistent memory across conversations. At scale, even the best models show significant degradation — Stanford's HELM evaluation found the best model scored just 0.588/1.0 on long-context tasks at 128K tokens.

Supporting evidence: "Context engineering" was coined and popularized by Andrej Karpathy and Shopify CEO Tobi Lütke in June 2025 [21][22]. Anthropic formalized the concept in September 2025 [23]. The "lost in the middle" problem was first documented by Liu et al. (Stanford, 2023) [24] and confirmed at scale by NVIDIA's RULER benchmark [25] and Stanford's HELM Long Context evaluation [26]. By late 2025, an academic survey of 1,300+ papers formalized context engineering as a distinct discipline [27].

Chef's Analogy: Picture a chef with amnesia who can only cook with whatever ingredients are sitting on the counter right in front of them. No pantry, no fridge, no "oh wait, I have cumin in the back." If it's not on the counter, it doesn't exist. The strength? Load that counter with your exact ingredients — your specific dietary needs, the flavors you love, the leftovers in your fridge — and this chef will cook circles around the same chef staring at a bare counter going "uh… pasta?" The weakness? The counter is only so big. Pile too much on it and the stuff in the middle gets buried under everything else. The salt shaker at the front and the olive oil at the back get used. The saffron in the middle? Forgotten. What you put on that counter — and where you put it — is the whole game.

Origins: In-context learning at scale — Brown et al., GPT-3 (NeurIPS 2020) [16]. Lost-in-the-middle degradation — Liu et al. (Stanford, 2023) [24]. Capacity benchmarks — RULER (NVIDIA, 2024) [25], HELM Long Context (Stanford, 2025) [26]. "Context engineering" coined — Karpathy & Lütke (June 2025) [21][22].

Mechanic 4: Lossy Expert Compression

The model has ingested the equivalent of millions of expert-hours across thousands of domains. It can't perfectly recall any one expert's work, but it has compressed the reasoning patterns of experts in nearly any field. Think of it like a colleague who has read every business book ever written — they can't quote any one accurately, but they've internalized the underlying patterns of strategic thinking. Role prompting activates different subsets of these compressed patterns, literally changing which knowledge clusters are most active.

Strength — Role-Activated Specialization: A "senior McKinsey partner" prompt and a "veteran short-seller" prompt produce structurally different analyses because they activate different training distributions. The model holds thousands of these compressible expert personas, making it possible to simulate multi-perspective analysis from a single system. Role prompting isn't a stylistic trick — it's a mechanism for navigating the model's compressed expertise.

Weakness — Confident Hallucination: Because the model can't distinguish between patterns it retrieved from training data and patterns it plausibly generated by interpolation, it produces fabricated information with the same confidence as factual information. The lossy compression means the boundary between "I learned this" and "I inferred this" doesn't exist internally. Counterintuitively, more capable reasoning models can hallucinate more because they're better at constructing plausible-sounding chains of inference from incomplete knowledge.

Supporting evidence: Role prompting efficacy is well-documented across the prompt engineering literature. Anthropic's documentation recommends assigning specific expert roles to improve output quality. The mechanistic claim about weight activation is a simplification of how attention patterns shift based on context — the model doesn't have literally separate "expert modules," but the statistical effect is analogous. SimpleQA hallucination benchmarks show persistent confident confabulation across model families.

Chef's Analogy: This chef once spent a week in every famous kitchen on earth — the sushi master in Tokyo, the pasta grandmother in Tuscany, the pastry wizard in Paris. They didn't memorize any single chef's recipes word-for-word, but they absorbed the vibe — the philosophy, the hand movements, the instincts. Say "cook Japanese" and the knife work changes, the plating gets minimal, the philosophy shifts to restraint. Say "cook Italian grandma" and suddenly it's all generous portions and "taste this, taste this." Same chef, entirely different meal. The weakness? Sometimes they "remember" learning a technique they actually never saw. They'll plate something with total confidence, tell you a detailed story about the master who taught it to them — and the whole thing is made up. Stitched together from half-remembered kitchens into something that sounds real but never happened. And here's the scary part: they can't tell the difference. The real memories and the invented ones feel exactly the same to them.

Origins: Neural compression of knowledge — Hinton & Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks" (Science, 2006). Role-following via scale — Brown et al., GPT-3 (2020) [16]. Hallucination benchmarking — OpenAI SimpleQA (2024).

Interactive visualization

Mechanic 5: Edit > Generate Asymmetry

LLMs are measurably better at improving existing text than generating from scratch. Editing is a narrower task — the model can focus its attention on specific elements rather than generating everything simultaneously. Critique-and-rewrite cycles exploit this asymmetry, producing output 30-60% better than single-shot generation.

Strength — Iterative Refinement: The generate → critique → rewrite loop leverages the model's strong evaluation capabilities. Each pass narrows the problem: the first pass creates raw material, subsequent passes refine specific dimensions. The model's critique ability is often stronger than its generation ability because evaluation requires less creative search than production.

Weakness — Single-Shot Default: Single-shot generation is the model's weakest mode — yet it's how the vast majority of people use LLMs. One prompt, one response, done. This means most users experience the model at its worst: generating everything simultaneously with no opportunity for self-correction. The gap between single-shot and iterative output quality is one of the largest and most consistently documented improvements available.

Supporting evidence: The "Self-Refine" framework (Madaan et al., 2023) demonstrated that iterative self-feedback loops improve LLM output across code generation, math reasoning, and dialogue by 5-40% across tasks without any additional training [9]. Constitutional AI (Anthropic, 2022) uses a similar critique-revision loop as a core training methodology [10].

Chef's Analogy: Here's a dirty secret about cooking: fixing a dish is way easier than nailing it on the first try. Soup too bland? Add salt. Too salty? Add acid. Too thin? Reduce it. Every good chef knows the magic loop: make it, taste it, wince, fix it, taste it again, smile. The dish gets dramatically better with every lap around that loop. The weakness? Most people use this chef like a vending machine — press a button, grab the bag of chips, walk away. One prompt, one response, done. They never stick a spoon in and say "needs more garlic." So they experience the chef at their absolute worst: everything cooked in a single blind pass, no tasting, no adjusting, no second chances. That's like eating the first draft of every meal. Nobody should eat first drafts.

Origins: Iterative self-refinement formalized — Madaan et al., "Self-Refine" (NeurIPS 2023) [9]. Critique-revision as training methodology — Bai et al., Constitutional AI (Anthropic, 2022) [10].

Interactive visualization

Mechanic 6: Parallel Exploration Capacity

No human can generate 50 variations of a headline in 30 seconds. LLMs can, and more importantly, they can generate variations spanning different dimensions of variation (tone, length, audience, angle) simultaneously. This is fundamentally a parallel search through possibility space.

Strength — Rapid Breadth Search: The model can quickly explore a wide solution space, generating many options across multiple dimensions of variation. This makes it ideal for the divergent phase of any creative or strategic process — generating candidates for human evaluation.

Weakness — Centroid Clustering Without Diversity Forcing: Without explicit instructions to diversify, the generated variations cluster around the centroid (linking back to Pattern Blending's weakness). Asking for "10 options" often yields 10 variations of the same underlying idea. The model's default is to produce the most probable variation each time, which means repeated sampling without diversity constraints produces redundant output rather than genuinely different alternatives.

Supporting evidence: This weakness is a direct consequence of Pattern Blending's centroid convergence operating at the variation level. The creative homogeneity studies [1][2][3] demonstrate this clustering effect — AI outputs are more similar to each other than human outputs, even when explicitly asked for variety.

Chef's Analogy: Imagine a chef with 50 hands who can plate 50 different appetizers in the time it takes you to butter toast. Incredible speed. The weakness? Without being told to actually vary the cooking method, all 50 dishes come out looking like cousins at a family reunion — different garnish, same DNA underneath. One has parsley, another has cilantro, a third has chives. But they're all the same chicken on the same plate with the same sauce. You have to explicitly say "give me one raw, one braised, one fried, one fermented, one that's basically a drink" — or the "variety" is just a costume change on the same dish. Speed without direction produces quantity, not diversity.

Origins: Temperature-based sampling — Ackley, Hinton & Sejnowski (1985). Nucleus (top-p) sampling — Holtzman et al., "The Curious Case of Neural Text Degeneration" (ICLR 2020). Clustering weakness demonstrated — cross-LLM homogeneity studies [1][2][3] (2025).

Interactive visualization

Part 2: The Frameworks (How I Actually Use These Mechanics)

The 6 mechanics above are the science — what transformers do, where they break, and why. The 8 frameworks below are how I actually put that science to work. Each one is a repeatable prompting pattern designed to exploit specific mechanic strengths while routing around specific mechanic weaknesses. I didn't invent most of these techniques — they come from established research (see provenance notes per framework). What I've built is the mapping layer: knowing which mechanic each framework leverages, which weakness it compensates for, and when to reach for one versus another. That mapping is what turns a bag of prompting tricks into a system.

Cross-Domain Synthesis · Perspective Multiplication · Constraint Stacking · Recursive Decomposition · Exemplar Anchoring · Inversion Prompting · Diverge-Converge Cycling · Context Priming

Framework 1: Cross-Domain Synthesis (The "Pirate-Python")

Exploits

Pattern Blending strength (cross-domain structure transfer) — forces the model to map proven frameworks from solved domains onto new territory

Compensates

Pattern Blending weakness (centroid convergence) — no cliché exists at the intersection of two unrelated domains, so the centroid is bypassed

Provenance

The concept of analogical reasoning dates to Gentner's Structure Mapping Theory (1983) [11]. Its application to LLM prompting was formalized as "Analogical Prompting" by Yasunaga et al. (ICLR 2024) [4] and extended to scientific discovery by the LacMaterial paper (2025) [5]. The "Thought Propagation" framework (Yu et al., 2023) demonstrated analogous-problem solving improves over CoT/ToT [6]. Our contribution: the specific business-strategy domain-pairing table, the emphasis on breakdown analysis, and the fusion cuisine framing.

The Mechanic: You take a domain where the structural logic is well-established ("solved") and force the model to map that structure onto an unsolved or ambiguous domain. The model can't fall back on clichés because no cliché exists at the intersection of the two domains.

The Template:

Analyze [NEW DOMAIN PROBLEM] using the structural logic of [SOLVED DOMAIN].

Specifically:
- What is the equivalent of [SOLVED DOMAIN CONCEPT A] in [NEW DOMAIN]?
- What is the equivalent of [SOLVED DOMAIN CONCEPT B] in [NEW DOMAIN]?
- Where does the analogy break down, and what does that breakdown reveal?

Example Pairings That Produce Exceptional Results:

Solved Domain	→	New Domain	Why It Works
Evolutionary biology (natural selection, niche adaptation)	→	Startup market strategy	Forces thinking about competitive fitness, niche survival, mutation as pivoting
Thermodynamics (entropy, energy states)	→	Team management	Organizations naturally tend toward disorder; energy input required to maintain structure
Poker strategy (position, pot odds, bluffing)	→	Negotiation	Incomplete information games have well-developed mathematical frameworks
Military logistics (supply chains, force projection)	→	Product launch planning	Decades of optimization thinking about resource deployment under uncertainty
Music composition (tension/resolution, counterpoint)	→	Narrative copywriting	Structural patterns of engagement, anticipation, and payoff

The Key Insight: The breakdown points in the analogy are often MORE valuable than the mappings. When you ask "where does this analogy fail?", the model identifies the unique structural properties of your actual domain — the things that make your problem genuinely different.

Consultant Analysis: This is highest-value for strategic planning, competitive analysis, and creative ideation. It converts the AI's breadth of knowledge into genuine strategic insight rather than regurgitated best practices.

Chef's Analysis: This is what happens when you walk into the kitchen and say "Cook me a French dish… but using only Korean ingredients." The chef can't fall back on any existing recipe — there's no cookbook for French-Korean fusion. So they have to think structurally: "Okay, a French mother sauce is built on fat + flavor base + liquid. In Korean cooking, the fat is sesame oil, the flavor base is fermented paste, the liquid is anchovy broth." What comes out is something genuinely new — not in any tradition's cookbook. But here's where it gets really interesting: the breakdown points are the gold. The moment the chef says "wait, the French technique of finishing with butter doesn't work with gochujang because the fermentation chemistry fights the dairy" — that is where you learn something nobody knew about either tradition. The failures of the mashup reveal hidden truths about both ingredients.

Framework 2: Perspective Multiplication (The "Expert Panel")

Exploits

Expert Compression strength (role-activated specialization) — each role activates a different region of compressed expert knowledge

Compensates

Expert Compression weakness (confident hallucination) — contradictions between perspectives surface where the model is interpolating rather than retrieving; Pattern Blending weakness (centroid convergence) — multiple viewpoints cover more of the data space than a single query

Provenance

Role/persona prompting is documented across the prompt engineering literature (2023-present) and in official documentation from Anthropic, OpenAI, and Google. Multi-perspective debate frameworks have been explored in "Debating with More Persuasive LLMs Leads to More Truthful Answers" (Du et al., 2023) [12] and "Society of Mind" prompting. Our contribution: the specific synthesis step requiring explicit conflict identification, the "Hostile Witness" pattern, and the kitchen-panel framing.

The Mechanic: Instead of asking for one answer, you ask the model to generate multiple analyses from structurally different viewpoints. Each role activates a different region of the model's weight space. The VALUE isn't in any single perspective — it's in the gaps and contradictions between them.

The Template:

Analyze [PROBLEM/DECISION] from these three perspectives. Each analyst should identify their top 3 concerns AND directly challenge at least one conclusion from the other analysts:

1. [ROLE A — e.g., "A CFO focused purely on 18-month cash flow"]
2. [ROLE B — e.g., "A customer experience researcher who has interviewed 500 users"]
3. [ROLE C — e.g., "A competitor's head of strategy trying to defeat this product"]

After all three analyses, synthesize: where do all three agree (high-confidence conclusions), where do they disagree (areas requiring more data), and what question would each analyst ask that the others wouldn't think of?

Power Move — The Hostile Witness: Add a fourth role whose job is specifically to destroy the best argument. "Now analyze this as a short-seller writing a public report arguing this company will fail. Use specific data points." This forces the model to steelman the opposition and surface risks you'd otherwise miss.

Consultant Analysis: This replicates the actual structure of high-end consulting — McKinsey doesn't send one analyst, they send a team with different specializations. The framework's real value is in the synthesis step, where contradictions become decision-relevant insight.

Chef's Analysis: You put the same plate of food in front of three completely different people. The pastry chef picks it up and says "the texture's wrong — this needs a crunch element or the mouth gets bored." The grizzled line cook looks at it and says "sure, it's pretty, but I can't make 200 of these on a Friday night without the whole kitchen falling apart." The food critic takes one bite and says "your average diner has no idea what 'umami-forward with a dashi reduction' means — they'll send it back." Three people. Same plate. Three problems that were invisible to each other. No single taster catches all three — the magic is in the disagreements. That's where the real information lives. The spot where the pastry chef and the line cook argue? That's your most important design decision.

Framework 3: Constraint Stacking (The "Creative Prison")

Exploits

Sequential Generation strength (constraint satisfaction) — each constraint eliminates a region of probability space, forcing the model to the creative edges

Compensates

Pattern Blending weakness (centroid convergence) — exclusion constraints specifically eliminate the generic center; Parallel Exploration weakness (centroid clustering) — forces genuinely different solutions by blocking the obvious ones

Provenance

Constraint-based design is a foundational principle in design thinking (IDEO, 1990s onward). Its application to LLM prompting is broadly documented in prompt engineering guides. The specific connection to AI content homogenization research is supported by: the Italy/ChatGPT natural experiment [7], Ahrefs' finding that 74% of new web pages contain AI content [8], and the PNAS study on LLM plot diversity collapse [1]. Our contribution: the three-tier constraint structure (hard/exclusion/quality), the emphasis on exclusion constraints as the primary anti-homogenization lever, and the probability-space elimination framing.

The Mechanic: Every constraint you add eliminates a region of the model's output probability space. Add enough constraints and the "safe, average" center becomes impossible — the model MUST find solutions at the creative edges. The metaphor: jazz musicians produce their most creative work within rigid structures (12-bar blues, specific key signatures). The constraints aren't limitations — they're forcing functions for creativity.

The Template:

[TASK DESCRIPTION]

Hard Constraints (must satisfy ALL):
- Maximum [length/budget/time]
- Must include [specific element]
- Must work for [specific audience]
- Must be implementable by [specific resource level]

Exclusion Constraints (must avoid ALL):
- Do NOT use [common approach A]
- Do NOT include [cliché B]
- Do NOT assume [typical assumption C]
- Exclude any solution requiring [unavailable resource]

Quality Constraints:
- Every recommendation must include a specific first step executable in under 1 hour
- Every claim must include a way to verify it
- Prefer counterintuitive solutions over obvious ones

Example: Instead of "Give me marketing ideas for my SaaS product," try:

Give me 5 customer acquisition strategies for a B2B SaaS product ($49/mo price point, developer audience).

Hard constraints: Each strategy must cost under $200 to test, take less than 2 weeks to show initial signal, and be executable by a single person.

Exclusions: No paid social media ads. No content marketing. No cold email. No referral programs. No Product Hunt launches. These are what everyone does.

Quality: For each strategy, include the specific first action I take tomorrow morning, the metric I measure after 2 weeks, and one real company that used this approach.

The exclusion list is the key lever. By eliminating the 5 most common strategies, you force the model past the centroid of "SaaS marketing advice" into less-traveled territory.

Consultant Analysis: This is the framework for when you need differentiated strategy, not best-practice regurgitation. The exclusion constraints should specifically list what your competitors are already doing — forcing the model to find approaches they aren't using.

Chef's Analysis: You walk into the kitchen and say: "Make me dessert. But no dairy. No refined sugar. It has to serve four people. And it can't be a fruit salad." Now watch what happens. No dairy? There goes crème brûlée, ice cream, panna cotta — a whole universe of lazy answers, gone. No sugar? Say goodbye to most pastry. Can't be fruit salad? That kills the laziest fallback of all. Every constraint is like closing an escape route. You're backing the chef into a corner — and that's the point. With all the easy doors locked, the chef is forced to get genuinely weird: maybe a coconut milk mousse with date caramel and smoked salt. Maybe a black sesame halvah with roasted plums. Dishes they never would have reached for if you'd left the easy doors open. The constraints aren't handcuffs — they're trampolines. They don't limit creativity, they launch it.

Framework 4: Recursive Decomposition (The "Fractal Zoom")

Exploits

Edit > Generate strength (iterative refinement) — each pass operates at a manageable complexity level with built-in self-correction

Compensates

Sequential Generation weakness (greedy local optimization / planning failures) — decomposition prevents the model from committing to a flawed global plan; each subproblem is small enough to solve reliably

Provenance

Builds on Chain-of-Thought prompting (Wei et al., 2022) [13], Plan-and-Solve prompting (Wang et al., 2023) [14], and Least-to-Most prompting (Zhou et al., 2023). The planning failure problem it addresses was formalized in "Why Reasoning Fails to Plan" (January 2026), which demonstrated that LLM step-by-step reasoning creates greedy shortcuts causing rapid degradation on complex multi-step tasks [15]. Standard project management decomposition (Work Breakdown Structure) is the non-AI predecessor. Our contribution: the specific three-pass template with built-in stress testing, and the tasting-menu decomposition analogy.

The Mechanic: LLMs fail at complex planning because they try to hold the entire plan in working memory simultaneously, creating greedy shortcuts. The fix: decompose the problem into levels, solve each level independently, then compose the results. Like a fractal — each zoom level reveals more detail, but you only need to reason about one level at a time.

The Template (3 passes):

Pass 1 — The Architecture (5 minutes):

I need to [GOAL]. Don't solve this yet. Instead, break this into 3-5 major phases. For each phase, tell me:
- What it accomplishes
- What it depends on (prerequisites)
- What could go wrong
- How I'd know it's done (success criteria)

Pass 2 — The Detail (per phase):

Now zoom into Phase [N]. Break it into specific tasks. For each task:
- Exact steps to execute
- Time estimate
- Tools/resources needed
- The most likely failure mode and how to detect it early

Pass 3 — The Stress Test:

Review the complete plan. Identify:
- The single point of failure most likely to kill the entire project
- The task where my time estimate is probably wrong (and why)
- The dependency I haven't accounted for
- What I should do FIRST to retire the biggest risk as early as possible

Why This Works: Each pass operates at a manageable complexity level. The model never needs to reason about 20 steps simultaneously — just 3-5 at a time. And Pass 3 exploits the model's strong critique capabilities to catch the planning errors that Pass 1 inevitably introduced.

Consultant Analysis: This is how actual project managers de-risk complex initiatives — progressive elaboration with risk identification at each level. The framework converts the AI from a one-shot planner (where it's unreliable) into an iterative planning assistant (where it's strong).

Chef's Analysis: Imagine asking a chef to cook a 12-course tasting menu all at once — every pot going, every plate spinning, every timer running simultaneously. Disaster. Burnt risotto, cold soup, overcooked fish. The fix? You don't cook 12 courses. You cook one course 12 times. First, you zoom out and sketch the whole menu on paper: "Okay, we start light, build to heavy, end with something refreshing." That's the architecture. Then you zoom into each course and nail the details: ingredients, timing, technique. Then — and this is the move — you step back and taste the whole progression before anyone sits down. "Wait. Course 4 and Course 7 are both heavy cream-based. That'll wreck the pacing. Swap Course 7 for something acidic." You caught the problem because you reviewed the whole arc after getting the details right, instead of trying to hold everything in your head from the start. One level at a time. That's how you cook a banquet without burning down the kitchen.

Framework 5: Exemplar Anchoring (The "Show Don't Tell")

Exploits

Pattern Blending strength (cross-domain structure transfer) — the model pattern-matches on provided examples with high fidelity; Sequential Generation strength (structured output fidelity) — examples act as implicit structural constraints

Compensates

Pattern Blending weakness (centroid convergence) — examples anchor output away from the generic center toward your specific standard; Expert Compression weakness (confident hallucination) — concrete examples reduce the space for fabrication by constraining the output pattern

Provenance

Few-shot prompting is foundational, originating in GPT-3's paper (Brown et al., 2020) [16]. The technique is one of the most well-documented in all of prompt engineering. Anti-examples / negative examples have been explored in instruction-tuning and RLHF literature. Our contribution: the specific template structure with anti-example, the example-count decision guide, and the reference-plate analogy.

The Mechanic: The model is a pattern-completion machine. If you show it 2-3 examples of exactly what you want, it will pattern-match on those examples far more reliably than it will follow abstract instructions. This is few-shot prompting, but the power move is in how you select your examples.

The Template:

I need [TASK DESCRIPTION]. Here are 2-3 examples of the quality and format I want:

EXAMPLE 1 (best example — the gold standard):
[paste your best example of the output you want]

EXAMPLE 2 (acceptable but different angle):
[paste another example showing a different valid approach]

ANTI-EXAMPLE (what I do NOT want):
[paste an example of the generic/bad output you want to avoid]

Now produce [N] new outputs following the pattern of Examples 1-2 while avoiding the pattern of the Anti-Example. [SPECIFIC INSTRUCTIONS]

The Anti-Example is the Secret Weapon: Showing the model what you DON'T want is often more powerful than showing what you do want. It creates a clear boundary in the output space — "everything in this region is wrong." The model then optimizes within the remaining space.

When to Use Which Example Count:

1 example: When you need consistent format/structure
2 examples: When you want the model to identify the underlying pattern (it triangulates between the two)
3 examples: When the pattern is subtle or the domain is specialized
1 example + 1 anti-example: When your biggest problem is the model producing generic output

Consultant Analysis: This is the single most reliable framework for any production workflow where you need consistent, high-quality output. It's the difference between a creative brief that says "make it punchy" versus one that includes three reference ads and says "like this." Always choose showing over telling.

Chef's Analysis: Instead of telling the chef "make it punchy" (what does that even mean?), you pull out your phone and show them three photos: "I loved this dish at that Italian place. I loved this one from the Thai restaurant downtown. And this one from the tasting menu last month." Now the chef isn't guessing — they're pattern-matching on concrete examples. They can see the thread: "Oh, you like bold acid, textural contrast, and dramatic plating." But the real power move is the anti-example — showing them the one dish you hated. "See this? This bland, overdressed, lukewarm hotel banquet chicken? Not this." That single bad example draws a line in the sand. The chef now has a force field around the territory to avoid, and three North Stars to aim toward. Showing always beats telling. Three photos on the counter communicate more than a hundred words of instruction.

Framework 6: Inversion Prompting (The "Backward Oracle")

Exploits

Expert Compression strength (role-activated specialization) — activates the model's rich compressed knowledge of failure patterns and post-mortems; Edit > Generate strength (iterative refinement) — the invert step is an edit/transform of the failure plan, not generation from scratch

Compensates

Expert Compression weakness (confident hallucination) — failure modes are more concrete and verifiable than success playbooks, reducing fabrication; Sequential Generation weakness (planning failures) — inverting known failure modes produces more robust plans than forward planning

Provenance

The pre-mortem technique was developed by cognitive psychologist Gary Klein and first described in his book The Power of Intuition (2003) [17]. Research by Mitchell, Russo & Pennington (1989) found that prospective hindsight increases the ability to correctly identify reasons for future outcomes by 30% [18]. Charlie Munger popularized "inversion thinking" in investing contexts. Application of pre-mortems to AI-assisted product development is documented in practitioner literature (e.g., AI Prompt Hackers, 2025) [19]. Our contribution: the specific three-step LLM template (generate failure → invert → check blind spots), the argument about asymmetric training data density (failure_patterns > success_patterns), and the restaurant-failure analogy.

The Mechanic: LLMs have been trained on vast amounts of post-mortem analysis, failure case studies, and critiques. This training data is underutilized because people only ask the model to generate positive plans. Inversion prompting asks the model to generate the negative plan first — all the ways something could fail — and then inverts that into a robust positive strategy.

The Template:

Step 1: "You are a consultant hired to make [PROJECT/COMPANY/STRATEGY] fail as completely as possible within 12 months. Create a detailed plan for guaranteed failure. Be specific — which decisions would you make, which signals would you ignore, which mistakes are most common?"

Step 2: "Now invert every element of that failure plan into a specific preventive action or success strategy. For each failure mode, what is the exact opposite behavior, and how would I implement it?"

Step 3: "Which items from the failure plan are things I might already be doing without realizing it?"

Why This Works Better Than "What Should I Do?": When you ask "how do I succeed?", you get the statistical average of success advice. When you ask "how would this fail?", you get specific, concrete failure modes drawn from actual case studies and post-mortems. The model's training data contains vastly more analysis of failure than playbooks for success — because humans write more post-mortems than victory laps. Step 3 is where the real value lives: it surfaces blind spots.

Consultant Analysis: This is literally the pre-mortem technique used by elite military planners and management consultants. Klein's research shows pre-mortems increase the ability to identify reasons for failure by 30% [18]. The AI amplifies this because it has access to failure patterns across every industry simultaneously.

Chef's Analysis: Instead of asking "what should be on the menu?", you flip the whole thing upside down: "Pretend you're trying to destroy this restaurant. How would you make it fail a health inspection, get annihilated by the food critic, and lose every good cook on staff within six months?" Suddenly the answers get terrifyingly specific: "Easy — skip the deep-clean on the prep station, build a wine list that's pretentious but poorly curated so regulars feel stupid, and cancel family meal so the line cooks feel like disposable labor and quit." Each failure mode is a laser-pointed insight. Now you just flip every one into a protective action: obsessive prep-station hygiene, a wine list that educates instead of intimidates, a family meal that makes the team feel like family. But the real gut-punch is Step 3 — "which of these am I already doing?" The chef who's been skipping family meal suddenly sees it in a completely different light. That's the mirror moment. The failures you're already committing are the ones that matter most.

Framework 7: Diverge-Converge Cycling (The "Diamond Protocol")

Exploits

Parallel Exploration strength (rapid breadth search) — the diverge phase generates many candidates quickly; Edit > Generate strength (iterative refinement) — the converge/hybridize phases refine through evaluation and combination

Compensates

Parallel Exploration weakness (centroid clustering) — explicitly demanding "10 fundamentally different approaches" with distinct core logic forces exploration past the cluster; Sequential Generation weakness (greedy local optimization) — the hybrid step combines structural strengths that the model wouldn't assemble in a single-shot generation

Provenance

Directly adapted from IDEO's "Double Diamond" design thinking process (British Design Council, 2005) [20] and McKinsey's hypothesis-driven problem solving methodology. The diverge-then-converge pattern is foundational in design thinking, creative problem solving, and innovation management. Our contribution: the specific 4-phase LLM template, the "10 not 3" insight about exhausting obvious approaches, and the tasting-menu hybridization analogy.

The Mechanic: The most common mistake is asking for "the best" answer. This forces the model to converge immediately on a single solution — and that solution will always be the most probable (i.e., most generic) one. The Diamond Protocol alternates between divergent phases (generate many options) and convergent phases (evaluate and select), mimicking the actual process used by designers and strategists.

The Template:

PHASE 1 — DIVERGE: "Generate 10 fundamentally different approaches to [PROBLEM]. I don't want 10 variations of the same idea — I want 10 approaches that differ in their core logic. Label each with a 2-word name."

PHASE 2 — EVALUATE: "Score each approach on three criteria: [CRITERION 1], [CRITERION 2], [CRITERION 3]. Use a 1-5 scale. Show the scoring matrix."

PHASE 3 — HYBRIDIZE: "Take the top 3 approaches. Identify what's strongest about each. Now design a hybrid approach that combines the best elements of all three while being internally consistent."

PHASE 4 — STRESS TEST: "Argue against the hybrid. What's the strongest case that this approach will fail? What assumption is most likely wrong?"

Why 10 and not 3: At 3 options, the model produces variations of the same idea. At 10, it's forced to explore genuinely different regions of solution space. Options 7-10 are often the most interesting because the obvious approaches are exhausted by option 5.

Consultant Analysis: This replicates McKinsey's "hypothesis-driven problem solving" and IDEO's design thinking process. The key insight: the hybrid from Phase 3 is almost always better than any single option from Phase 1, because it combines structural strengths the model wouldn't have assembled in a single-shot generation.

Chef's Analysis: You tell the chef: "Give me 10 completely different appetizers for a spring menu. And I don't mean 10 salads with different toppings — I mean 10 fundamentally different ideas." At 3 options, you get salad, soup, and a tartare. Boring. The obvious answers. But at 10, the chef runs out of obvious by option 5 and is forced into genuinely weird territory: a warm broth served in a teacup, a raw crudo with frozen olive oil snow, a fermented pickle bite that wakes up the whole mouth, a bread-and-butter course reimagined as a savory doughnut. Options 7 through 10 are where the surprises live — because the safe ideas are already used up. Then you taste the top 3 and ask the real question: "What if we took the brightness from the crudo, the warmth from the broth, and the texture from the bread course… and combined them into one dish?" The Frankenstein hybrid is almost always better than any of its parents, because it combines strengths the chef would never have assembled in a single first attempt.

Framework 8: Context Priming (The "Pre-Game Brief")

Exploits

Context Window strength (domain priming) — strategically loading the context window activates the most relevant knowledge clusters

Compensates

Context Window weakness (positional degradation / finite capacity) — the front-loading rule and the verification step mitigate lost-in-the-middle and hallucinated gap-filling; Expert Compression weakness (confident hallucination) — Step 3 forces the model to distinguish between context-grounded claims and interpolated ones

Provenance

"Context engineering" was coined and popularized by Andrej Karpathy and Shopify CEO Tobi Lütke in June 2025 [21]. Karpathy described it as "the delicate art and science of filling the context window with just the right information for the next step" [22]. Anthropic formalized the concept in September 2025 [23]. The "lost in the middle" problem that informs the front-loading rule was first documented by Liu et al. (Stanford, 2023) [24] and confirmed at scale by NVIDIA's RULER benchmark [25] and Stanford's HELM Long Context evaluation [26]. By late 2025, an academic survey of 1,300+ papers formalized context engineering as a distinct discipline [27]. Our contribution: the specific three-step template with built-in hallucination verification (Step 3), the 20%/10% front-loading rule, and the mise en place analogy.

The Mechanic: Instead of asking the AI a question cold, you first load its context window with the specific domain knowledge it needs. This isn't just "providing background" — it's strategically selecting which training distributions you want to activate. The model doesn't retrieve from external memory; it pattern-matches against what's in the context window with much higher weight than its general training.

The Template:

STEP 1 — PRIME: "Here is the key context for this task:

[DOMAIN DOCUMENT 1 — e.g., your company's actual data, real metrics, specific constraints]
[DOMAIN DOCUMENT 2 — e.g., the competitive landscape as you understand it]
[STYLE REFERENCE — e.g., a previous deliverable you liked]

Important context the model should know:
- [SPECIFIC FACT 1 that contradicts common assumptions]
- [SPECIFIC FACT 2 about your unique situation]
- [SPECIFIC CONSTRAINT that wouldn't be obvious]"

STEP 2 — TASK: "Given everything above, [SPECIFIC QUESTION/TASK]."

STEP 3 — VERIFY: "What assumptions did you make that weren't stated in the context I provided? Flag anything you inferred rather than read directly."

Why Step 3 Matters: This catches hallucination. The model will often "fill in gaps" with plausible-sounding but fabricated information. By asking it to flag its assumptions, you create a separation between context-grounded claims and invented ones.

Critical Rule: Front-load important information. Due to the "lost in the middle" problem [24], place your most critical context in the first 20% and last 10% of your prompt. Anything buried in the middle gets disproportionately less attention. Stanford's HELM evaluation found the best model scored just 0.588/1.0 on long-context tasks at 128K tokens [26].

Consultant Analysis: This is the difference between a consultant who reads your brief before the meeting versus one who shows up cold. Same consultant, dramatically different output. The quality of your priming documents directly determines the quality of the AI's output.

Chef's Analysis: Two identical chefs. Same skills, same knives, same kitchen. Chef A walks in cold — no idea who's eating, what they like, or what the occasion is. "What do you want?" Chef B gets a full briefing first: "It's a couple celebrating their anniversary. She's allergic to shellfish. He loves bold spice but she prefers subtle. They had a heavy lunch so keep it light. Last time they came, they raved about the ceviche." Same chef. Dramatically different meal. Chef B is cooking with a GPS; Chef A is driving blindfolded. But the real power move is Step 3 — asking the chef after cooking: "What did you assume that I didn't actually tell you?" That's where the chef says "I assumed they drink wine" or "I guessed this was a casual night, not a special occasion." Those gaps between what you briefed and what the chef invented on their own? That's exactly where mistakes hide. Surfacing the assumptions is how you catch them before the food hits the table.

Part 3: The Meta-Framework (Choosing Which Framework to Use)

Situation	Primary Framework	Supporting Framework	Why
"I need a strategy/plan"	Recursive Decomposition (#4)	Inversion Prompting (#6)	Exploits Edit > Generate strength (iterative refinement) to decompose complexity; compensates Sequential Generation weakness (planning failures) via decomposition. Inversion then exploits Expert Compression strength (failure-pattern knowledge) to stress-test assumptions the decomposition missed.
"I need creative/distinctive ideas"	Constraint Stacking (#3)	Cross-Domain Synthesis (#1)	Exploits Sequential Generation strength (constraint satisfaction) to eliminate the Pattern Blending weakness centroid. Cross-Domain then exploits Pattern Blending strength (structure transfer) to provide a generative template from an unexpected domain — constraints are subtractive, synthesis is additive.
"I need to analyze a decision"	Perspective Multiplication (#2)	Diverge-Converge (#7)	Exploits Expert Compression strength (role specialization) to generate rich contradictions from multiple expert lenses. Diverge-Converge then exploits Parallel Exploration strength (breadth search) + Edit > Generate strength (refinement) to systematically score and hybridize, compensating for Parallel Exploration weakness (clustering).
"I need consistent production output"	Exemplar Anchoring (#5)	Context Priming (#8)	Exploits Pattern Blending strength (pattern matching on examples) + Sequential Generation strength (structured fidelity) to anchor output to your standard. Context Priming exploits Context Window strength (domain priming) to load specific data, compensating for Expert Compression weakness (hallucination) via the verification step.
"I need to find risks/blind spots"	Inversion Prompting (#6)	Perspective Multiplication (#2)	Exploits Expert Compression strength (rich failure-pattern knowledge) to surface concrete failure modes. Perspective Multiplication then stress-tests from multiple angles, compensating for Pattern Blending weakness (centroid) by covering more of the data space.
"I need to explore a new market/domain"	Cross-Domain Synthesis (#1)	Diverge-Converge (#7)	Exploits Pattern Blending strength (structure transfer) to map known frameworks onto unknown territory. Diverge-Converge then exploits Parallel Exploration strength (breadth search) to systematically evaluate options, compensating for Parallel Exploration weakness (clustering) via explicit diversity forcing.
"I need to improve existing work"	Exemplar Anchoring (#5) with anti-examples	Constraint Stacking (#3)	Exploits Pattern Blending strength (pattern matching) by showing the current version + what's wrong. Constraint Stacking exploits Sequential Generation strength (constraint satisfaction) to block the bad patterns, compensating for Pattern Blending weakness (centroid) by making the old approach impossible.

Pairing principle: Each primary framework has a structural blind spot mapped to a specific mechanic weakness. The supporting framework is chosen to patch it by exploiting a complementary mechanic strength. Decomposition (strong on Edit > Generate, compensates Sequential Generation weakness) produces internally coherent but potentially naive plans → Inversion (strong on Expert Compression) stress-tests assumptions. Constraints (strong on Sequential Generation, compensates Pattern Blending weakness) are subtractive but not generative → Cross-Domain (strong on Pattern Blending) provides a structural template. Perspectives (strong on Expert Compression) generate rich contradictions but no resolution → Diverge-Converge (strong on Parallel Exploration + Edit > Generate) provides scoring and hybridization.

Part 4: Compounding — Chaining Frameworks Together

The real power users don't use one framework per task. They chain them:

The Full-Stack Research Protocol:

Context Prime (#8) → Load all your domain knowledge (exploits Context Window strength)
Cross-Domain Synthesis (#1) → Generate an unconventional analytical lens (exploits Pattern Blending strength)
Perspective Multiplication (#2) → Run the analysis through 3 expert views (exploits Expert Compression strength)
Inversion (#6) → Pre-mortem the conclusions (compensates Sequential Generation weakness)
Constraint Stack (#3) → Force actionable, non-generic recommendations (compensates Pattern Blending weakness)
Exemplar Anchor (#5) → Format the output to match your deliverable standard (exploits Sequential Generation strength)

Each step takes 2-5 minutes. Total time: 15-30 minutes. Quality: comparable to a junior analyst working 2-3 days.

The Product Launch Protocol:

Recursive Decomposition (#4) → Break the launch into phases (compensates Sequential Generation weakness)
Inversion (#6) → "How would this launch fail?" for each phase (exploits Expert Compression strength)
Diverge-Converge (#7) → Generate 10 GTM approaches, evaluate, hybridize (exploits Parallel Exploration strength)
Constraint Stack (#3) → Force strategies within your actual budget/team/timeline (exploits Sequential Generation strength)
Perspective Multiplication (#2) → Evaluate from customer, competitor, and investor POV (exploits Expert Compression strength)

Appendix A: The Chef's Unified Model

If you think of the LLM as an infinitely experienced chef, the 6 core mechanics are kitchen characteristics and the 8 frameworks are cooking techniques that play to those characteristics:

Core Mechanics as Kitchen Characteristics:

Mechanic	Kitchen Characteristic	Strength	Weakness
Compressed Pattern Blending	Has cooked in every restaurant on the planet — taco trucks to Michelin stars	Can mash up any two cuisines into something no single cookbook contains — like a kid combining LEGO sets	Without direction, every cuisine votes at once and you get beige — the dish everybody already likes
Sequential Token Generation	Squeezes food out of a tube — once it's down, it's down	Follows a precise recipe like a paint-by-numbers masterpiece	Can't scrape the sauce off and start over; early choices lock doors behind them
Context Window as Working Memory	An amnesiac chef who can only cook with what's on the counter	Load the counter with your exact ingredients and they cook circles around a cold kitchen	Counter is only so big; the stuff in the middle gets buried and forgotten
Lossy Expert Compression	Spent a week in every famous kitchen on earth — absorbed the vibe, not the exact recipes	"Cook Japanese" and "cook Italian grandma" produce entirely different meals from the same pantry	Sometimes "remembers" techniques they never learned — confabulates with total confidence and can't tell the difference
Edit > Generate Asymmetry	Tasting and adjusting beats nailing it on the first pour	The make-taste-fix loop produces dramatically better results than a single blind pass	Most people use this chef like a vending machine — one button, one bag of chips, done
Parallel Exploration Capacity	50 hands plating 50 appetizers in the time you butter toast	Massive parallel exploration of the possibility space	Without explicit direction to vary technique, all 50 dishes are cousins — different garnish, same DNA

Frameworks as Cooking Techniques:

Framework	Kitchen Equivalent	Why It Improves Results
Cross-Domain Synthesis	"Cook French food with only Korean ingredients"	Produces dishes that don't exist in any tradition's cookbook; breakdowns reveal hidden truths
Perspective Multiplication	Pastry chef, line cook, and food critic all taste the same plate	Three invisible problems surface in the disagreements between them
Constraint Stacking	"No dairy, no sugar, can't be a fruit salad"	Closes every escape route; the chef is backed into genuine creativity
Recursive Decomposition	Planning a 12-course menu one course at a time, then tasting the arc	Each course is simple enough to nail; the review catches pacing problems
Exemplar Anchoring	Three photos of dishes you loved + one of a dish you hated	Shows instead of tells; the anti-example draws a force field around bad territory
Inversion Prompting	"How would you destroy this restaurant?" then invert every answer	Failure modes are terrifyingly specific; success advice is vague
Diverge-Converge	10 fundamentally different appetizers → taste top 3 → Frankenstein hybrid	The hybrid combines strengths no single first-attempt dish would have
Context Priming	Full anniversary-dinner briefing before the chef touches a pan	Same chef, GPS vs. blindfold — the briefing is the difference

The fundamental principle: Asking a generic question is like walking into a restaurant, sitting down, and saying "make me food." You'll get something edible. You'll never get something memorable. Every framework above is a way of giving the chef better directions — locking certain doors so they're forced to find new ones, showing them photos of what you loved, briefing them on who's eating, or asking "how could this go horribly wrong?" The chef is brilliant either way. The difference between a forgettable meal and an unforgettable one was never the chef's talent — it was your instructions.

Appendix B: Sources

[1] "Echoes in AI: Quantifying lack of plot diversity in LLM outputs." Proceedings of the National Academy of Sciences (PNAS), 2025. https://www.pnas.org/doi/10.1073/pnas.2504966122

[2] "We're Different, We're the Same: Creative Homogeneity Across LLMs." arXiv, January 2025. https://arxiv.org/html/2501.19361v1

[3] "Homogenizing effect of large language models (LLMs) on creative diversity: An empirical comparison of human and ChatGPT writing." ScienceDirect, 2025. https://www.sciencedirect.com/science/article/pii/S294988212500091X

[4] Yasunaga, M. et al. "Large Language Models as Analogical Reasoners." ICLR 2024. https://arxiv.org/pdf/2310.01714

[5] "LacMaterial: Large Language Models as Analogical Chemists for Materials Discovery." arXiv, October 2025. https://arxiv.org/html/2510.22312

[6] Yu, L. et al. "Thought Propagation: An Analogical Approach to Complex Reasoning with Large Language Models." arXiv, October 2023. https://arxiv.org/html/2310.03965v1

[7] Liu, C., Wang, T. & Yang, S.A. "Generative AI and Content Homogenization: The Case of Digital Marketing." SSRN, 2025. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5367123

[8] "74% of New Webpages Include AI Content (Study of 900k Pages)." Ahrefs, 2025. https://ahrefs.com/blog/what-percentage-of-new-content-is-ai-generated/

[9] Madaan, A. et al. "Self-Refine: Iterative Refinement with Self-Feedback." NeurIPS 2023. https://arxiv.org/abs/2303.17651

[10] Bai, Y. et al. "Constitutional AI: Harmlessness from AI Feedback." Anthropic, 2022. https://arxiv.org/abs/2212.08073

[11] Gentner, D. "Structure-Mapping: A Theoretical Framework for Analogy." Cognitive Science 7(2), 1983, pp. 155-170.

[12] Du, Y. et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." arXiv, 2023. https://arxiv.org/abs/2305.14325

[13] Wei, J. et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. https://arxiv.org/abs/2201.11903

[14] Wang, L. et al. "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models." ACL 2023. https://arxiv.org/abs/2305.04091

[15] "Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents." arXiv, January 2026. https://arxiv.org/abs/2601.22311

[16] Brown, T. et al. "Language Models are Few-Shot Learners." NeurIPS 2020. https://arxiv.org/abs/2005.14165

[17] Klein, G. The Power of Intuition: How To Use Your Gut Feelings To Make Better Decisions At Work. Currency/Doubleday, 2003.

[18] Mitchell, D.J., Russo, J.E. & Pennington, N. "Back to the future: Temporal perspective in the explanation of events." Journal of Behavioral Decision Making 2(1), 1989, pp. 25-38.

[19] "Pre-Mortem Your Product Launch Before It Crashes." AI Prompt Hackers, August 2025. https://www.aiprompthackers.com/p/pre-mortem-your-product-launch-before

[20] British Design Council. "The Double Diamond: A universally accepted depiction of the design process." Design Council, 2005/2019. https://www.designcouncil.org.uk/our-resources/the-double-diamond/

[21] Lütke, T. Post on X (formerly Twitter), June 18, 2025. "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."

[22] Karpathy, A. Post on X (formerly Twitter), June 25, 2025. https://x.com/karpathy/status/1937902205765607626

[23] Anthropic. Formalization of context engineering concept, September 2025 (referenced in Tao An, "Context Engineering Is Replacing Prompt Engineering for Production AI," Medium, December 2025).

[24] Liu, N.F. et al. "Lost in the Middle: How Language Models Use Long Contexts." Stanford, 2023. https://arxiv.org/abs/2307.03172

[25] Hsieh, C.Y. et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?" NVIDIA, 2024. https://arxiv.org/abs/2404.06654

[26] Stanford CRFM. "HELM Long Context." September 2025. https://crfm.stanford.edu/2025/09/29/helm-long-context.html

[27] Referenced in Marin, J. "Context Engineering vs. Prompt Engineering." Medium / Data Science Collective, October 2025. https://medium.com/data-science-collective/context-engineering-vs-prompt-engineering-3493c2925e99

[28] "AI models collapse when trained on recursively generated data." Nature, 2024. https://www.nature.com/articles/s41586-024-07566-y

[29] "Strong Model Collapse." OpenReview / ICLR 2025. https://openreview.net/forum?id=et5l9qPUhm — Found that even contamination of 1 in 1,000 data points with synthetic content can trigger collapse.