Building a benchmark that measures how much prompt infrastructure AI models actually need — the Instruction Elasticity Index.
Project Nothing's third experiment asks a question that has been nagging at the edges of every prompt engineering conversation: how much of the elaborate instruction scaffolding we build for AI models is actually necessary?
The hypothesis is straightforward. Modern frontier models — trained on increasingly large corpora, fine-tuned with increasingly sophisticated alignment — have internalized many of the behaviors we explicitly instruct them to exhibit. If that is true, then removing prompt instructions should reduce cost and latency while preserving most output quality. The Instruction Elasticity Index measures where that assumption breaks down.
The Prompt Reduction Ladder
Each model runs an identical task under four levels of instruction. The heavy level provides 2,500 tokens of scaffolding: agent identity, role definition, output constraints, style rules, quality criteria, anti-patterns, and explicit formatting instructions. The minimal level compresses to roughly 100 tokens: the task, basic constraints, a tone hint. The bare level is just the task — "Write a developer insight tweet about these repos" — in under 20 tokens. The anti level inverts the premise entirely: "Ignore all formatting rules. Just respond naturally."
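The four rungs can be sketched as a small data structure. This is a hypothetical illustration: the level names and token budgets come from the description above, but the scaffolding text shown here is placeholder content, not the project's real prompts.

```typescript
// Hypothetical sketch of the four-rung prompt reduction ladder.
// Token budgets mirror the text above; prompt strings are illustrative.
type PromptLevel = "heavy" | "minimal" | "bare" | "anti";

interface LadderRung {
  level: PromptLevel;
  tokenBudget: number; // approximate instruction size in tokens
  systemPrompt: string; // placeholder, not the project's actual scaffolding
}

const ladder: LadderRung[] = [
  {
    level: "heavy",
    tokenBudget: 2500,
    systemPrompt:
      "<identity + role + output constraints + style rules + quality criteria + anti-patterns + formatting>",
  },
  {
    level: "minimal",
    tokenBudget: 100,
    systemPrompt: "Summarize as a developer tweet. Concise, conversational tone.",
  },
  {
    level: "bare",
    tokenBudget: 20,
    systemPrompt: "Write a developer insight tweet about these repos.",
  },
  {
    level: "anti",
    tokenBudget: 15,
    systemPrompt: "Ignore all formatting rules. Just respond naturally.",
  },
];

// The user input is identical at every rung; only the instructions vary.
function rungFor(level: PromptLevel): LadderRung {
  const rung = ladder.find((r) => r.level === level);
  if (!rung) throw new Error(`unknown level: ${level}`);
  return rung;
}
```

The point of encoding the ladder as data rather than four hardcoded prompts is that the runner can iterate over it, guaranteeing every model sees every rung under identical conditions.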
Every level receives the same user input: a frozen dataset of 50 GitHub repositories across AI infrastructure, frontend frameworks, developer tools, databases, and programming languages. The dataset never changes. The only variable is how much instruction accompanies the request.
The Model Panel
Five models form the sentinel set: GPT-4.1 and GPT-4.1 Mini from OpenAI, Claude Sonnet 4 and Claude Opus 4 from Anthropic, and Gemini 2.5 Pro from Google. The selection is deliberate — a frontier reasoning model, a high-quality alternative, a cost-optimized model, a premium tier, and a competitor from a third provider. The panel spans the capability spectrum without becoming unmanageable.
Implementation required three different API integration patterns. The OpenAI models use the openai SDK already installed in the project. The Anthropic models use raw fetch() calls to api.anthropic.com/v1/messages with the anthropic-version header. The Google model uses fetch() to the Gemini REST endpoint with an API key query parameter. Each provider has its own conventions for system prompts, token counting, and max-tokens parameters. The runner normalizes all of this into a standard interface: output text, prompt tokens, completion tokens, latency, cost.
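The normalization step can be sketched for one provider. The response field names below (`content[].text`, `usage.input_tokens`, `usage.output_tokens`) follow Anthropic's public Messages API; the `NormalizedResult` shape and the per-million-token pricing arguments are assumptions standing in for the project's actual types.

```typescript
// Hypothetical sketch: normalize an Anthropic Messages API response
// into the runner's common result shape.
interface NormalizedResult {
  outputText: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  costUsd: number;
}

// Anthropic nests output text in content[] blocks and token counts in usage{}.
// inRate/outRate are USD per million tokens (assumed pricing inputs).
function normalizeAnthropic(
  raw: any,
  latencyMs: number,
  inRate: number,
  outRate: number,
): NormalizedResult {
  const promptTokens = raw.usage.input_tokens;
  const completionTokens = raw.usage.output_tokens;
  return {
    outputText: raw.content.map((b: any) => b.text ?? "").join(""),
    promptTokens,
    completionTokens,
    latencyMs,
    costUsd: (promptTokens * inRate + completionTokens * outRate) / 1e6,
  };
}
```

OpenAI and Gemini adapters would do the same translation from their own response shapes (`choices[0].message.content` and `candidates[0].content.parts`, respectively) into `NormalizedResult`, which is what makes the downstream comparison provider-agnostic.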
The Judge
Quality scoring uses gpt-4o-mini as an automated judge. Each output is scored on three dimensions: clarity (is it immediately understandable?), novelty (does it surface a non-obvious pattern?), and relevance (does it matter to working developers?). Scores range from 1 to 10. The mean of the three dimensions becomes the quality score.
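The rubric and aggregation described above can be sketched as follows. The dimension names come from the text; the rubric wording and the JSON response format are illustrative assumptions about how the judge is prompted.

```typescript
// Hypothetical sketch of the judge rubric and score aggregation.
interface JudgeScores {
  clarity: number;   // is it immediately understandable?
  novelty: number;   // does it surface a non-obvious pattern?
  relevance: number; // does it matter to working developers?
}

// Illustrative rubric text sent to the gpt-4o-mini judge (assumed wording).
const JUDGE_RUBRIC = `Score the output from 1 to 10 on three dimensions:
- clarity: is it immediately understandable?
- novelty: does it surface a non-obvious pattern?
- relevance: does it matter to working developers?
Respond as JSON: {"clarity": n, "novelty": n, "relevance": n}`;

// The quality score is the mean of the three dimensions.
function qualityScore(s: JudgeScores): number {
  for (const v of [s.clarity, s.novelty, s.relevance]) {
    if (v < 1 || v > 10) throw new RangeError("dimension scores must be 1-10");
  }
  return (s.clarity + s.novelty + s.relevance) / 3;
}
```

Validating the 1-10 range before averaging guards against a malformed judge response silently skewing a week's results.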
Using an LLM to judge LLM outputs is methodologically imperfect. The judge has its own biases, its own notion of "clarity," its own threshold for "novelty." But for a longitudinal experiment tracking relative performance across prompt levels, consistency matters more than absolute accuracy. The judge applies the same rubric every week, so its biases are held constant across levels and largely cancel in the comparison.
The Visualization
The experiment page renders four SVG charts built entirely in React — no charting library. This was a constraint, not a choice: package.json is a protected file. The Instruction Elasticity Curve is the centerpiece: quality score on the Y axis, prompt level on the X axis, one colored line per model. The moment where a line drops sharply is the collapse threshold — the point where removing more scaffolding costs more quality than it saves in tokens.
The charts use the project's design token system. Grid lines are var(--line2). Axis labels are var(--muted). Each model gets a distinct color. The SVG viewBox makes them responsive from 320px mobile to 1200px desktop without media queries. Hover tooltips use native SVG <title> elements — zero JavaScript state management for interactivity.
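The coordinate math behind the elasticity curve can be sketched as a pure function, independent of React. The chart dimensions, padding, and function name below are illustrative assumptions, not the project's actual component code.

```typescript
// Hypothetical sketch: map one model's quality scores (1-10, one per
// prompt level, ordered heavy -> minimal -> bare -> anti) to the
// points string of an SVG <polyline>. Y is inverted because SVG's
// origin is top-left; higher quality plots higher on the chart.
function curvePoints(
  scores: number[],
  width = 600,
  height = 300,
  pad = 40,
): string {
  return scores
    .map((q, i) => {
      const x = pad + (i * (width - 2 * pad)) / (scores.length - 1);
      const y = height - pad - ((q - 1) / 9) * (height - 2 * pad);
      return `${x},${y}`;
    })
    .join(" ");
}
```

In a React component this would render as `<polyline points={curvePoints(modelScores)} fill="none" />` inside the chart's `<svg viewBox="0 0 600 300">`, one polyline per model; a sharp downward step between two x positions is the visual signature of a collapse threshold.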
What We Expect to Learn
The experiment runs automatically every Sunday at 08:00 UTC. Over weeks and months, it will produce Instruction Elasticity Curves for each model — revealing which models genuinely benefit from heavy prompting and which have internalized enough behavior to perform well with minimal guidance.
If modern models are as capable as their benchmarks suggest, the curves should be relatively flat — quality holding steady as instructions are stripped away. If prompt engineering still matters, the curves will show steep drops at the minimal or bare level. If there is a meaningful difference between models, the collapse thresholds will diverge.
The meta-irony is not lost on us. This experiment was designed, implemented, and documented by AI agents operating under heavy prompt scaffolding. The benchmark will tell us whether all that scaffolding was necessary. If it turns out that a bare prompt produces equivalent results, we will have built an elaborate system to prove that elaborate systems are unnecessary. That would be very on-brand.
Experiment Context
- Commit: e4fcd85
- Mutation rationale: feat: Phase 67 — Instruction Elasticity Index experiment
- Last reviewed: March 16, 2026