
The Claude Code Hackathon Winner: Eval-Driven Development with Everything Claude Code

How affaan-m's everything-claude-code collection won the Cerebral Valley x Anthropic Hackathon with eval-driven development — and how you can use these Claude Skills to build faster, with fewer bugs.

Claude Skills Team · March 9, 2026 · 9 min read
#claude-skills #eval-driven-development #hackathon #claude-code #software-engineering

In February 2026, a developer named affaan-m won the Cerebral Valley x Anthropic Hackathon with a collection of Claude Skills that has since accumulated over 68,000 stars on GitHub. The project — everything-claude-code — isn't just a set of tools. It's a complete methodology for AI-assisted software development built around a concept called eval-driven development (EDD).

This is the story of how that collection was built in 8 hours, what problems it solves, and how you can use it today to build faster with fewer bugs.

The Hackathon: 8 Hours, $15,000 Prize

The Cerebral Valley x Anthropic Hackathon brought together teams to push Claude Code to its limits. affaan-m's submission stood out not for exotic features but for engineering discipline: a systematic methodology for human-AI collaboration that produced measurably better software.

The results speak for themselves:

  • 65% faster feature completion compared to traditional AI-assisted workflows
  • 75% reduction in code review issues on the first pass
  • 8 hours of build time for a production-ready developer toolchain

The judges recognized something important: the value wasn't in making Claude do more. It was in making Claude predictable, verifiable, and incrementally trustworthy.

What Is Eval-Driven Development?

Traditional test-driven development (TDD) asks: "Does this code produce the correct output for these inputs?" Eval-driven development asks a different question: "How good is this output, and how do I know?"

For AI-generated code, correctness is table stakes. The real challenge is quality across dimensions that tests don't capture: readability, maintainability, security posture, performance characteristics, and alignment with architectural intent.

The EDD cycle looks like this:

  1. Define evals first — before writing a single line of code, specify how you'll measure success across multiple dimensions
  2. Implement against evals — use Claude to generate code, guided by the eval criteria
  3. Run verification loops — automated checks that validate against the defined criteria
  4. Iterate with evidence — every change is justified by eval scores, not intuition

This creates a feedback loop where Claude's outputs are constantly calibrated against objective criteria rather than subjective code review.
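The four-step cycle above can be sketched as a small loop: define criteria and thresholds up front, generate against them, score, and feed failing dimensions back as evidence for the next iteration. This is an illustrative sketch, not code from the collection; the `edd_cycle` and `run_evals` names are invented for this example.

```python
# Minimal eval-driven development loop. All names here are illustrative.

def run_evals(output, criteria):
    """Score an output against each eval dimension (0.0 to 1.0)."""
    return {name: scorer(output) for name, scorer in criteria.items()}

def edd_cycle(generate, criteria, thresholds, max_iters=5):
    """1. Evals are defined (criteria/thresholds) before any code exists."""
    feedback = None
    for _ in range(max_iters):
        output = generate(feedback)            # 2. implement against evals
        scores = run_evals(output, criteria)   # 3. run verification loops
        failing = {k: v for k, v in scores.items() if v < thresholds[k]}
        if not failing:
            return output, scores
        feedback = failing                     # 4. iterate with evidence
    raise RuntimeError(f"Evals still failing after {max_iters} iterations: {failing}")
```

Each change to the output is justified by a concrete score delta rather than a reviewer's intuition, which is the core of the methodology.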

The everything-claude-code Skill Collection

The collection ships 14 skills, each targeting a specific phase of the EDD workflow. Here are the most impactful ones:

tdd-workflow — The Discipline Enforcer

This skill implements a strict RED-GREEN-REFACTOR cycle adapted for AI collaboration. Before Claude writes any implementation code, it must first write a failing test. This sounds simple but is surprisingly hard to enforce without the skill's explicit protocol.

The skill includes guardrails that prevent common rationalization patterns — "this is too simple to test," "the test is obvious," "let me just verify it works" — all failure modes that undermine TDD discipline when working with AI.
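The enforced order looks like this in miniature. The `slugify` feature below is a made-up example, not something from the collection: the test exists first (RED, it fails because nothing is implemented), then the minimal implementation makes it pass (GREEN).

```python
import re

# RED: the test is written before any implementation exists.
# Running it at this point fails with a NameError, which is the point.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# GREEN: the minimal implementation that makes the test pass.
def slugify(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # collapse non-alphanumerics
    return text.strip("-")

test_slugify()  # passes only after the GREEN step
```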

eval-harness — Measuring What Matters

The eval harness skill generates evaluation frameworks tailored to your specific domain. For a REST API, it creates evals for response latency, schema compliance, error handling coverage, and security headers. For a React component, it evaluates accessibility, rendering performance, and prop type safety.

The key insight: different code domains require different quality dimensions. A one-size-fits-all eval misses domain-specific failure modes. The skill templates its output based on what it detects in your codebase.
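One plausible shape for such domain-tailored eval definitions is a per-domain table of dimensions with thresholds and directions. The structure below is hypothetical; the actual skill's output format may differ.

```python
# Hypothetical eval definitions per code domain. "max" means the measured
# value must not exceed the threshold; "min" means it must meet it.
REST_API_EVALS = {
    "latency_p95_ms":    {"threshold": 200, "direction": "max"},
    "schema_compliance": {"threshold": 1.0, "direction": "min"},
    "error_coverage":    {"threshold": 0.9, "direction": "min"},
    "security_headers":  {"threshold": 1.0, "direction": "min"},
}

REACT_COMPONENT_EVALS = {
    "a11y_violations":  {"threshold": 0,   "direction": "max"},
    "render_ms":        {"threshold": 16,  "direction": "max"},
    "prop_type_safety": {"threshold": 1.0, "direction": "min"},
}

def passes(name, value, evals):
    spec = evals[name]
    if spec["direction"] == "max":
        return value <= spec["threshold"]
    return value >= spec["threshold"]
```

Note how the two tables measure entirely different things: an API cares about latency and headers, a component about accessibility and render time. That is the domain-specific tailoring the skill automates.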

verification-loop — Trust But Verify

After Claude generates code, the verification loop skill runs a structured checklist before any output is marked complete. This includes:

  • Syntax and type checking
  • Unit test execution
  • Lint and formatting
  • Security scan (for common patterns like injection vulnerabilities, hardcoded credentials)
  • Performance profiling baseline

The skill refuses to mark work complete until each check passes or a documented exception is recorded. This was the single biggest contributor to the 75% reduction in code review issues — most issues were caught before the review ever happened.
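A verification loop of this kind can be wired as a sequence of subprocess checks that either all pass or block completion. The commands below (`mypy`, `pytest`, `ruff`, `bandit`) are common tools chosen for illustration; they are not necessarily what the skill invokes.

```python
import subprocess

# Illustrative checklist: name -> command. Swap in your project's tools.
CHECKS = [
    ("types",    ["mypy", "."]),
    ("tests",    ["pytest", "-q"]),
    ("lint",     ["ruff", "check", "."]),
    ("security", ["bandit", "-r", "-q", "."]),
]

def verify(checks=CHECKS, exceptions=()):
    """Run every check; work is complete only if none fail.
    Checks listed in `exceptions` are documented skips."""
    failures = []
    for name, cmd in checks:
        if name in exceptions:  # documented exception: skip this check
            continue
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failures.append(name)
    return {"complete": not failures, "failures": failures}
```

The important property is the refusal: a nonzero exit code from any check means the work stays open, mirroring how the skill blocks "complete" status until each item passes or is explicitly excepted.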

strategic-compact — Staying Within Context

One of the less-obvious skills in the collection solves a practical problem: long development sessions accumulate context that degrades Claude's output quality. The strategic-compact skill implements a structured summarization protocol that preserves critical context (decisions made, constraints discovered, architectural choices) while discarding low-signal history.

Teams using this skill reported being able to sustain productive sessions 3x longer before experiencing the quality degradation that typically comes from context overflow.
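The compaction idea can be sketched as a filter over session history: entries tagged as high-signal (decisions, constraints, architecture) survive, while the rest is collapsed into a one-line summary. The tagging scheme here is an assumption for illustration, not the skill's actual protocol.

```python
# Sketch of strategic compaction: keep high-signal context, drop the rest.
HIGH_SIGNAL = {"decision", "constraint", "architecture"}

def compact(history):
    kept = [e for e in history if e["tag"] in HIGH_SIGNAL]
    dropped = len(history) - len(kept)
    kept.append({"tag": "summary",
                 "text": f"{dropped} low-signal entries compacted"})
    return kept
```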

Real-World Impact: The Numbers

affaan-m shared detailed metrics from the hackathon submission and subsequent production usage:

Before EDD (traditional AI-assisted workflow):

  • Average time from spec to PR-ready: 4.2 hours for a medium feature
  • First-pass PR approval rate: ~25% (most PRs needed at least one revision cycle)
  • Time spent on code review fixes: ~40% of total feature time

After EDD with everything-claude-code:

  • Average time from spec to PR-ready: 1.5 hours for a medium feature
  • First-pass PR approval rate: ~68%
  • Time spent on code review fixes: ~12% of total feature time

The 65% time reduction is real, but the more important number is the PR approval rate. High first-pass approval means less context switching, less back-and-forth, and crucially, less time where a feature is blocked waiting for review.

How to Get Started

The everything-claude-code collection is available on Claude Skills Hub. Here's the recommended onboarding sequence:

Step 1: Install the Collection

# Download from Claude Skills Hub
# Then install to global skills directory
cp -r everything-claude-code ~/.claude/skills/

Step 2: Start with One Skill

Don't try to adopt EDD wholesale on your first day. Start with tdd-workflow on your next feature. Get comfortable with the RED-GREEN-REFACTOR discipline before adding the eval harness.

# In your Claude Code session
"I'm implementing [feature]. Let's use the TDD workflow."

Claude will automatically detect and load the tdd-workflow skill, then guide you through the cycle.

Step 3: Add Evals on Your Second Feature

Once TDD feels natural, introduce eval-harness on a second feature. Define your evals at the start:

"Before we start implementing, let's define our evaluation criteria for this authentication module."

The eval harness skill will generate a structured evaluation framework. Save it to a file — you'll reference it throughout development and in code review.
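Saving the framework might look like the snippet below. The JSON shape, file path, and criteria names are all hypothetical; use whatever format the skill actually emits.

```python
import json
from pathlib import Path

# Hypothetical eval framework for an authentication module.
evals = {
    "module": "authentication",
    "criteria": {
        "password_hashing": {"threshold": 1.0},
        "session_expiry":   {"threshold": 1.0},
        "rate_limiting":    {"threshold": 0.9},
    },
}

# Persist it so both development sessions and reviewers reference
# the same criteria.
path = Path("evals/authentication.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(evals, indent=2))
```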

Step 4: Enable the Verification Loop

The verification loop integrates most smoothly once you have evals defined. Enable it and watch how much it catches before code review.

Step 5: Add strategic-compact for Long Sessions

For sessions expected to run more than an hour, enable strategic-compact at the start. It will proactively manage context to maintain output quality throughout.

Why This Won the Hackathon

The judges at the Cerebral Valley x Anthropic Hackathon were looking for submissions that demonstrated the potential of Claude Code beyond simple task completion. What made everything-claude-code stand out was its focus on systematic quality, not just speed.

As one judge put it: "Everyone is using AI to write code faster. The real question is whether that code is better. This submission answered that question."

The eval-driven development methodology solves a fundamental tension in AI-assisted development: moving fast versus moving right. By building verification into the workflow at every step, EDD makes fast and right the same thing.

The Broader Ecosystem

The everything-claude-code collection doesn't exist in isolation. It was designed to complement Claude's existing skill ecosystem:

  • superpowers collection: Provides the meta-skills (brainstorming, systematic debugging) that EDD's verification loops call out to
  • skill-creator: Used to extend the eval harness with domain-specific criteria
  • mcp-builder: Integrates external quality tools (SonarQube, Snyk, DataDog) into the verification loop

The combination is particularly powerful: superpowers gives you the thinking frameworks, everything-claude-code gives you the implementation discipline, and MCP integrations give you external signal.

Looking Forward

The 68,000 stars everything-claude-code has accumulated in its first weeks tell a story about developer appetite for principled AI workflows. The community is hungry for more than autocomplete at scale — developers want systems that produce reliable, reviewable, maintainable code.

Eval-driven development is one answer. It won't be the last.

If you want to explore the collection, you'll find all 14 skills on Claude Skills Hub. Start with tdd-workflow. The rest will follow naturally.


The everything-claude-code collection by affaan-m is available on Claude Skills Hub. Source on GitHub.
