
The Claude Code Hackathon Winner: Eval-Driven Development with Everything Claude Code

How affaan-m's everything-claude-code collection won the Cerebral Valley x Anthropic Hackathon with eval-driven development — and how you can use these Claude Skills to build faster, with fewer bugs.

Claude Skills Team · March 9, 2026 · 9 min read
#claude-skills #eval-driven-development #hackathon #claude-code #software-engineering

In February 2026, a developer named affaan-m won the Cerebral Valley x Anthropic Hackathon with a collection of Claude Skills that has since accumulated over 68,000 stars on GitHub. The project — everything-claude-code — isn't just a set of tools. It's a complete methodology for AI-assisted software development built around a concept called eval-driven development (EDD).

This is the story of how that collection was built in 8 hours, what problems it solves, and how you can use it today to build faster with fewer bugs.

The Hackathon: 8 Hours, $15,000 Prize

The Cerebral Valley x Anthropic Hackathon brought together teams to push Claude Code to its limits. affaan-m's submission stood out not for exotic features but for engineering discipline: a systematic methodology for human-AI collaboration that produced measurably better software.

The results speak for themselves:

  • 65% faster feature completion compared to traditional AI-assisted workflows
  • 75% reduction in code review issues on the first pass
  • 8 hours of build time for a production-ready developer toolchain

The judges recognized something important: the value wasn't in making Claude do more. It was in making Claude predictable, verifiable, and incrementally trustworthy.

What Is Eval-Driven Development?

Traditional test-driven development (TDD) asks: "Does this code produce the correct output for these inputs?" Eval-driven development asks a different question: "How good is this output, and how do I know?"

For AI-generated code, correctness is table stakes. The real challenge is quality across dimensions that tests don't capture: readability, maintainability, security posture, performance characteristics, and alignment with architectural intent.

The EDD cycle looks like this:

  1. Define evals first — before writing a single line of code, specify how you'll measure success across multiple dimensions
  2. Implement against evals — use Claude to generate code, guided by the eval criteria
  3. Run verification loops — automated checks that validate against the defined criteria
  4. Iterate with evidence — every change is justified by eval scores, not intuition

This creates a feedback loop where Claude's outputs are constantly calibrated against objective criteria rather than subjective code review.
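The four-step cycle above can be sketched as a small loop: define criteria and thresholds up front, generate against them, score, and feed failing dimensions back as evidence for the next iteration. This is an illustrative sketch, not code from the collection; the `edd_cycle` and `run_evals` names are invented for this example.

```python
# Minimal eval-driven development loop. All names here are illustrative.

def run_evals(output, criteria):
    """Score an output against each eval dimension (0.0 to 1.0)."""
    return {name: scorer(output) for name, scorer in criteria.items()}

def edd_cycle(generate, criteria, thresholds, max_iters=5):
    """1. Evals are defined (criteria/thresholds) before any code exists."""
    feedback = None
    for _ in range(max_iters):
        output = generate(feedback)            # 2. implement against evals
        scores = run_evals(output, criteria)   # 3. run verification loops
        failing = {k: v for k, v in scores.items() if v < thresholds[k]}
        if not failing:
            return output, scores
        feedback = failing                     # 4. iterate with evidence
    raise RuntimeError(f"Evals still failing after {max_iters} iterations: {failing}")
```

Each change to the output is justified by a concrete score delta rather than a reviewer's intuition, which is the core of the methodology.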

The everything-claude-code Skill Collection

The collection ships 14 skills, each targeting a specific phase of the EDD workflow. Here are the most impactful ones:

tdd-workflow — The Discipline Enforcer

This skill implements a strict RED-GREEN-REFACTOR cycle adapted for AI collaboration. Before Claude writes any implementation code, it must first write a failing test. This sounds simple but is surprisingly hard to enforce without the skill's explicit protocol.

The skill includes guardrails that prevent common rationalization patterns — "this is too simple to test," "the test is obvious," "let me just verify it works" — all failure modes that undermine TDD discipline when working with AI.
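The enforced order looks like this in miniature. The `slugify` feature below is a made-up example, not something from the collection: the test exists first (RED, it fails because nothing is implemented), then the minimal implementation makes it pass (GREEN).

```python
import re

# RED: the test is written before any implementation exists.
# Running it at this point fails with a NameError, which is the point.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# GREEN: the minimal implementation that makes the test pass.
def slugify(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # collapse non-alphanumerics
    return text.strip("-")

test_slugify()  # passes only after the GREEN step
```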

eval-harness — Measuring What Matters

The eval harness skill generates evaluation frameworks tailored to your specific domain. For a REST API, it creates evals for response latency, schema compliance, error handling coverage, and security headers. For a React component, it evaluates accessibility, rendering performance, and prop type safety.

The key insight: different code domains require different quality dimensions. A one-size-fits-all eval misses domain-specific failure modes. The skill templates its output based on what it detects in your codebase.
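One plausible shape for such domain-tailored eval definitions is a per-domain table of dimensions with thresholds and directions. The structure below is hypothetical; the actual skill's output format may differ.

```python
# Hypothetical eval definitions per code domain. "max" means the measured
# value must not exceed the threshold; "min" means it must meet it.
REST_API_EVALS = {
    "latency_p95_ms":    {"threshold": 200, "direction": "max"},
    "schema_compliance": {"threshold": 1.0, "direction": "min"},
    "error_coverage":    {"threshold": 0.9, "direction": "min"},
    "security_headers":  {"threshold": 1.0, "direction": "min"},
}

REACT_COMPONENT_EVALS = {
    "a11y_violations":  {"threshold": 0,   "direction": "max"},
    "render_ms":        {"threshold": 16,  "direction": "max"},
    "prop_type_safety": {"threshold": 1.0, "direction": "min"},
}

def passes(name, value, evals):
    spec = evals[name]
    if spec["direction"] == "max":
        return value <= spec["threshold"]
    return value >= spec["threshold"]
```

Note how the two tables measure entirely different things: an API cares about latency and headers, a component about accessibility and render time. That is the domain-specific tailoring the skill automates.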

verification-loop — Trust But Verify

After Claude generates code, the verification loop skill runs a structured checklist before any output is marked complete. This includes:

  • Syntax and type checking
  • Unit test execution
  • Lint and formatting
  • Security scan (for common patterns like injection vulnerabilities, hardcoded credentials)
  • Performance profiling baseline

The skill refuses to mark work complete until each check passes or a documented exception is recorded. This was the single biggest contributor to the 75% reduction in code review issues — most issues were caught before the review ever happened.
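A verification loop of this kind can be wired as a sequence of subprocess checks that either all pass or block completion. The commands below (`mypy`, `pytest`, `ruff`, `bandit`) are common tools chosen for illustration; they are not necessarily what the skill invokes.

```python
import subprocess

# Illustrative checklist: name -> command. Swap in your project's tools.
CHECKS = [
    ("types",    ["mypy", "."]),
    ("tests",    ["pytest", "-q"]),
    ("lint",     ["ruff", "check", "."]),
    ("security", ["bandit", "-r", "-q", "."]),
]

def verify(checks=CHECKS, exceptions=()):
    """Run every check; work is complete only if none fail.
    Checks listed in `exceptions` are documented skips."""
    failures = []
    for name, cmd in checks:
        if name in exceptions:  # documented exception: skip this check
            continue
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failures.append(name)
    return {"complete": not failures, "failures": failures}
```

The important property is the refusal: a nonzero exit code from any check means the work stays open, mirroring how the skill blocks "complete" status until each item passes or is explicitly excepted.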

strategic-compact — Staying Within Context

One of the less-obvious skills in the collection solves a practical problem: long development sessions accumulate context that degrades Claude's output quality. The strategic-compact skill implements a structured summarization protocol that preserves critical context (decisions made, constraints discovered, architectural choices) while discarding low-signal history.

Teams using this skill reported being able to sustain productive sessions 3x longer before experiencing the quality degradation that typically comes from context overflow.
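The compaction idea can be sketched as a filter over session history: entries tagged as high-signal (decisions, constraints, architecture) survive, while the rest is collapsed into a one-line summary. The tagging scheme here is an assumption for illustration, not the skill's actual protocol.

```python
# Sketch of strategic compaction: keep high-signal context, drop the rest.
HIGH_SIGNAL = {"decision", "constraint", "architecture"}

def compact(history):
    kept = [e for e in history if e["tag"] in HIGH_SIGNAL]
    dropped = len(history) - len(kept)
    kept.append({"tag": "summary",
                 "text": f"{dropped} low-signal entries compacted"})
    return kept
```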

Real-World Impact: The Numbers

affaan-m shared detailed metrics from the hackathon submission and subsequent production usage:

Before EDD (traditional AI-assisted workflow):

  • Average time from spec to PR-ready: 4.2 hours for a medium feature
  • First-pass PR approval rate: ~25% (most PRs needed at least one revision cycle)
  • Time spent on code review fixes: ~40% of total feature time

After EDD with everything-claude-code:

  • Average time from spec to PR-ready: 1.5 hours for a medium feature
  • First-pass PR approval rate: ~68%
  • Time spent on code review fixes: ~12% of total feature time

The 65% time reduction is real, but the more important number is the PR approval rate. High first-pass approval means less context switching, less back-and-forth, and crucially, less time where a feature is blocked waiting for review.

How to Get Started

The everything-claude-code collection is available on Claude Skills Hub. Here's the recommended onboarding sequence:

Step 1: Install the Collection

# Download from Claude Skills Hub
# Then install to global skills directory
cp -r everything-claude-code ~/.claude/skills/

Step 2: Start with One Skill

Don't try to adopt EDD wholesale on your first day. Start with tdd-workflow on your next feature. Get comfortable with the RED-GREEN-REFACTOR discipline before adding the eval harness.

# In your Claude Code session
"I'm implementing [feature]. Let's use the TDD workflow."

Claude will automatically detect and load the tdd-workflow skill, then guide you through the cycle.

Step 3: Add Evals on Your Second Feature

Once TDD feels natural, introduce eval-harness on a second feature. Define your evals at the start:

"Before we start implementing, let's define our evaluation criteria for this authentication module."

The eval harness skill will generate a structured evaluation framework. Save it to a file — you'll reference it throughout development and in code review.
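Saving the framework might look like the snippet below. The JSON shape, file path, and criteria names are all hypothetical; use whatever format the skill actually emits.

```python
import json
from pathlib import Path

# Hypothetical eval framework for an authentication module.
evals = {
    "module": "authentication",
    "criteria": {
        "password_hashing": {"threshold": 1.0},
        "session_expiry":   {"threshold": 1.0},
        "rate_limiting":    {"threshold": 0.9},
    },
}

# Persist it so both development sessions and reviewers reference
# the same criteria.
path = Path("evals/authentication.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(evals, indent=2))
```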

Step 4: Enable the Verification Loop

The verification loop integrates most smoothly once you have evals defined. Enable it and watch how much it catches before code review.

Step 5: Add strategic-compact for Long Sessions

For sessions expected to run more than an hour, enable strategic-compact at the start. It will proactively manage context to maintain output quality throughout.

Why This Won the Hackathon

The judges at the Cerebral Valley x Anthropic Hackathon were looking for submissions that demonstrated the potential of Claude Code beyond simple task completion. What made everything-claude-code stand out was its focus on systematic quality, not just speed.

As one judge put it: "Everyone is using AI to write code faster. The real question is whether that code is better. This submission answered that question."

The eval-driven development methodology solves a fundamental tension in AI-assisted development: moving fast versus moving right. By building verification into the workflow at every step, EDD makes fast and right the same thing.

The Broader Ecosystem

The everything-claude-code collection doesn't exist in isolation. It was designed to complement Claude's existing skill ecosystem:

  • superpowers collection: Provides the meta-skills (brainstorming, systematic debugging) that EDD's verification loops call out to
  • skill-creator: Used to extend the eval harness with domain-specific criteria
  • mcp-builder: Integrates external quality tools (SonarQube, Snyk, DataDog) into the verification loop

The combination is particularly powerful: superpowers gives you the thinking frameworks, everything-claude-code gives you the implementation discipline, and MCP integrations give you external signal.

Looking Forward

The 68,000 stars everything-claude-code has accumulated in its first weeks tell a story about developer appetite for principled AI workflows. The community is hungry for more than autocomplete at scale — developers want systems that produce reliable, reviewable, maintainable code.

Eval-driven development is one answer. It won't be the last.

If you want to explore the collection, you'll find all 14 skills on Claude Skills Hub. Start with tdd-workflow. The rest will follow naturally.


The everything-claude-code collection by affaan-m is available on Claude Skills Hub. Source on GitHub.
