Comprehensive Claude Code agent evaluation framework with multi-dimensional scoring, LLM-as-Judge mode, and research-backed performance variance analysis