AgentPulse

AI model benchmark data for agents. By MakerPulse.

AgentPulse benchmarks AI models on real-world tasks: writing emails, summarizing documents, planning trips, creative fiction. 28 prompts across 6 tracks, evaluated by three independent AI evaluators, each from a different provider. Blinded, evidence-backed, with automated pre-checks for objective constraints.

Endpoints

GET /agentpulse/v1/text-models/latest.json LIVE

Benchmark scores for text/LLM models. Task and creative composite scores, per-track breakdowns, hallucination rates, latency, and cost.

GET /agentpulse/v1/image-generation/latest.json LIVE

Pricing, uptime monitoring, and latency data for image generation services across 7 providers.

GET /agentpulse/v1/changes.json WEEKLY

Change feed: new models benchmarked, score updates, pricing changes, and deprecations from the last 30 days.

GET /.well-known/agent-card.json

A2A-compatible agent card describing AgentPulse capabilities.

Example

curl https://data.makerpulse.ai/agentpulse/v1/text-models/latest.json
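The same request can be sketched in Python with only the standard library. The helper assumes nothing beyond the endpoint returning a JSON object; no field names are assumed:

```python
import json
import urllib.request

BASE_URL = "https://data.makerpulse.ai"

def endpoint_url(path: str) -> str:
    """Build the full URL for an AgentPulse endpoint path."""
    return BASE_URL + "/" + path.lstrip("/")

def fetch(path: str = "/agentpulse/v1/text-models/latest.json") -> dict:
    """Fetch and parse one of the JSON endpoints (no auth required)."""
    with urllib.request.urlopen(endpoint_url(path)) as resp:
        return json.load(resp)

# Usage (network required):
#   data = fetch()
#   print(sorted(data))  # inspect top-level fields
```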

Benchmark Tracks

Everyday Writing (P1-P4) — Professional emails, social media posts, personal correspondence

Comprehension & Extraction (P5-P8) — Summarization, structured data extraction, technical explanation

Reasoning & Planning (P9-P12) — Trip planning, decision analysis, prioritization, ethical reasoning

Professional Communication (P13-P16) — Meeting notes, cover letters, incident reports, executive summaries

Creativity & Human-Likeness (P17-P21) — Constrained fiction, poetry, voice mimicry, sustained metaphor

Creative (Open-Ended) (P22-P28) — Literary fiction, sci-fi, horror, unreliable narrator, comedy, micro-fiction

Scoring

Task Prompts (P1-P21) — 4 Dimensions

Creative Prompts (P22-P28) — 5 Dimensions

All dimensions scored 1.0–5.0 in 0.1 increments. Every score requires evidence citing specific text from the response.

Evaluation

Every response is evaluated in parallel by three independent AI evaluators, blinded to model identity.

Scores are averaged across all three evaluators. Inter-rater reliability and self-bias detection are computed on every run. Automated pre-checks verify objective constraints (word counts, banned phrases, JSON validity) before subjective evaluation.

Methodology

Full methodology (v2.3) is published under CC-BY-4.0: github.com/Arithrix/agentpulse-data

Built by MakerPulse. Free, no auth required.