AgentPulse

AI model benchmark data for agents. By MakerPulse.

AgentPulse benchmarks AI models on real-world tasks: writing emails, summarizing documents, planning trips, creative fiction. 28 prompts across 6 tracks, evaluated by three independent AI evaluators, each from a different provider. Blinded, evidence-backed, with automated pre-checks for objective constraints.

Endpoints

GET /agentpulse/v1/text-models/latest.json LIVE

Benchmark scores for text/LLM models. Task and creative composite scores, per-track breakdowns, hallucination rates, latency, and cost.

GET /agentpulse/v1/image-generation/latest.json LIVE

Pricing, uptime monitoring, and latency data for image generation services across 7 providers.

GET /agentpulse/v1/changes.json WEEKLY

Change feed: new models benchmarked, score updates, pricing changes, and deprecations from the last 30 days.

GET /.well-known/agent-card.json

A2A-compatible agent card describing AgentPulse capabilities.

Example

curl https://data.makerpulse.ai/agentpulse/v1/text-models/latest.json
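The same request can be sketched in Python with only the standard library. The helper assumes nothing beyond the endpoint returning a JSON object; no field names are assumed:

```python
import json
import urllib.request

BASE_URL = "https://data.makerpulse.ai"

def endpoint_url(path: str) -> str:
    """Build the full URL for an AgentPulse endpoint path."""
    return BASE_URL + "/" + path.lstrip("/")

def fetch(path: str = "/agentpulse/v1/text-models/latest.json") -> dict:
    """Fetch and parse one of the JSON endpoints (no auth required)."""
    with urllib.request.urlopen(endpoint_url(path)) as resp:
        return json.load(resp)

# Usage (network required):
#   data = fetch()
#   print(sorted(data))  # inspect top-level fields
```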

Benchmark Tracks

Everyday Writing (P1-P4) — Professional emails, social media posts, personal correspondence

Comprehension & Extraction (P5-P8) — Summarization, structured data extraction, technical explanation

Reasoning & Planning (P9-P12) — Trip planning, decision analysis, prioritization, ethical reasoning

Professional Communication (P13-P16) — Meeting notes, cover letters, incident reports, executive summaries

Creativity & Human-Likeness (P17-P21) — Constrained fiction, poetry, voice mimicry, sustained metaphor

Creative (Open-Ended) (P22-P28) — Literary fiction, sci-fi, horror, unreliable narrator, comedy, micro-fiction

Scoring

Task Prompts (P1-P21) — 4 Dimensions

Creative Prompts (P22-P28) — 5 Dimensions

All dimensions scored 1.0–5.0 in 0.1 increments. Every score requires evidence citing specific text from the response.

Evaluation

Every response is evaluated in parallel by three independent AI evaluators, blinded to model identity.

Scores are averaged across all three evaluators. Inter-rater reliability and self-bias detection are computed on every run. Automated pre-checks verify objective constraints (word counts, banned phrases, JSON validity) before subjective evaluation.

Methodology

Full methodology (v2.3) is published under CC-BY-4.0: github.com/Arithrix/agentpulse-data

Built by MakerPulse. Free, no auth required.