Standing methodology

How we test AI agents

Last updated: May 18, 2026

This page describes how every review on The AI Agent Report is produced. It is intentionally specific so that operators, vendors, and other journalists can replicate or critique our process. We update this page whenever the protocol changes; the date at the top and the version note at the bottom identify the latest revision.

Evidence labels

Every vendor evaluation in a published review carries an evidence label that explains how the evaluation was conducted. Labels used:

Hands-on trial, paid account. We signed up, paid, and tested the product ourselves.
Vendor demo plus documentation review. We observed a vendor-led demo and reviewed public documentation.
Customer interview. We interviewed verified customers of the product.
Documentation only. Evaluation was based on public product documentation without hands-on testing.

We do not claim hands-on testing where hands-on testing did not happen. If an evidence label changes between review cycles, the update log records what changed and why.

Test environment

For every vertical we cover, we operate a single “reference business”: a real demo deployment with a published service menu, multiple providers, multiple rooms, a real calendar, and a production-grade booking platform appropriate for the vertical. For the 2026 medspa review, the reference clinic ran on a standard aesthetic-clinic PMS with Google Workspace as the calendar of record. The same reference business is used to onboard every vendor so that vendor differences are not contaminated by environment differences.

Onboarding

Where hands-on testing is conducted, we sign up for each vendor as an anonymous paying customer using a corporate card under a registered business entity unaffiliated with The AI Agent Report. We do not request expedited access, we do not identify ourselves as a publication, and we do not accept assistance from the vendor's sales engineers during the onboarding test. Onboarding time is measured from account creation to the first successful real-call booking and is reported in the review.

Call protocol

Our standing protocol runs the same fixed test plan against each vendor, covering four call types: simple booking, complex multi-service booking, reschedule, and information-only. Calls are distributed across multiple times of day, multiple US area codes, and multiple speaker profiles. Each call is recorded, and the recording is disclosed on the live call. The exact call counts and per-vendor audit logs for each published review are linked from the review itself.
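As an illustration only, a fixed plan of this kind could be generated as in the sketch below. The time slots, area codes, speaker profile names, seed, and call count are all hypothetical placeholders, not the values used in any published review; only the four call types come from the protocol above.

```python
import random
from itertools import cycle, product

# The four call types named in the protocol.
CALL_TYPES = [
    "simple booking",
    "complex multi-service booking",
    "reschedule",
    "information-only",
]

# Hypothetical placeholder values, chosen only for illustration.
TIME_SLOTS = ["morning", "midday", "evening"]
AREA_CODES = ["212", "312", "415"]
SPEAKER_PROFILES = ["profile_a", "profile_b"]


def build_test_plan(calls_per_type: int, seed: int = 0) -> list[dict]:
    """Build a deterministic plan that spreads calls across conditions."""
    conditions = list(product(TIME_SLOTS, AREA_CODES, SPEAKER_PROFILES))
    # A seeded shuffle mixes the conditions while keeping the plan
    # identical on every run, so each vendor receives the same calls.
    random.Random(seed).shuffle(conditions)
    condition_cycle = cycle(conditions)

    plan = []
    for _ in range(calls_per_type):
        for call_type in CALL_TYPES:
            slot, area_code, speaker = next(condition_cycle)
            plan.append(
                {
                    "call_type": call_type,
                    "time_slot": slot,
                    "area_code": area_code,
                    "speaker": speaker,
                }
            )
    return plan


plan = build_test_plan(calls_per_type=5)
print(len(plan))  # 20 calls: 5 per call type
```

Because the seed is fixed, calling `build_test_plan` again with the same arguments yields the same plan, which is the property that lets one plan be reused across vendors.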

Scoring rubric

Every call is scored on the following six dimensions, each on a 1–10 scale:

Voice quality. Naturalness, latency, prosody, interruption handling.
Booking accuracy. Whether the booking that landed on the calendar matches what a human reviewer would have booked, including service, provider, duration, and start time.
Vertical fit. Vocabulary, defaults, prompts, and escalation rules appropriate to the vertical (for medspas: tox, filler, microneedling, laser hair removal, IV therapy, weight management, consultations).
Integrations. Native depth and breadth of connections to the booking platforms, CRMs, and calendars relevant to the vertical.
Pricing transparency. Clarity of usage reporting, predictability of monthly cost, and presence of hard spend caps.
Compliance & support. Business Associate Agreement availability on the relevant plan tier, default AI-disclosure behavior, data-retention controls, documented response SLAs, and the quality of onboarding and ongoing vendor support observed during the test window.

Each call is scored by two reviewers, with the second reviewer blinded to vendor identity wherever the evidence format allows. Hallucinations (confidently stated incorrect facts) are tracked as a separate metric and also weighted into the vertical-fit score. The scoring schedule is finalized before any vendor is contacted.
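The sketch below shows one way a single call's two score sheets might be combined. It assumes the two reviewers' scores are averaged per dimension, which this page does not specify, and it only carries the hallucination count alongside the scores because the weighting into vertical fit is not published here. The dimension keys and function names are hypothetical.

```python
from statistics import mean

# The six rubric dimensions, as hypothetical snake_case keys.
DIMENSIONS = (
    "voice_quality",
    "booking_accuracy",
    "vertical_fit",
    "integrations",
    "pricing_transparency",
    "compliance_support",
)


def validate(scores: dict[str, int]) -> None:
    """Reject score sheets that skip a dimension or leave the 1-10 scale."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 1 <= scores[name] <= 10:
            raise ValueError(f"{name} must be between 1 and 10")


def combine(first: dict[str, int], second: dict[str, int],
            hallucinations: int) -> dict:
    """Average two reviewers per dimension (an assumed reconciliation rule)."""
    validate(first)
    validate(second)
    per_dimension = {d: mean((first[d], second[d])) for d in DIMENSIONS}
    # The hallucination count is reported separately; how it feeds into
    # vertical fit is not specified in the methodology, so it is not applied.
    return {"per_dimension": per_dimension, "hallucinations": hallucinations}


reviewer_a = dict.fromkeys(DIMENSIONS, 8)
reviewer_b = {**reviewer_a, "vertical_fit": 7}
result = combine(reviewer_a, reviewer_b, hallucinations=1)
print(result["per_dimension"]["vertical_fit"])  # 7.5
```

Validating before combining means a sheet with a missing dimension or an out-of-range value fails loudly instead of silently skewing an average.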

Conflicts of interest

Some vendors covered in our reviews participate in affiliate relationships with us; that participation is disclosed on our disclosure page and on each affected review. Affiliate participation is offered to vendors covered in a given review; a vendor may decline. Declining does not affect inclusion, ranking, or scoring. Affiliate terms may differ by vendor, but affiliate availability, payout amount, cookie window, and commercial terms never affect inclusion, score, ranking, recommended-pick status, or criticism. Rankings and scores are finalized and editorially locked before any commercial conversation begins.

Update cadence

Each flagship review is re-tested in full on a quarterly cycle. Between cycles, we run a lightweight monthly spot-check focused on booking accuracy and any vendor-announced platform changes. When a vendor ships a material product update (a new voice model, a new integration in our reference category, a pricing-tier change, or a documented change in hallucination behavior), we re-score the affected dimensions and update the review out of cycle, with a dated change-log entry at the bottom of the article.

Vendor right of reply

Before a review is published, we share the relevant factual section with the vendor for a strictly factual check within a fixed reply window. Vendors may flag factual inaccuracies (for example, a misstatement about an integration or a pricing plan), and we verify each flagged item and correct any confirmed error before publication. Vendors do not see scores, rankings, recommended-pick selection, or editorial commentary prior to publication, and the right of reply does not extend to those elements.

Corrections

We correct material errors in place and append a dated correction note. If a vendor ships an update that materially changes our scoring, we re-score the affected dimensions out of cycle, re-test in full on the next quarterly cycle, and note the change in the update log. Readers and vendors can submit suspected errors via the corrections page or by emailing the editor directly; we respond within five business days and target a published correction within ten business days of confirming the error.
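For readers who want to work out the two deadlines above, here is a minimal sketch. It assumes business days are Monday through Friday and ignores public holidays, which this page does not address; the function name and the dates are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumes a Monday-Friday week and no holiday calendar.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


# Report received on Monday, June 1, 2026: response due in five business days.
print(add_business_days(date(2026, 6, 1), 5))   # 2026-06-08

# Error confirmed on Wednesday, June 3, 2026: correction targeted
# within ten business days of confirmation.
print(add_business_days(date(2026, 6, 3), 10))  # 2026-06-17
```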

Methodology version 2026.05.