Methodology

How this is measured

An index is only as trustworthy as its method, so here is the whole thing, down to the arithmetic. Nothing about a run changes between months unless the method version changes with it, and a version change leaves a visible break in the timeline. The full prompt sets are published verbatim.

The questions

Each category has a fixed set of prompts, written the way people actually ask. They are built on two axes. The first is awareness, after Eugene Schwartz: how much the person already knows, from a vague problem ("I keep overspending and never know where my money goes") up to a named head-to-head ("YNAB vs Monarch"). The second is the job to be done: the distinct reasons someone reaches for a category, like budgeting on a shared account versus zero-based budgeting solo.

  • Problem-aware. Names a problem, no solution. "How do I stop overspending?"
  • Solution-aware. Knows the category, wants options. "What kinds of budgeting apps are there?"
  • Product-aware. The open recommendation questions. "Best budgeting app", "best free budgeting app". This is where the ranking comes from, because the assistant is free to name whatever it wants.
  • Most-aware. Already has a brand in mind. "Is YNAB worth it?" These seed a brand into the question, so they are read for sentiment and rivals, not counted as discovery.

The set is frozen. Editing a prompt creates a new version with a documented discontinuity, so a trend is never quietly broken. The AI-apps set is frozen as of 3 June 2026; budgeting and fitness are pending final sign-off.
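One way to make "frozen" checkable is to fingerprint the prompt set and store the fingerprint next to the method version. The sketch below is an illustration only: the source does not say how versions are tracked, and the function name `fingerprint` and the choice of SHA-256 over a JSON encoding are assumptions.

```python
import hashlib
import json


def fingerprint(prompts: list[str]) -> str:
    """Return a stable hash of a prompt set (hypothetical helper).

    Any edit to any prompt, or any reordering, changes the hash,
    which is the signal to bump the method version.
    """
    payload = json.dumps(prompts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two prompts quoted in the text above, used here as a toy set.
frozen = ["Best budgeting app", "Is YNAB worth it?"]
edited = ["Best budgeting app", "Is YNAB worth the price?"]

unchanged = fingerprint(frozen) == fingerprint(list(frozen))
changed = fingerprint(frozen) != fingerprint(edited)
```

If the stored fingerprint and the computed one disagree, the run belongs to a new version and the timeline should show the break.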

The assistants

Five answer surfaces, each run with web search on, because that is what a person actually gets. The models are the fast, cheap tier ("explore"), not the flagship one. That is a deliberate choice for a research preview that re-runs every month: the cost of the top tier is roughly ten times higher for a result that moves the leaders very little. A flagship tier ("fidelity"), closer to the exact model a logged-in consumer hits, is the planned upgrade if the index earns the spend. Every model ID is pinned and logged below.

Assistant  Model (explore tier)   Grounding                               Samples
ChatGPT    gpt-5.4-mini           Responses API, web_search tool          34
Claude     claude-haiku-4-5       Messages API, web_search_20250305 tool  34
Gemini     gemini-3.1-flash-lite  google_search grounding tool            34
Google AI  AI Overviews           AI Overview, captured from the page     6
Mistral    mistral-small-latest   Agents API, web-search connector        34

Google AI Overviews is the odd one out: there is no API, so it is captured from the search page itself, for a six-prompt subset rather than the full set. Treat that column as directional. When a vendor updates the default model, that is logged as an event, because a shift in the answers often traces to the model rather than to anything the apps did.

Scoring

Two numbers, both computed in code, neither asked of a model. For a given app and assistant, the rate is the share of that assistant's answers that named the app, as a percentage:

rate(app, assistant) = answers naming the app / total answers × 100

The score is the mean of those rates across the grounded assistants, so an app that leads on one engine but is absent on the rest cannot top the board:

score(app) = mean over grounded assistants of rate(app, assistant)

Google's six-prompt capture is included in that mean as one engine. A score of 47 means: on average, across the assistants, 47% of answers named the app.
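The two formulas can be sketched in a few lines. This is a minimal illustration, not the index's actual code: the data layout (one set of named apps per answer) and the assistant and app labels are assumptions made up for the example.

```python
from statistics import mean


def rate(answers: list[set[str]], app: str) -> float:
    """Percentage of one assistant's answers that named the app."""
    if not answers:
        raise ValueError("an assistant needs at least one answer")
    return 100 * sum(app in named for named in answers) / len(answers)


def score(answers_by_assistant: dict[str, list[set[str]]], app: str) -> float:
    """Unweighted mean of the per-assistant rates."""
    return mean(rate(answers, app) for answers in answers_by_assistant.values())


# Hypothetical toy data: each set holds the apps named in one answer.
toy = {
    "assistant_1": [{"X"}, {"X", "Y"}, {"X"}, {"Y"}],   # X in 3 of 4 answers
    "assistant_2": [{"Y"}, {"X"}, set(), {"Y"}],        # X in 1 of 4 answers
    "assistant_3": [{"X"}, {"Y"}],                      # X in 1 of 2 answers
}

x_rate = rate(toy["assistant_1"], "X")   # 75.0
x_score = score(toy, "X")                # mean of 75, 25 and 50 = 50.0
```

Because the mean is unweighted, an assistant with few answers (here `assistant_3`, in the index the six-prompt Google capture) carries the same weight as one with the full set.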

Sampling: why each question runs once

AI answers are not deterministic. The same question can return a different list twice in a row. The averaging that smooths this out comes from the number of questions, not from repeating each one. A category is thirty-odd frozen prompts, and that spread is what makes a run stable, the way a survey is stable because it has many respondents, not because it asks one person five times. So each question runs once per assistant (k=1), and an app that surfaces in only a handful of answers is fragile, which is itself worth knowing.

I tested that one sample is enough rather than assuming it. Fitness was run three times per assistant, and every run reproduced the same leaders as the average; only the exact rates and the long tail moved. That calibration is written up in "Asking more didn't change the answer".
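A check of that kind can be sketched as follows. The numbers and app labels are invented for illustration and are not results from the index; treating "leaders" as the top three by score, compared as a set rather than in order, is also an assumption.

```python
from statistics import mean


def leaders(scores: dict[str, float], n: int = 3) -> set[str]:
    """Top-n apps by score; ties are broken alphabetically."""
    ranked = sorted(scores, key=lambda app: (-scores[app], app))
    return set(ranked[:n])


def same_leaders(runs: list[dict[str, float]], n: int = 3) -> bool:
    """True if every run has the same top-n set as the average of all runs."""
    apps = set().union(*runs)
    averaged = {app: mean(run.get(app, 0.0) for run in runs) for app in apps}
    target = leaders(averaged, n)
    return all(leaders(run, n) == target for run in runs)


# Hypothetical scores from three repeated runs of one category.
stable_runs = [
    {"A": 60.0, "B": 45.0, "C": 30.0, "D": 5.0},
    {"A": 55.0, "B": 50.0, "C": 28.0, "D": 9.0},
    {"A": 62.0, "B": 41.0, "C": 33.0, "D": 0.0},
]
# Same runs, but the last one swaps C and D near the cut-off.
unstable_runs = stable_runs[:2] + [{"A": 62.0, "B": 41.0, "C": 10.0, "D": 40.0}]

stable = same_leaders(stable_runs)       # True
unstable = same_leaders(unstable_runs)   # False
```

In the stable case the rates move from run to run while the top of the board does not, which is the pattern described above.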

What gets counted, and how

Each answer is parsed against a known roster of apps by code, using each app's name plus a list of aliases. Three rules keep the count honest:

  • Prompt-seeded exclusion. An app named in the question itself ("Is YNAB worth it?", "YNAB vs Monarch") does not count as the assistant discovering it. Otherwise a branded question would inflate the brand it mentions.
  • Collision handling. Names that are ordinary words or bigger brands ("Strong", "Copilot", "Empower") are matched case-sensitively and checked in context, so the adjective "strong" is not counted as the app Strong. These rows are flagged.
  • Dead-entity tracking. A few discontinued apps are kept on the roster on purpose (Mint, shut down 2024; Bard, renamed 2024). Whether an assistant still recommends something that no longer exists is one of the more revealing things this measures.
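The first two rules can be sketched as a small matcher. This is an assumption-laden illustration rather than the index's parser: the roster contents, the alias lists, and the names `ROSTER`, `CASE_SENSITIVE`, `mentions` and `count_apps` are all hypothetical, and the "checked in context" step is not implemented here, only the case-sensitive match.

```python
import re

# Hypothetical roster: canonical app name -> aliases to search for.
ROSTER = {
    "YNAB": ["YNAB", "You Need a Budget"],
    "Monarch": ["Monarch", "Monarch Money"],
    "Strong": ["Strong"],
}
# Names that collide with ordinary words are matched case-sensitively.
CASE_SENSITIVE = {"Strong"}


def mentions(text: str, app: str) -> bool:
    """True if any alias of the app appears in the text as a whole word."""
    flags = 0 if app in CASE_SENSITIVE else re.IGNORECASE
    return any(
        re.search(rf"\b{re.escape(alias)}\b", text, flags)
        for alias in ROSTER[app]
    )


def count_apps(prompt: str, answer: str) -> set[str]:
    """Apps named in the answer, minus any app the prompt itself named."""
    seeded = {app for app in ROSTER if mentions(prompt, app)}
    return {app for app in ROSTER if app not in seeded and mentions(answer, app)}


branded = count_apps(
    "Is YNAB worth it?",
    "YNAB is solid, but Monarch Money is a strong alternative.",
)
open_question = count_apps("Best workout tracker?", "Try Strong or Hevy.")
```

In the first call YNAB is excluded because the question seeded it, and the adjective "strong" does not count as the app. A capitalised "Strong" at the start of a sentence would still match, which is why a context check and a flag on these rows are needed on top of case sensitivity.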

Limitations, stated plainly

  • The API is a clean, repeatable stand-in for the consumer apps, which add memory and personalisation. The explore tier widens that gap further. A periodic calibration measures how far the two drift apart.
  • This index is small next to what a vendor with proprietary data can run. The value here is that it is public, neutral, and the same every month.
  • These are previews, in US English. A German track is planned as a separate frozen set.