Methodology

How this is measured

An index is only as trustworthy as its method, so here is the whole thing, down to the arithmetic. Nothing about a run changes between months unless the method version changes with it, and a version change leaves a visible break in the timeline. The full prompt sets are published verbatim.

The questions

Each category has a fixed set of prompts, written the way people actually ask. They are built on two axes. The first is awareness, after Eugene Schwartz: how much the person already knows, from a vague problem ("I've always wanted to learn a language but I never stick with it") up to a named head-to-head ("Duolingo vs Babbel"). The second is the job to be done: the distinct reasons someone reaches for a category, like picking up travel phrases for a trip versus studying to become fluent.

Problem-aware. Names a problem, no solution. "How do I actually make a language stick?"
Solution-aware. Knows the category, wants options. "Do language learning apps actually work, or are they a gimmick?"
Product-aware. The open recommendation questions. "Best language learning app", "best free language app". This is where the ranking comes from, because the assistant is free to name whatever it wants.
Most-aware. Already has a brand in mind. "Is Duolingo enough to actually learn a language?" These seed a brand into the question, so they are read for sentiment and rivals, not counted as discovery.

The set is frozen. Editing a prompt creates a new version with a documented discontinuity, so a trend is never quietly broken. All five published prompt sets (AI apps, fitness, photo and video, security, and language) are frozen as of 3 June 2026, and each was validated against real Google autocomplete demand before freezing.
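One way to make "frozen" checkable is to pin a fingerprint of the prompt list, so that any edit, however small, is detected and forces a new version. The sketch below is an illustration only: the source does not say how freezing is enforced, and the names fingerprint and check_frozen, the choice of SHA-256, and the two-prompt sample list are all assumptions.

```python
import hashlib
import json


def fingerprint(prompts: list[str]) -> str:
    """Hash the prompt list in order; any edit, addition, or reorder changes it."""
    payload = json.dumps(prompts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_frozen(prompts: list[str], pinned: str) -> None:
    """Refuse to run against a set that no longer matches its pinned hash."""
    if fingerprint(prompts) != pinned:
        raise ValueError("prompt set changed: publish it as a new version")


# Two prompts quoted from this page, standing in for a full category set.
frozen = [
    "How do I actually make a language stick?",
    "Best language learning app",
]
pinned = fingerprint(frozen)  # hypothetical: stored alongside the method version

check_frozen(frozen, pinned)  # passes silently while the set is untouched
```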

The assistants

Five answer surfaces, each run with web search on, because that is what a person actually gets. The models are the fast, cheap tier ("explore"), not the flagship one. That is a deliberate choice for a research preview built to re-run on a regular cadence: the cost of the top tier is roughly ten times higher for a result that moves the leaders very little. A flagship tier ("fidelity"), closer to the exact model a logged-in consumer hits, is the planned upgrade if the index earns the spend. Every model ID is pinned and logged below.

Assistant	Model (explore tier)	Grounding	Samples
ChatGPT	gpt-5.4-mini	Responses API, web_search tool	35
Claude	claude-haiku-4-5	Messages API, web_search_20250305 tool	35
Gemini	gemini-3.1-flash-lite	google_search grounding tool	35
Mistral	mistral-small-latest	Agents API, web-search connector	35
Google AI	AI Overviews	AI Overview, captured from the page	6

Google AI Overviews is the odd one out: there is no API, so it is captured from the search page itself, for a six-prompt subset rather than the full set. Treat that column as directional. When a vendor updates the default model, that is logged as an event, because a shift in the answers often traces to the model rather than to anything the apps did.

Scoring

Two numbers, both computed in code, neither asked of a model. For an assistant, the rate is the share of its answers that named the app:

rate(app, assistant) = answers naming the app / total answers × 100

The score is the mean of those rates across the grounded assistants, so an app that leads on one engine but is absent on the rest cannot top the board:

score(app) = mean over grounded assistants of rate(app, assistant)

Google's six-prompt capture is included in that mean as one engine. A score of 47 means: on average, across the assistants, 47% of answers named the app.
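The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the published pipeline: the data shape (one set of named apps per answer), the assistant keys, and the toy answers are assumptions made up for the example.

```python
from statistics import mean


def rate(answers: list[set[str]], app: str) -> float:
    """Share of one assistant's answers that named the app, as a percentage."""
    if not answers:
        raise ValueError("an assistant with no answers has no rate")
    return 100 * sum(app in named for named in answers) / len(answers)


def score(answers_by_assistant: dict[str, list[set[str]]], app: str) -> float:
    """Unweighted mean of the per-assistant rates."""
    return mean(rate(answers, app) for answers in answers_by_assistant.values())


# Made-up toy data: each set holds the apps that one answer named.
toy = {
    "assistant_a": [{"Duolingo", "Babbel"}, {"Duolingo"}, {"Busuu"}, {"Duolingo"}],
    "assistant_b": [{"Babbel"}, {"Duolingo"}],
    "assistant_c": [{"Busuu"}, {"Babbel"}],
}

print(rate(toy["assistant_a"], "Duolingo"))  # 3 of 4 answers -> 75.0
print(score(toy, "Duolingo"))                # mean of 75, 50 and 0
```

Because the mean is taken over assistants and not over answers, an engine with 6 answers weighs the same as one with 35. That is what "included in that mean as one engine" implies.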

Sampling: why each question runs once

AI answers are not deterministic. The same question can return a different list twice in a row. The averaging that smooths this out comes from the number of questions, not from repeating each one. A category is thirty-odd frozen prompts, and that spread is what makes a run stable, the way a survey is stable because it has many respondents, not because it asks one person five times. So each question runs once per assistant, and an app that surfaces in only a handful of answers is fragile, which is itself worth knowing.

I tested that one sample is enough rather than assuming it. Fitness was run three times per assistant, and every run reproduced the same leaders as the average; only the exact rates and the long tail moved. That calibration is written up in "Asking more didn't change the answer".
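A check of that kind can be sketched as follows. The source does not define "leaders" precisely, so this assumes the top k apps by rate, compared as a set; the function names and the placeholder numbers are hypothetical and are not the fitness results.

```python
def leaders(rates: dict[str, float], k: int = 3) -> list[str]:
    """Top-k apps by rate; ties are broken alphabetically so the result is stable."""
    return sorted(rates, key=lambda app: (-rates[app], app))[:k]


def average_rates(runs: list[dict[str, float]]) -> dict[str, float]:
    """Mean rate per app across runs; an app missing from a run counts as 0."""
    apps = set().union(*runs)
    return {app: sum(run.get(app, 0.0) for run in runs) / len(runs) for app in apps}


def leaders_reproduced(runs: list[dict[str, float]], k: int = 3) -> bool:
    """True if every single run has the same top-k set as the averaged run."""
    target = set(leaders(average_rates(runs), k))
    return all(set(leaders(run, k)) == target for run in runs)


# Placeholder rates for three repeated runs of one assistant (not real data).
runs = [
    {"App A": 60.0, "App B": 40.0, "App C": 20.0, "App D": 5.0},
    {"App A": 55.0, "App B": 45.0, "App C": 15.0, "App D": 10.0},
    {"App A": 65.0, "App B": 35.0, "App C": 25.0},
]
print(leaders_reproduced(runs, k=2))
```

In the placeholder data the exact rates move and App D drops out of the third run, yet the top two stay the same, which is the pattern described above.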

What gets counted, and how

Each answer is parsed against a known roster of apps by code, using each app's name plus a list of aliases. Three rules keep the count honest:

Prompt-seeded exclusion. An app named in the question itself ("Is Duolingo enough?", "Duolingo vs Babbel") does not count as the assistant discovering it. Otherwise a branded question would inflate the brand it mentions.
Collision handling. Names that are ordinary words or bigger brands ("Strong", "Copilot", "Speak") are matched case-sensitively and checked in context, so the adjective "strong" is not counted as the app Strong. These rows are flagged.
Dead-entity tracking. A discontinued product is kept on the roster on purpose (Bard, the assistant Google renamed to Gemini in 2024). Whether an assistant still recommends something that no longer exists under that name is one of the more revealing things this measures.
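The first two rules can be sketched in code. This is a simplified illustration under stated assumptions: the App record, the collision field, and count_answer are hypothetical names, and the context check is not implemented here; a collision match is only flagged for review.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    aliases: tuple[str, ...] = ()
    collision: bool = False  # the name is an ordinary word or a bigger brand


def mentions(text: str, app: App) -> bool:
    """Whole-word match on the name or any alias; case-sensitive for collisions."""
    flags = 0 if app.collision else re.IGNORECASE
    return any(
        re.search(rf"(?<!\w){re.escape(term)}(?!\w)", text, flags)
        for term in (app.name, *app.aliases)
    )


def count_answer(prompt: str, answer: str, roster: list[App]) -> tuple[set[str], set[str]]:
    """Return (apps counted as discovered, counted apps flagged as collisions)."""
    discovered, flagged = set(), set()
    for app in roster:
        if mentions(prompt, app):
            continue  # prompt-seeded exclusion: the question already named it
        if mentions(answer, app):
            discovered.add(app.name)
            if app.collision:
                flagged.add(app.name)  # still needs a check in context
    return discovered, flagged


roster = [
    App("Duolingo"),
    App("Babbel"),
    App("Strong", collision=True),
    App("Bard", collision=True),  # discontinued name, kept on the roster on purpose
]
print(count_answer(
    "Is Duolingo enough to actually learn a language?",
    "Duolingo is a strong start; pair it with Babbel.",
    roster,
))
```

Case sensitivity alone is not enough: "Strong" at the start of a sentence would still match, which is why collision rows are flagged and checked in context.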

Limitations, stated plainly

The API is a clean, repeatable stand-in for the consumer apps, which add memory and personalisation. The explore tier widens that gap further. A periodic calibration measures how far the two drift apart.
This index is small next to what a vendor with proprietary data can run. The value here is that it is public, neutral, and the same on every run.
These are previews, in US English. A German track is planned as a separate frozen set.