Analysis · 3 June 2026

Asking more didn't change the answer

Most of my questions run once per assistant. I ran one category three times to check whether that was a mistake. It wasn't, and that is the reason the index stays cheap enough to run every month.

AI answers are not deterministic. Ask the same assistant the same thing twice and the list can come back different. The obvious response is to ask many times and average. The question is how many. Every extra repeat multiplies the cost of a run, and a thing that is expensive to run monthly quietly stops being run monthly.

My answer is that the averaging should come from the number of questions, not the number of repeats. Each category is thirty-odd frozen prompts. That spread already smooths out the luck of a single draw, the same way a survey gets its stability from many respondents, not from asking one person five times. So the index asks each question once.

The test

Before trusting that, I checked it. The fitness category was run three times per assistant instead of once, on the three engines that answer by API: ChatGPT, Claude and Gemini. That gives three independent single-run leaderboards sitting inside one triple-sampled one. If asking once were too noisy, the three single runs would disagree with each other and with the average.

At the level the index actually reports, they held. The published number is a score combined across the assistants, and on that combined ranking the same five apps led every single run: Nike Training Club, Strong, Fitbod, Hevy and JEFIT. Nike Training Club topped the combined ranking and two of the three runs; in the first it tied at the top, level with Fitbod at twenty-four mentions, before pulling clear. The order within the top three and the field below fifth place moved around, but the set of leaders did not.

#	Three samples	Run 1	Run 2	Run 3
1	Nike Training Club 83	Fitbod 24	Nike Training Club 28	Nike Training Club 31
2	Strong 72	Nike Training Club 24	Strong 25	Strong 26
3	Fitbod 68	Strong 21	Fitbod 23	Fitbod 21
4	Hevy 59	Hevy 21	Hevy 19	Hevy 19
5	JEFIT 52	JEFIT 17	JEFIT 19	JEFIT 16

Combined ranking across ChatGPT, Claude and Gemini, each sampled three times. The number after each app is its total mentions in that column, and the "Three samples" column is the sum of the three runs. The same five apps lead on every run; only the order within the top three and the field below fifth place move. In run 1 the top two were tied at 24 mentions, and by runs 2 and 3 Nike Training Club had pulled clear.
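The table can be checked with a few lines of arithmetic. The sketch below is a minimal Python illustration, not the index's actual code: the names `runs`, `combined` and `top_apps` are hypothetical, and the counts are copied by hand from the table. It sums the three runs, finds the leader (or tied leaders) of each run, and confirms that the same five apps appear every time.

```python
# Mentions per run, copied from the combined-ranking table.
# Variable and function names here are illustrative assumptions.
runs = {
    "Run 1": {"Fitbod": 24, "Nike Training Club": 24, "Strong": 21, "Hevy": 21, "JEFIT": 17},
    "Run 2": {"Nike Training Club": 28, "Strong": 25, "Fitbod": 23, "Hevy": 19, "JEFIT": 19},
    "Run 3": {"Nike Training Club": 31, "Strong": 26, "Fitbod": 21, "Hevy": 19, "JEFIT": 16},
}

# Summing the three single runs should reproduce the triple-sampled column.
combined = {}
for counts in runs.values():
    for app, mentions in counts.items():
        combined[app] = combined.get(app, 0) + mentions


def top_apps(counts):
    """Return every app sharing the highest mention count, so ties are kept."""
    best = max(counts.values())
    return sorted(app for app, n in counts.items() if n == best)


# Leader(s) of each single run, and whether each run names the same five apps.
leaders = {name: top_apps(counts) for name, counts in runs.items()}
same_five = all(set(counts) == set(combined) for counts in runs.values())

for app in sorted(combined, key=combined.get, reverse=True):
    print(f"{app}: {combined[app]}")
print(leaders)
print("Same five apps in every run:", same_five)
```

Returning a list of tied leaders rather than a single name matters here, because run 1 has two apps level at the top and a plain sort would silently pick one of them.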

Drop down to a single assistant and it gets noisier. ChatGPT, Claude and Gemini each crowned a different leader, and an engine's own pick sometimes shifted between its own runs. That is the reason the index averages across several assistants and many questions instead of trusting one answer. The per-engine wobble cancels, and what survives is the app named consistently, by more than one assistant, across the whole set.

Where the runs did differ was the part that is already fragile. The exact rate an app scored wobbled by a few points between draws. The order of the mid-pack, below fifth place, reshuffled. Apps named in only one or two answers sometimes vanished from a single run entirely. None of that is the story a reader takes away. The story is who leads, by how much, and whether the category is settled or open, and that survived being cut to a third of the data.

So I ask once

A single sample is a coarser measurement than three, and I say so: treat a rate as a rough share, not a decimal-precise one, and treat the long tail as a sighting rather than a ranking. What a single sample buys in return is a run that costs a third as much, which is the difference between a monthly index and a one-off. The point of this project is the time series, and the time series only exists if every month is affordable.

This is also why the fitness page is the one place you'll see three samples rather than one. Those are the calibration numbers, kept on the page rather than thrown away. Every other category runs at one sample per question, for the reason shown above. The full sampling rule lives in the methodology.