Analysis · 3 June 2026
Asking more didn't change the answer
Most of my questions run once per assistant. I ran one category three times to check whether that was a mistake. It wasn't, and that is the reason the index stays cheap enough to run every month.
AI answers are not deterministic. Ask the same assistant the same thing twice and the list can come back different. The obvious response is to ask many times and average. The question is how many. Every extra repeat adds the full cost of another run, and a thing that is expensive to run monthly quietly stops being run monthly.
My answer is that the averaging should come from the number of questions, not the number of repeats. Each category is thirty-odd frozen prompts. That spread already smooths out the luck of a single draw, the same way a survey gets its stability from many respondents, not from asking one person five times. So the index asks each question once.
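The survey comparison can be put in rough numbers. What follows is a minimal sketch under a deliberately simple model, not the index's actual scoring: it assumes every answer independently names a given app with the same probability `p`, and it takes "thirty-odd" as exactly 30. The function name `share_standard_error` and the value `p = 0.5` are illustrative choices. Real answers to different prompts are neither identical nor fully independent, so read the output as an order of magnitude.

```python
import math

def share_standard_error(p, n_questions, repeats=1):
    """Standard error of an observed mention share under the toy model.

    Assumes each of the n_questions * repeats answers independently
    names the app with probability p (a simplification).
    """
    n_answers = n_questions * repeats
    return math.sqrt(p * (1 - p) / n_answers)

# Assumed values: 30 prompts, an app named in half of all answers.
once = share_standard_error(0.5, n_questions=30, repeats=1)
thrice = share_standard_error(0.5, n_questions=30, repeats=3)

print(f"asked once:   +/- {once:.3f}")
print(f"asked thrice: +/- {thrice:.3f}")
print(f"noise ratio:  {once / thrice:.2f}")
```

In this model the noise depends only on the total number of answers. Going from one answer to thirty shrinks it by a factor of about 5.5 (the square root of 30), while tripling the repeats on top of that shrinks it by a further factor of about 1.7 (the square root of 3) at three times the cost.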
The test
Before trusting that, I checked it. The fitness category was run three times per assistant instead of once, on the three engines that answer by API: ChatGPT, Claude and Gemini. That gives three independent single-run leaderboards sitting inside one triple-sampled one. If asking once were too noisy, the three single runs would disagree with each other and with the average.
At the level the index actually reports, they held. The published number is a score combined across the assistants, and on that combined ranking the same five apps led every single run: Nike Training Club, Strong, Fitbod, Hevy and JEFIT. Nike Training Club topped the combined ranking and two of the three runs (Runs 2 and 3); in Run 1, the top three were tied at twenty-four mentions each. The order below fifth place moved around, but the set of leaders did not.
| # | Three samples | Run 1 | Run 2 | Run 3 |
|---|---|---|---|---|
| 1 | Nike Training Club 83 | Strong 24 | Nike Training Club 28 | Nike Training Club 31 |
| 2 | Strong 78 | Fitbod 24 | Strong 27 | Strong 27 |
| 3 | Fitbod 68 | Nike Training Club 24 | Fitbod 23 | Fitbod 21 |
| 4 | Hevy 59 | Hevy 21 | Hevy 19 | Hevy 19 |
| 5 | JEFIT 52 | JEFIT 17 | JEFIT 19 | JEFIT 16 |
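The table can be checked mechanically. This is a small sketch, not the index's own pipeline: the counts are copied from the five rows shown above, and the helper names `combined_totals` and `leaders` are made up for illustration. It sums the three runs, ranks the totals, and lists whichever apps share the top count in each run.

```python
# Mention counts per run, copied from the table above (top five rows only).
runs = {
    "Run 1": {"Nike Training Club": 24, "Strong": 24, "Fitbod": 24, "Hevy": 21, "JEFIT": 17},
    "Run 2": {"Nike Training Club": 28, "Strong": 27, "Fitbod": 23, "Hevy": 19, "JEFIT": 19},
    "Run 3": {"Nike Training Club": 31, "Strong": 27, "Fitbod": 21, "Hevy": 19, "JEFIT": 16},
}

def combined_totals(runs):
    """Sum each app's mentions across all runs."""
    totals = {}
    for counts in runs.values():
        for app, n in counts.items():
            totals[app] = totals.get(app, 0) + n
    return totals

def leaders(counts):
    """Return every app sharing the highest count, so ties stay visible."""
    top = max(counts.values())
    return sorted(app for app, n in counts.items() if n == top)

totals = combined_totals(runs)
ranking = sorted(totals, key=totals.get, reverse=True)

print("combined:", [(app, totals[app]) for app in ranking])
for name, counts in runs.items():
    print(name, "leader(s):", leaders(counts))
```

The summed totals reproduce the "Three samples" column, and the per-run leaders show the three-way tie in Run 1 alongside Nike Training Club's outright lead in Runs 2 and 3.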
Drop down to a single assistant and it gets noisier. ChatGPT, Claude and Gemini each crowned a different leader, and an engine's own pick sometimes shifted between its own runs. That is the reason the index averages across several assistants and many questions instead of trusting one answer. The per-engine wobble cancels, and what survives is the app named consistently, by more than one assistant, across the whole set.
Where the runs did differ was the part that is already fragile. The exact rate an app scored wobbled by a few points between draws. The order of the mid-pack, below fifth place, reshuffled. Apps named in only one or two answers sometimes vanished from a single run entirely. None of that is the story a reader takes away. The story is who leads, by how much, and whether the category is settled or open, and that survived being cut to a third of the data.
So I ask once
A single sample is a coarser measurement than three, and I say so: treat a rate as a rough share, not a decimal-precise one, and treat the long tail as a sighting rather than a ranking. What a single sample buys in return is a run that costs a third as much, which is the difference between a monthly index and a one-off. The point of this project is the time-series, and the time-series only exists if every month is affordable.
This is also why the fitness page is the one place you'll see three samples rather than one. Those are the calibration numbers, kept on the page rather than thrown away. Every other category runs at one sample per question, for the reason shown above. The full sampling rule lives in the methodology.