AI Had a Blind Spot. I Fixed It with One Line of Prompt

WC2026 AI Prediction Arena — How prompt engineering fixed AI's draw blind spot — Seven AI platforms. Zero draw predictions. One line of prompt changed everything.

Netherlands 2–2 Japan. Five of seven AI platforms called the draw. Seventy-two hours earlier, none of them would have.

Same tournament. Same platforms. Same prediction system I described in the first post. The only thing that changed was one line in the prompt.

That line is the story.

Zero for twenty-one

For the first three days of the WC2026 AI Prediction Arena, I ran seven AI platforms — ChatGPT, Claude, Gemini, Grok, Perplexity, DeepSeek, Kimi — through six completed matches.

Three of those matches ended level: Qatar 1–1 Switzerland, Brazil 1–1 Morocco, Canada 1–1 Bosnia.

Seven platforms on three drawn matches: twenty-one individual predictions. Zero draws.

Every time, all seven platforms picked a winner. 100% consensus. 100% wrong.

After six matches, the best performers — ChatGPT, Gemini, Grok, Kimi — sat at 50%. Claude, DeepSeek, and Perplexity were at 33%.

That is not random error. That is structural.
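To put a rough number on "not random", here is a minimal sketch under a toy model. It assumes, purely for illustration, that an unbiased predictor would pick a draw about as often as draws occur (the 25–30% group-stage base rate quoted in the prompt line later in this post) and that the 21 picks are independent. Neither assumption is realistic, since all seven platforms see the same matches, so treat the output as an order of magnitude and not a formal test. The function name is hypothetical.

```python
def prob_no_draw_picks(n_predictions: int, draw_rate: float) -> float:
    """Chance that n independent picks contain zero draws,
    when each pick is a draw with probability draw_rate (toy model)."""
    return (1.0 - draw_rate) ** n_predictions


# Base-rate band taken from the prompt line in the post; independence is an assumption.
for rate in (0.25, 0.30):
    p = prob_no_draw_picks(21, rate)
    print(f"draw rate {rate:.0%}: P(zero draws in 21 picks) = {p:.4%}")
```

Under those assumptions the chance of going zero for twenty-one comes out well below 1% at either end of the band, which is consistent with a systematic bias and hard to square with bad luck.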

Why AI defaults to winners

The information economy is biased toward decisive outcomes. News leads with winners. Previews lead with favourites. Highlights lead with goals.

Draws don’t generate headlines. They’re the non-events of the sports internet — and that means they’re under-represented in the data AI retrieves when it makes a prediction.

Every platform knew draws exist. But when forced to commit, they reached for the favourite. Every time.

This is exactly the kind of structural bias that only shows up when you test the system at scale, in public, with locked predictions. Which is why I built the arena.

One line

I added a single sentence to the prediction prompt:

World Cup group-stage matches historically produce draws approximately 25–30% of the time. Do not avoid predicting a draw if the evidence supports it.

No model swap. No architecture change. One line of context.
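As a minimal sketch of what "one line of context" can look like in practice: the calibration sentence below is quoted from this post, but the base prompt wording and the `build_prompt` helper are hypothetical placeholders, not the arena's actual prompt.

```python
# Hypothetical base prompt; the real arena prompt is not reproduced in this post.
BASE_PROMPT = (
    "Predict the result of {home} vs {away}. "
    "Answer with exactly one of: {home}, {away}, Draw."
)

# The calibration line, quoted from the post.
DRAW_CALIBRATION = (
    "World Cup group-stage matches historically produce draws approximately "
    "25–30% of the time. Do not avoid predicting a draw if the evidence supports it."
)


def build_prompt(home: str, away: str, calibrate: bool = True) -> str:
    """Return the prediction prompt, optionally with the draw base-rate line appended."""
    prompt = BASE_PROMPT.format(home=home, away=away)
    if calibrate:
        prompt = f"{prompt}\n\n{DRAW_CALIBRATION}"
    return prompt


print(build_prompt("Netherlands", "Japan"))
```

Keeping the calibration line behind a flag makes it easy to send the same match to the same platform with and without it, so the only variable is the added sentence.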

The first calibrated batch: 3 draw predictions out of 21. The evening batch overcorrected — 13 out of 21 were draws. The pendulum swung too far.

But it proved the thesis: the answer depends on the question.
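The pendulum swing is easier to see as percentages. The sketch below compares each batch's share of draw picks with the 25–30% band from the prompt line. The counts come from this post; the helper names and the idea of using the base-rate band as a target are assumptions for illustration, and a pick share is only a rough proxy for calibration.

```python
def draw_share(draw_picks: int, total_picks: int) -> float:
    """Fraction of predictions in a batch that were draws."""
    return draw_picks / total_picks


def classify(share: float, low: float = 0.25, high: float = 0.30) -> str:
    """Label a draw share relative to the assumed base-rate band."""
    if share < low:
        return "under"
    if share > high:
        return "over"
    return "within"


# (draw picks, total picks) per batch, as reported in the post.
batches = {
    "before the fix": (0, 21),
    "first calibrated batch": (3, 21),
    "evening batch": (13, 21),
}

for name, (draws, total) in batches.items():
    share = draw_share(draws, total)
    print(f"{name}: {share:.1%} draws ({classify(share)} the 25-30% band)")
```

On these numbers the first calibrated batch still sits under the band at about 14%, while the evening batch lands at roughly 62%, more than double the upper bound.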

Then Netherlands played Japan

The calibrated system had five of seven platforms on a draw. The match finished 2–2. Kamada equalised in the 88th minute.

Before the fix, every draw was a 0/7 miss. After the fix, the arena called one at 71.4% consensus.

Match	Result	Consensus	Hit rate
🏴󠁧󠁢󠁳󠁣󠁴󠁿 Scotland 1–0 Haiti 🇭🇹	Scotland	57% Scotland	4/7
🇦🇺 Australia 2–0 Turkey 🇹🇷	Australia	86% Turkey	0/7
🇩🇪 Germany 7–1 Curaçao 🇨🇼	Germany	100% Germany	7/7
🇳🇱 Netherlands 2–2 Japan 🇯🇵	Draw	71% Draw	5/7
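
The consensus and hit-rate columns can be derived from the raw picks with a few lines. This is a sketch only: the post reports that five of seven platforms picked the draw in Netherlands vs Japan, but not what the other two picked or which platform picked what, so the platform keys and the two "Netherlands" picks below are hypothetical.

```python
from collections import Counter


def consensus(picks: dict[str, str]) -> tuple[str, float]:
    """Return the most common pick and the share of platforms behind it.
    Ties resolve to whichever tied pick appears first in the input."""
    counts = Counter(picks.values())
    outcome, votes = counts.most_common(1)[0]
    return outcome, votes / len(picks)


def hits(picks: dict[str, str], result: str) -> int:
    """Count platforms whose pick matches the actual result."""
    return sum(1 for pick in picks.values() if pick == result)


# Hypothetical assignment: only the 5-of-7 draw count comes from the post.
picks = {
    "platform_1": "Draw",
    "platform_2": "Draw",
    "platform_3": "Draw",
    "platform_4": "Draw",
    "platform_5": "Draw",
    "platform_6": "Netherlands",
    "platform_7": "Netherlands",
}

outcome, share = consensus(picks)
print(f"Consensus: {share:.1%} {outcome}, hit rate {hits(picks, 'Draw')}/{len(picks)}")
```

Note that consensus and hit rate are separate measures: in the Australia row, consensus was 86% while the hit rate was 0/7.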

Updated leaderboard after 10 matches:

Platform	Correct	Accuracy
ChatGPT	6/10	60%
Grok	6/10	60%
Gemini	5/10	50%
Kimi	5/10	50%
DeepSeek	4/10	40%
Perplexity	4/10	40%
Claude	3/10	30%

The blind spot was measurable. It was fixable. And the fix worked on the next live match.

What it didn’t fix

Same day. Australia beat Turkey 2–0. Six of seven platforms picked Turkey, an 86% consensus. None of the seven picked Australia.

The draw calibration addressed one failure mode — models suppressing a common result type. It didn’t help with cold upsets. Nobody saw Irankunda coming.

An experiment that hides its failures isn’t an experiment; it’s an ad. The arena keeps both, and the later 1,823-prediction GEO analysis shows what those accumulated results revealed about AI judgment.

This is what GEO Foresight does

The football is a proof-of-concept.

At Tocanan, we run a system called GEO Foresight that does the same thing for brands. Carefully engineered questions, asked across ChatGPT, Gemini, Perplexity, Claude, Grok, DeepSeek, Kimi, and Chinese-language AI platforms — surfacing how AI actually perceives your brand, your category, your competitors.

The principle is identical: if you don’t design the question properly, the AI gives you a structurally biased answer. If you don’t tell it to consider draws, it won’t pick one. If you don’t ask it the right questions about your brand, you won’t see the blind spots.

You might think you’re visible. ChatGPT might recommend you. But Gemini might not mention you. Perplexity might cite your competitor instead.

Same question, same day, different platforms, different realities.

That gap is what we measure. audit.tocanan.ai — five minutes, free. See what AI currently says about you.

Follow the experiment

The arena runs daily through the final on 19 July. Every prediction locks before kickoff. Every result stays visible.

Live tracker: wc26.tocanan.ai

Next week: does the draw calibration hold, or does AI find a new way to be confidently wrong?

Frequently Asked Questions

What is prompt engineering in AI predictions?

Prompt engineering is the design of the question you give an AI system. In this experiment, adding one line of historical context (the base rate of draws in World Cup group stages) shifted the output from zero draw predictions to five of seven platforms calling the draw in Netherlands 2–2 Japan. The same sensitivity applies to any question you ask AI about your industry or brand.

How does question design affect AI answers about brands?

A generic question gets a generic answer — usually the biggest names in the category. A sharper, more specific question reveals positioning gaps, competitor mentions, citation sources, and platform-specific blind spots that brands don’t see until someone asks the right question the right way.

What is GEO Foresight?

GEO Foresight is Tocanan’s intelligence system for tracking how AI platforms represent brands. It uses engineered question sets across seven global and Chinese-language AI platforms to measure visibility, citation authority, competitive positioning, and divergence — then surfaces the gaps.

Can AI predict football matches accurately?

AI is strong on obvious favourites and weak on uncertainty. After ten matches, the best platform sits at 60% — better than a coin flip, worse than a bookie. The WC2026 AI Prediction Arena is designed to test exactly where that confidence breaks down.

About the Author

Eden Lau is CEO of Tocanan.ai, a GEO intelligence company that tracks how AI platforms represent brands across ChatGPT, Gemini, Perplexity, Claude, Grok, DeepSeek, and Kimi. With 30+ years in marketing data strategy, he previously co-founded Brandtology. Connect on LinkedIn.
