Meta, Google, and OpenAI allegedly exploited undisclosed private testing on Chatbot Arena to secure top rankings, raising concerns about fairness and transparency in AI model benchmarking.
 
A handful of dominant AI companies have been quietly manipulating one of the most influential public leaderboards for chatbot models, potentially distorting perceptions of model performance and undermining open competition, according to a new study.
The research, titled “The Leaderboard Illusion,” was published by a team of experts from Cohere Labs, Stanford University, Princeton University, and other institutions. It scrutinized the operations of Chatbot Arena, a widely used public platform that allows users to compare generative AI models through pairwise voting on model responses to user prompts.
The study revealed that major tech firms — including Meta, Google, and OpenAI — were given privileged access to test multiple versions of their AI models privately on Chatbot Arena. By selectively publishing only the highest-performing versions, these companies were able to boost their rankings, the study found.
“Chatbot Arena currently permits a small group of preferred providers to test multiple models privately and only submit the score of the final preferred version,” the study said.
Chatbot Arena, Google, Meta, and OpenAI did not respond to requests for comment on the study.
Private testing privilege skews rankings
The Chatbot Arena, launched in 2023, has rapidly become the go-to public benchmark for evaluating generative AI models through pairwise human comparisons. However, the new study reveals systemic flaws that undermine its integrity, most notably the ability of select developers to conduct undisclosed private testing.
Meta reportedly tested 27 separate large language model variants in a single month in the lead-up to its Llama 4 release. Google and Amazon also submitted multiple hidden variants. In contrast, most smaller firms and academic labs submitted just one or two public models, unaware that such behind-the-scenes evaluation was possible.
This “best-of-N” submission strategy, the researchers argue, violates the statistical assumptions of the Bradley-Terry model — the algorithm Chatbot Arena uses to rank AI systems based on head-to-head comparisons.
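The selection effect behind best-of-N can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the study's method: every private variant is assumed to have the same true score, each measured score is assumed to carry Gaussian noise, and the numbers used (a true score of 1200, a noise standard deviation of 10 points) and the function names are hypothetical values chosen for illustration.

```python
import random
import statistics


def best_of_n_score(true_score, noise_sd, n_variants, rng):
    """Highest measured score among n_variants equally skilled variants.

    Each measurement is the true score plus Gaussian noise (an assumption
    made for this sketch, not a claim about Arena's actual noise).
    """
    return max(rng.gauss(true_score, noise_sd) for _ in range(n_variants))


def mean_reported_score(true_score, noise_sd, n_variants, trials=2000, seed=0):
    """Average score a provider would report if it always published its best variant."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    return statistics.fmean(
        best_of_n_score(true_score, noise_sd, n_variants, rng)
        for _ in range(trials)
    )


if __name__ == "__main__":
    # Hypothetical values: true score 1200, noise standard deviation 10 points.
    for n in (1, 3, 10, 27):
        print(n, round(mean_reported_score(1200.0, 10.0, n), 1))
```

With a single submission the average reported score stays close to the true score. As the number of private variants grows, the published maximum drifts upward even though no variant is actually better, which is the bias the researchers describe.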
To demonstrate the effect of this practice, the researchers conducted their own experiments on Chatbot Arena. In one case, they submitted two identical checkpoints of the same model under different aliases. Despite being functionally the same, the two versions received significantly different scores — a discrepancy of 17 points on the leaderboard.
In another case, two slightly different versions of the same model were submitted. The variant with marginally better alignment to Chatbot Arena’s feedback dynamics outscored its sibling by nearly 40 points, with nine models falling in between the two in the final rankings.
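To put the 17-point and 40-point gaps in perspective, a rating gap can be converted into an expected head-to-head win rate. The sketch below assumes Arena scores sit on an Elo-like scale where a 400-point gap corresponds to 10-to-1 odds. The article does not state the scale, so the `scale` parameter and the function name are assumptions made for illustration.

```python
def win_probability(rating_gap, scale=400.0):
    """Bradley-Terry style win probability for the higher-rated model.

    Assumes an Elo-like scale: a gap of `scale` points means 10-to-1 odds.
    The default of 400 is an assumption, not a figure from the article.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    for gap in (0, 17, 40):
        print(gap, round(win_probability(gap), 3))
```

Under that assumption, both gaps correspond to win rates only a few percentage points above 50%, which helps explain how measurement noise and small tuning differences can reorder closely ranked models.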
Disproportionate access to data
The leaderboard distortion isn’t just about testing privileges. The study also highlights stark data access imbalances. Chatbot Arena collects user interactions and feedback data during every model comparison — data that can be crucial for training and fine-tuning models.
Proprietary LLM providers such as OpenAI and Google received a disproportionately large share of this data. According to the study, OpenAI and Google received an estimated 19.2% and 20.4% of all Arena data, respectively. In contrast, 83 open-weight models shared only 29.7% of the data. Fully open-source models, which include many from academic and nonprofit organizations, collectively received just 8.8% of the total data.
This uneven distribution stems from preferential sampling rates, where proprietary models are shown to users more frequently, and from opaque deprecation practices. The study found that 205 of 243 public models had been silently deprecated, meaning they were removed or sidelined from the platform without notification, and that open-source models were disproportionately affected.
“Deprecation disproportionately impacts open-weight and open-source models, creating large asymmetries in data access over time,” the study stated.
These dynamics not only favor the largest companies but also make it harder for new or smaller entrants to gather enough feedback data to improve or fairly compete.
Leaderboard scores don’t always reflect real-world capability
One of the study’s key findings is that access to Arena-specific data can significantly boost a model’s performance — but only within the confines of the leaderboard itself.
In controlled experiments, researchers trained models using different proportions of Chatbot Arena data. When 70% of the training data came from the Arena, the model’s performance on ArenaHard — a benchmark set that mirrors Arena distribution — more than doubled, rising from a win rate of 23.5% to 49.9%.
However, this performance bump did not translate into gains on broader academic benchmarks such as Massive Multitask Language Understanding (MMLU), a benchmark designed to measure the knowledge a model acquires during pretraining. In fact, results on MMLU declined slightly, suggesting the models were being tuned narrowly to the Arena environment.
“Leaderboard improvements driven by selective data and testing do not necessarily reflect broader advancements in model quality,” the study warned.
Call for transparency and reform
The study’s authors said these findings highlight a pressing need for reform in how public AI benchmarks are managed.
They have called for greater transparency, urging Chatbot Arena organizers to prohibit score retraction, limit the number of private variants tested, and ensure fair sampling rates across providers. They also recommend that the leaderboard maintain and publish a comprehensive log of deprecated models to ensure clarity and accountability.
“There is no reasonable scientific justification for allowing a handful of preferred providers to selectively disclose results,” the study added. “This skews Arena scores upwards and allows a handful of preferred providers to game the leaderboard.”
The researchers acknowledge that Chatbot Arena was launched with the best of intentions — to provide a dynamic, community-driven benchmark during a time of rapid AI development. But they argue that successive policy choices and growing pressure from commercial interests have compromised its neutrality.
While Chatbot Arena organizers have previously acknowledged the need for better governance, including in a blog post published in late 2024, the study suggests that current efforts fall short of addressing the systemic bias.
What does it mean for the AI industry?
The revelations come at a time when generative AI models are playing an increasingly central role in business, government, and society. Organizations evaluating AI systems for deployment — from chatbots and customer support to code generation and document analysis — often rely on public benchmarks to guide purchasing and adoption decisions.
If those benchmarks are compromised, so too is the decision-making that depends on them.
The researchers warn that the perception of model superiority based on Arena rankings may be misleading, especially when top placements are influenced more by internal access and tactical disclosure than actual innovation.
“A distorted scoreboard doesn’t just mislead developers,” the study noted. “It misleads everyone betting on the future of AI.”

 
    