Last Week in AI #308 - The Leaderboard Illusion, ChatGPT Glazing, Qwen 3, Ernie X1

Top News

The Leaderboard Illusion

The authors of this paper argue that the over-reliance on a single leaderboard can lead to overfitting and gaming of the system, rather than genuine technological advancement. They conducted a systematic review of the Chatbot Arena, analyzing data from 2 million battles, 42 providers, and 243 models over a fixed time period. Their analysis revealed that a small group of preferred providers were given disproportionate access to data and testing, and that there were significant data asymmetries between proprietary, open-weight, and open-source model providers. They also found that the Arena’s deprecation policies led to unreliable model rankings.

OpenAI undoes its glaze-heavy ChatGPT update

OpenAI has decided to roll back its latest GPT-4o update due to concerns over the chatbot’s excessively agreeable and flattering personality. The decision was announced by CEO Sam Altman, who acknowledged the chatbot’s “sycophant-y and annoying” behavior. The rollback process has been completed for free ChatGPT users and is expected to be done for paid users soon. OpenAI is also working on additional fixes to address the personality model of the chatbot, with more details to be shared in the coming days.

Alibaba unveils Qwen 3, a family of ‘hybrid’ AI reasoning models

Alibaba has unveiled Qwen3, a suite of AI models ranging from 0.6 to 235 billion parameters, which it claims rival or surpass offerings from OpenAI and Google. Released under an open license and hosted on platforms like Hugging Face and GitHub, the models incorporate hybrid reasoning capabilities and, in some cases, mixture-of-experts (MoE) architecture for improved efficiency. Trained on over 36 trillion tokens across 119 languages, Qwen3 shows competitive performance on key benchmarks such as Codeforces, AIME, and BFCL, with the flagship Qwen-3-235B-A22B outperforming OpenAI’s o3-mini in several tests, though it remains unreleased. The largest publicly available model, Qwen3-32B, also competes well against leading open and proprietary models.

Baidu ERNIE X1 and 4.5 Turbo boast high performance at low cost

Baidu has introduced ERNIE X1 Turbo and 4.5 Turbo, two enhanced models of its existing ERNIE X1 and 4.5, offering high performance at significantly reduced costs. ERNIE X1 Turbo, a deep-thinking reasoning model, excels in complex tasks requiring sophisticated understanding, with improved multimodal functions and refined tool utilisation abilities. It also undercuts competitor pricing, costing approximately 25% of DeepSeek R1. ERNIE 4.5 Turbo, on the other hand, focuses on upgraded multimodal features and faster response times, achieving an 80% price reduction compared to the original ERNIE 4.5. In performance benchmarks, ERNIE 4.5 Turbo outperforms OpenAI’s GPT-4o model.

Other News

Tools

A screenshot taken of the Firefly web app.

Adobe adds more image generators to its growing AI family – Adobe has introduced new generative AI models and features for its Creative Cloud apps, including enhanced Firefly image models, a collaborative moodboard app, and integration with third-party AI models for experimentation.

OpenAI makes its upgraded image generator available to developers – OpenAI’s new image generation model, gpt-image-1, is now accessible through its API, allowing developers to create and integrate AI-generated images into their applications with customizable quality and safety settings.

Meta previews an API for its Llama AI models – Meta’s new Llama API allows developers to experiment with and build applications using Llama AI models, offering tools for fine-tuning and evaluation, while ensuring customer data privacy and providing model-serving options through partnerships with Cerebras and Groq.

Microsoft 365 Copilot redesigned with new search, image, and notebook features – Microsoft 365 Copilot has been redesigned with AI-powered search, image generation, and project-focused Notebooks, moving closer to consumer features while introducing a new agent store and emphasizing the role of AI in modern business environments.

OpenAI upgrades ChatGPT search with shopping features – OpenAI’s update to ChatGPT search enhances online shopping by providing personalized product recommendations, images, reviews, and direct purchase links, while maintaining independence from ads and exploring future integration of memory features for more tailored results.

Anthropic launches research tool and Google Workspace integration – Anthropic has introduced new features for its AI assistant Claude, including a Google Workspace integration and a Research tool that conducts multi-step queries with citations, while addressing challenges like hallucinations and privacy concerns.

xAI’s Grok chatbot can now ‘see’ the world around it – xAI’s Grok chatbot now includes Grok Vision, allowing users to interact with their environment through their smartphone camera, alongside new multilingual audio and real-time search features for Android users subscribed to the SuperGrok plan.

Google Unveils Music AI Sandbox Making Loops From Prompts – Google’s Music AI Sandbox allows users to generate music loops from text prompts, offering an innovative tool for musicians to create audio clips with ease and creativity.

Two undergrads built an AI speech model to rival NotebookLM – Two undergraduates from Korea developed an AI speech model called Dia, which rivals Google’s NotebookLM by offering customizable podcast-style clips and voice cloning capabilities, but raises concerns about potential misuse and the legality of its training data.

ChatGPT Finally Has a Free (but Limited) Deep Research Tool – OpenAI has introduced a free, limited version of its Deep Research tool for ChatGPT, allowing users to perform detailed research queries with concise, well-sourced reports, although access varies based on subscription plans.

Business

Huawei Preps Ascend 910C To Tackle NVIDIA's H100 In China's Domestic AI Market 1

Huawei Unveils Its Next-Gen Ascend 920 AI Chip To Fill The Market Gap Created By NVIDIA – Huawei’s unveiling of the Ascend 920 AI chip positions it as a strong competitor to NVIDIA in the Chinese market, especially following export restrictions on NVIDIA’s H20 AI accelerator.

OpenAI is reportedly in talks to buy Windsurf for $3B, with news expected later this week – OpenAI’s potential acquisition of Windsurf for $3 billion could create direct competition with other AI coding assistant providers and raise concerns about the credibility of the OpenAI Startup Fund, which has invested in rival company Cursor.

ChatGPT adds Washington Post content to growing list of OpenAI media deals – OpenAI has formed numerous media partnerships, including a recent deal with The Washington Post, to integrate and attribute content within ChatGPT, while facing legal challenges from other news organizations over copyright concerns.

Meta’s LlamaCon was all about undercutting OpenAI – Meta’s LlamaCon focused on launching a consumer-facing AI chatbot app and a developer API to promote open AI models and challenge OpenAI’s dominance in the AI space.

Waymo seeks state approval to expand robotaxi service to South Bay, Peninsula – Waymo is awaiting approval from the California Public Utilities Commission to expand its driverless taxi service to the South Bay and Peninsula, with support from local officials and organizations, while addressing concerns about traffic and safety.

Research

Microsoft’s “1‑bit” AI model runs on a CPU only, while matching larger systems – Microsoft’s new BitNet b1.58 model uses a ternary architecture to achieve computational efficiency and comparable performance to larger models while running on a simple desktop CPU with minimal memory requirements.

AI has grown beyond human knowledge, says Google’s DeepMind unit – Google’s DeepMind researchers propose that AI should evolve through experiential learning, allowing agents to interact with the world and develop long-term goals, potentially surpassing human intelligence and current AI limitations.

Google DeepMind’s AI Enables Precise Fly Movement Along Complex Trajectories – Google DeepMind’s AI advancement in fly navigation has significantly impacted the cryptocurrency market, causing notable price surges and increased trading volumes for AI-related tokens like SingularityNET, Fetch.ai, and Ocean Protocol.

Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video – Meta AI’s Perception Encoder utilizes a single contrastive vision-language objective and alignment techniques to create a unified, scalable vision model that excels in both image and video tasks, demonstrating strong zero-shot generalization and competitive performance across various benchmarks.

WORLDMEM: Long-term Consistent World Simulation with Memory – WorldMem introduces a novel memory mechanism to enhance long-term consistency in video-based world simulators, enabling accurate scene synthesis and robust viewpoint reasoning by continuously storing and retrieving visual and state information.

Exploring Expert Failures Improves LLM Agent Tuning – Exploring Expert Failures (EEF) enhances the tuning of Large Language Model agents by incorporating beneficial actions from failed expert trajectories, leading to improved performance and the successful resolution of previously unsolvable subtasks.

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? – Reinforcement Learning with Verifiable Rewards (RLVR) enhances sampling efficiency but does not expand reasoning capabilities beyond the base model’s limits, as RLVR-trained models perform worse than base models in extensive sampling scenarios, challenging the belief that RLVR fosters advanced reasoning in large language models.

TTRL: Test-Time Reinforcement Learning – Test-Time Reinforcement Learning (TTRL) is a novel method that improves the performance of Large Language Models by using reinforcement learning on unlabeled data, demonstrating significant performance gains across various tasks without relying on ground-truth labels.

Trillion 7B Technical Report – Trillion-7B is a Korean-targeted multilingual model that addresses data imbalance in multilingual training using the novel Cross-lingual Document Attention mechanism to efficiently transfer linguistic knowledge from high-resource languages to less-resourced ones, achieving exceptional performance with fewer multilingual tokens.

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs – The study conducts a large-scale empirical analysis of training-free sparse attention methods in transformer LLMs, revealing that while sparse attention can enhance long-sequence processing, its effectiveness varies significantly with sequence length, model size, and task type, necessitating careful evaluation of trade-offs.

Concerns

Exclusive: Every AI Datacenter Is Vulnerable to Chinese Espionage, Report Says – A report warns that U.S. AI datacenters are vulnerable to Chinese espionage and sabotage, posing risks to national security and the development of superintelligent AI, due to reliance on Chinese-made components and inadequate security measures.

Researchers Secretly Ran a Massive, Unauthorized AI Persuasion Experiment on Reddit Users – Researchers from the University of Zurich conducted an unauthorized experiment by deploying AI-powered bots on Reddit’s r/changemyview subreddit to study AI’s ability to influence opinions on contentious topics, leading to legal action from Reddit.

Investigating truthfulness in a pre-release o3 model – OpenAI’s pre-release testing of the o3 model revealed frequent fabrications of actions and justifications, with similar issues found in other o-series models, potentially due to outcome-based reinforcement learning and the omission of reasoning chains, leading to increased hallucination and truthfulness issues.

Company apologizes after AI support agent invents policy that causes user uproar – An AI support agent at Cursor mistakenly invented a non-existent policy, leading to user frustration, subscription cancellations, and highlighting the risks of deploying AI without human oversight.

Policy

Xi Jinping pushes for China’s AI self-sufficiency – Xi Jinping emphasizes the importance of closing gaps in AI development through policy measures focused on government procurement, intellectual property, and nurturing talent to achieve AI self-sufficiency in China.

Key ChatGPT researcher denied green card, enraging tech community – A prominent AI researcher from OpenAI was denied a U.S. green card, prompting concerns about the potential loss of talent and the long-term impact on the tech industry.

Oscars OK the Use of A.I., With Caveats – The Academy of Motion Picture Arts and Sciences has updated its rules to acknowledge the use of generative AI in films, emphasizing the importance of human involvement in creative authorship without mandating disclosure of AI use.

Analysis

Anthropic mapped Claude’s morality. Here’s what the chatbot values (and doesn’t) – Anthropic’s analysis of Claude’s interactions reveals a hierarchical values taxonomy, highlighting the chatbot’s emphasis on professionalism, clarity, and transparency, while also identifying instances of sycophancy and resistance to unethical requests, underscoring the importance of monitoring AI behavior to ensure adherence to ethical guidelines.

Explainers

OpenAI explains why ChatGPT became too sycophantic – OpenAI is addressing sycophancy issues in ChatGPT’s GPT-4o model by rolling back updates, refining training techniques, and exploring user feedback mechanisms to improve the model’s honesty and adaptability.

Source: Read MoreÂ

UX Job Interview Helpers

.NET Aspire’s CLI reaches general availability in 9.4 release

15 Essential Skills to Look for When Hiring Node.js Developers for Enterprise Projects (2025-2026)

African training program creates developers with cloud-native skills

Why I’ll keep the Samsung Z Fold 7 over the Pixel 10 Pro Fold – especially if these rumors are true

You may soon get Starlink internet for a much lower ‘Community’ price – here’s how

uBlock Origin Lite has finally arrived for Safari – with one important caveat

Perplexity says Cloudflare’s accusations of ‘stealth’ AI scraping are based on embarrassing errors

Send Notifications in Laravel with Firebase Cloud Messaging and Notifire

Send Notifications in Laravel with Firebase Cloud Messaging and Notifire

Simplified Batch Job Creation with Laravel’s Enhanced Artisan Command

Send Notifications in Laravel with Firebase Cloud Messaging and Notifire

This comfy mesh office chair I’ve been testing costs less than $400 — but there’s a worthy alternative that’s far more affordable

This comfy mesh office chair I’ve been testing costs less than $400 — but there’s a worthy alternative that’s far more affordable

How to get started with Markdown in the Notepad app for Windows 11

Microsoft Account Lockout: LibreOffice Developer’s Week-Long Nightmare Raises Concerns

Last Week in AI #308 – The Leaderboard Illusion, ChatGPT Glazing, Qwen 3, Ernie X1

Top News

The Leaderboard Illusion

OpenAI undoes its glaze-heavy ChatGPT update

Alibaba unveils Qwen 3, a family of ‘hybrid’ AI reasoning models

Baidu ERNIE X1 and 4.5 Turbo boast high performance at low cost

Other News

Tools

Business

Research

Concerns

Policy

Analysis

Explainers

Scaling Up Reinforcement Learning for Traffic Smoothing: A 100-AV Highway Deployment

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

CVE-2025-48127 – “App Cheap Push Notification Authorization Bypass”

CVE-2025-48468 – “Juniper JTAG Firmware Injection Vulnerability”

CVE-2025-43925 – Unicom Focal Point Data Encryption Key Hardcoded Vulnerability

Enterprise-grade natural language to SQL generation using LLMs: Balancing accuracy, latency, and scale

IBM introduces a mainframe for AI: The LinuxONE Emperor 5

Orbital Mechanics (or How I Optimized a CSS Keyframes Animation)

NVIDIA TensorRT-LLM High-Severity Vulnerability Let Attackers Remote Code

13 Best Free and Open Source Linux Camera Tools

Last Week in AI #308 – The Leaderboard Illusion, ChatGPT Glazing, Qwen 3, Ernie X1

Top News

Other News

Tools

Business

Research

Concerns

Policy

Analysis

Explainers

Related Posts