The rise of “open source” AI models: transparency and accountability in question

As the era of generative AI marches on, a broad range of companies have joined the fray, and the models themselves have become increasingly diverse.

Amidst this AI boom, many companies have touted their models as “open source,” but what does this really mean in practice?

The concept of open source has its roots in the software development community. Traditional open-source software makes the source code freely available for anyone to view, modify, and distribute.

In essence, open source is a collaborative, knowledge-sharing model that fuels software innovation and has led to developments like the Linux operating system, the Firefox web browser, and the Python programming language.

However, applying the open-source ethos to today’s massive AI models is far from straightforward.

These systems are often trained on vast datasets containing terabytes or petabytes of data, using complex neural network architectures with billions of parameters.

The computing resources required cost millions of dollars, the talent is scarce, and intellectual property is often well-guarded.

We can observe this in OpenAI, which, as its name suggests, began as an AI research lab largely dedicated to the open-source ethos.

However, that ethos quickly eroded once the company smelled the money and needed to attract investment to fuel its goals.

Why? Because open-source products are not geared towards profit, and AI is expensive and valuable.

However, as generative AI has exploded, organizations like Mistral, Meta, BigScience (the collaboration behind BLOOM), and xAI are releasing open-source models to further research while preventing companies like Microsoft and Google from hoarding too much influence.

But how many of these models are truly open-source in nature, and not just by name?

Clarifying how open open-source models really are

In a recent study, researchers Mark Dingemanse and Andreas Liesenfeld from Radboud University, Netherlands, analyzed numerous prominent AI models to explore how open they are. They studied multiple criteria, such as the availability of source code, training data, model weights, research papers, and APIs.

For example, Meta’s LLaMA model and Google’s Gemma were found to be simply “open weight,” meaning the trained model is publicly released for use without full transparency into its code, training process, data, and fine-tuning methods.

On the other end of the spectrum, the researchers highlighted BLOOM, a large multilingual model developed by a collaboration of over 1,000 researchers worldwide, as an exemplar of true open-source AI. Every element of the model is freely accessible for inspection and further research.

The paper assessed more than 30 models (both text and image). The six below illustrate the immense variation among those that claim to be open source:

BloomZ (BigScience): Fully open across all criteria, including code, training data, model weights, research papers, and API. Highlighted as an exemplar of truly open-source AI.
OLMo (Allen Institute for AI): Open code, training data, weights, and research papers. API only partially open.
Mistral 7B-Instruct (Mistral AI): Open model weights and API. Code and research papers only partially open. Training data unavailable.
Orca 2 (Microsoft): Partially open model weights and research papers. Code, training data, and API closed.
Gemma 7B instruct (Google): Partially open code and weights. Training data, research papers, and API closed. Described as “open” by Google rather than “open source”.
Llama 3 Instruct (Meta): Partially open weights. Code, training data, research papers, and API closed. An example of an “open weight” model without fuller transparency.

A comprehensive breakdown of ‘how open source’ various AI models are. Source: ACM Digital Library (open access)
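The six model summaries above can be captured in a small data structure to make the comparison concrete. The following is a minimal Python sketch, not the study's methodology: the numeric weights (open = 1, partial = 0.5, closed = 0), the `openness_score` function, and the equal weighting of the five criteria are illustrative assumptions. Only the open/partial/closed ratings come from the list above.

```python
from enum import Enum


class Level(Enum):
    # Numeric weights are an illustrative assumption, not taken from the study.
    CLOSED = 0.0
    PARTIAL = 0.5
    OPEN = 1.0


# The five criteria named in the article.
CRITERIA = ("code", "training_data", "weights", "papers", "api")


def profile(code, training_data, weights, papers, api):
    """Build a criterion -> Level mapping for one model."""
    return dict(zip(CRITERIA, (code, training_data, weights, papers, api)))


O, P, C = Level.OPEN, Level.PARTIAL, Level.CLOSED

# Ratings transcribed from the six summaries listed in the article.
MODELS = {
    "BloomZ": profile(O, O, O, O, O),
    "OLMo": profile(O, O, O, O, P),
    "Mistral 7B-Instruct": profile(P, C, O, P, O),
    "Orca 2": profile(C, C, P, P, C),
    "Gemma 7B instruct": profile(P, C, P, C, C),
    "Llama 3 Instruct": profile(C, C, P, C, C),
}


def openness_score(model_profile):
    """Sum the (assumed) weights across all criteria; the maximum is 5.0."""
    return sum(level.value for level in model_profile.values())


def ranked(models):
    """Return model names ordered from most to least open."""
    return sorted(models, key=lambda name: openness_score(models[name]), reverse=True)


if __name__ == "__main__":
    for name in ranked(MODELS):
        closed = [c for c, lvl in MODELS[name].items() if lvl is Level.CLOSED]
        print(f"{name}: {openness_score(MODELS[name])}/5.0, closed: {closed or 'none'}")
```

A single score hides which components are withheld, so the sketch also prints the closed criteria for each model. An "open weight" release and a fully open one can look similar by name while differing on exactly those components.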

A lack of transparency

The lack of transparency surrounding AI models, especially those developed by large tech companies, raises serious concerns about accountability and oversight.

Without full access to the model’s code, training data, and other key components, it becomes extremely challenging to understand how these models work and make decisions. This makes it difficult to identify and address potential biases, errors, or misuse of copyrighted material.

Copyright infringement in AI training data is a prime example of the problems that arise from this lack of transparency. Many proprietary AI models, such as GPT-3.5, GPT-4, GPT-4o, Claude 3, and Gemini, are likely trained on copyrighted material.

However, since training data is kept under lock and key, identifying which specific works it contains is nearly impossible.

The New York Times’s recent lawsuit against OpenAI demonstrates the real-world consequences of this challenge. OpenAI accused the NYT of using prompt engineering attacks to expose training data and coax ChatGPT into reproducing its articles verbatim, thus proving that OpenAI’s training data contains copyrighted material.

“The Times paid someone to hack OpenAI’s products,” stated OpenAI.

In response, Ian Crosby, the lead legal counsel for the NYT, said, “What OpenAI bizarrely mischaracterizes as ‘hacking’ is simply using OpenAI’s products to look for evidence that they stole and reproduced The Times’ copyrighted works. And that is exactly what we found.”

Indeed, this is just one example from a huge stack of lawsuits that are currently stalled, partly due to AI models’ opaque, impenetrable nature.

This is just the tip of the iceberg. Without robust transparency and accountability measures, we risk a future where unexplainable AI systems make decisions that profoundly impact our lives, economy, and society yet remain shielded from scrutiny.

Calls for openness

There have been calls for companies like Google and OpenAI to grant access to their models’ inner workings for the purposes of safety evaluation.

However, the truth is that even AI companies don’t truly understand how their models work.

This is called the “black box” problem, which arises when trying to interpret and explain the model’s specific decisions in a human-understandable way.

For example, a developer might know that a deep learning model is accurate and performs well, but they may struggle to pinpoint exactly which features the model uses to make its decisions.

Anthropic, which developed the Claude models, recently conducted an experiment to identify how Claude 3 Sonnet works, explaining, “We mostly treat AI models as a black box: something goes in and a response comes out, and it’s not clear why the model gave that particular response instead of another. This makes it hard to trust that these models are safe: if we don’t know how they work, how do we know they won’t give harmful, biased, untruthful, or otherwise dangerous responses? How can we trust that they’ll be safe and reliable?”

This experiment illustrated that AI developers don’t fully understand the black box that is their own models, and that objectively explaining outputs is an exceptionally tricky task.

In fact, Anthropic estimated that it would consume more computing power to ‘open the black box’ than to train the model itself!

Developers are actively attempting to combat the black-box problem through research like “Explainable AI” (XAI), which aims to develop techniques and tools to make AI models more transparent and interpretable.

XAI methods seek to provide insights into the model’s decision-making process, highlight the most influential features, and generate human-readable explanations. XAI has already been applied to models deployed in high-stakes applications such as drug development, where understanding how a model works could be pivotal for safety.

Open-source initiatives are vital to XAI and other research that seeks to penetrate the black box and provide transparency into AI models.

Without access to the model’s code, training data, and other key components, researchers cannot develop and test techniques to explain how AI systems truly work and identify specific data they were trained on.

Regulations might confuse the open-source situation further

The European Union’s recently passed AI Act is set to introduce new regulations for AI systems, with provisions that specifically address open-source models.

Under the Act, open-source general-purpose models up to a certain size will be exempt from extensive transparency requirements.

However, as Dingemanse and Liesenfeld point out in their study, the exact definition of “open source AI” under the AI Act is still unclear and could become a point of contention.

The Act currently defines open source models as those released under a “free and open” license that allows users to modify the model. However, it does not specify requirements around access to training data or other key components.

This ambiguity leaves room for interpretation and potential lobbying by corporate interests. The researchers warn that refining the open source definition in the AI Act “will probably form a single pressure point that will be targeted by corporate lobbies and big companies.”

There is a risk that without clear, robust criteria for what constitutes truly open-source AI, the regulations could inadvertently create loopholes or incentives for companies to engage in “open-washing”: claiming openness for the legal and public relations benefits while still keeping important aspects of their models proprietary.

Moreover, the global nature of AI development means differing regulations across jurisdictions could further complicate the landscape.

If major AI producers like the United States and China adopt divergent approaches to openness and transparency requirements, this could lead to a fragmented ecosystem in which the degree of openness varies widely depending on where a model originates.

The study authors emphasize the need for regulators to engage closely with the scientific community and other stakeholders to ensure that any open-source provisions in AI legislation are grounded in a deep understanding of the technology and the principles of openness.

As Dingemanse and Liesenfeld conclude in a discussion with Nature, “It’s fair to say the term open source will take on unprecedented legal weight in the countries governed by the EU AI Act.”

How this plays out in practice will have momentous implications for the future direction of AI research and deployment.

The post The rise of “open source” AI models: transparency and accountability in question appeared first on DailyAI.
