August 23, 2024

Naga Vydyanathan

What is Open Source AI, Exactly?

Multimodal AI

Large Language Models

Data Science

Back in the 1980s, a visionary named Richard Stallman started the GNU project, igniting the free software movement that encouraged developers to share and collaborate on their code and later gave rise to the open source movement. Fast forward to today, and this open-source spirit has begun to permeate the world of AI. Tech giants like Google, Facebook, and Microsoft are opening their AI playbooks, joining independent developers in a collective effort to advance artificial intelligence. This trend is gaining momentum because transparency in AI isn't just beneficial; it's crucial. Open source AI aims to allow us to scrutinise, understand, and trust the algorithms shaping our future. However, is open source AI truly the same as open source software? While the principles of openness and collaboration remain, the complexities of AI introduce unique challenges. What exactly is open source AI, and why is it so important? Let’s delve into the distinctions and the compelling case for embracing open source in the AI domain.

OSI’s Open Source AI Definition

According to the latest draft of the Open Source AI definition by the Open Source Initiative, an open source AI system is one that is made available under terms that grant the “4 freedoms” of open source software:

Freedom to Use: The system can be used for any purpose without seeking permission.
Freedom to Study: Users can study how the system works and inspect its components.
Freedom to Modify: The system can be modified for any purpose, including changing its output.
Freedom to Share: The system can be shared with others, with or without modifications, for any purpose.

For traditional software, open code and clear, adequate documentation can readily grant these freedoms. However, the same might not hold true for AI systems, which are far more complex. Can the principles of open source be seamlessly applied to AI, or do we need a new approach to accommodate its intricacies?

What is ‘Openness’ in AI?

The term 'open' refers to systems that offer transparency, reusability, and extensibility. For traditional software, this often means providing open code and clear documentation, which are sufficient to achieve these goals. However, AI systems are more complex, involving additional layers such as model architectures, parameters, weights, training processes and datasets. This introduces a key difference between 'openness' in AI and in other types of software. Consequently, openness in AI must be defined individually for each of these layers.

The Open Source Initiative defines openness in AI systems through the required and optional components listed below, grouped into data information, code, and model, and covering each layer of an AI system: data, training, and inference. While OSI has classified the training dataset as an optional component, this decision has sparked considerable debate. Moreover, such a broad scope of what constitutes ‘openness’ raises the question of which AI systems can genuinely be considered open.

Required components	Legal frameworks
Data information
- Training methodologies and techniques	Available under OSD-compliant license
- Training data scope and characteristics	Available under OSD-compliant license
- Training data provenance (including how data was obtained and selected)	Available under OSD-compliant license
- Training data labeling procedures, if used	Available under OSD-compliant license
- Training data cleaning methodology	Available under OSD-compliant license
Code
- Data pre-processing	Available under OSI-approved license
- Training, validation and testing	Available under OSI-approved license
- Inference	Available under OSI-approved license
- Supporting libraries and tools	Available under OSI-approved license
Model
- Model architecture	Available under OSI-approved license
- Model parameters	Available under OSD-conformant terms

Optional components	Legal frameworks
Data information: all data sets, including	Available under OSD-compliant license
- Training data sets	Available under OSD-compliant license
- Testing data sets	Available under OSD-compliant license
- Validation data sets	Available under OSD-compliant license
- Benchmarking data sets	Available under OSD-compliant license
- Data card	Available under OSD-compliant license
- Evaluation data	Available under OSD-compliant license
- Evaluation results	Available under OSD-compliant license
- Other data documentation	Available under OSD-compliant license
Code
- Code used to perform inference for benchmark tests	Available under OSI-approved license
- Evaluation code	Available under OSI-approved license
Model: all model elements, including
- Model card	Available under OSD-compliant license
- Sample model outputs	Available under OSD-compliant license
- Model metadata	Available under OSD-compliant license
Other: any other documentation or tools produced or used, including
- Research papers	Available under OSD-compliant license
- Technical report	Available under OSD-compliant license
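
One way to make the required-components table concrete is to treat it as a checklist and compare a release against it. The following is a minimal sketch in Python, not an official OSI tool: the function name `missing_required` and the example release are hypothetical, and the check ignores licensing terms and the "if used" qualifier on labeling procedures.

```python
# Required components from the table above, grouped as in the OSI draft.
REQUIRED_COMPONENTS = {
    "data information": [
        "training methodologies and techniques",
        "training data scope and characteristics",
        "training data provenance",
        "training data labeling procedures",  # required only if labeling was used
        "training data cleaning methodology",
    ],
    "code": [
        "data pre-processing",
        "training, validation and testing",
        "inference",
        "supporting libraries and tools",
    ],
    "model": [
        "model architecture",
        "model parameters",
    ],
}


def missing_required(released):
    """Return required components absent from a release, keyed by group.

    `released` maps a group name to the set of component names that a
    project has published (hypothetical input format for this sketch).
    """
    gaps = {}
    for group, components in REQUIRED_COMPONENTS.items():
        published = released.get(group, set())
        absent = [c for c in components if c not in published]
        if absent:
            gaps[group] = absent
    return gaps


# Hypothetical "open weights" release: parameters and inference code are
# published, but training code and data information are not.
open_weights_release = {
    "code": {"inference", "supporting libraries and tools"},
    "model": {"model architecture", "model parameters"},
}

gaps = missing_required(open_weights_release)
for group, absent in gaps.items():
    print(f"{group}: missing {', '.join(absent)}")
```

For this hypothetical release, the model group passes, while the data information and code groups both report gaps. That is the pattern behind the debate over whether open weights alone make a system open.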

The Case for Open Source AI - Does ‘Open’ AI Fulfil All of Its Promises?

The concept of open source has been a powerful force in the software industry, driving innovation, collaboration, and accessibility. Do these translate to AI systems too?

Does ‘Open’ AI Drive Innovation, Collaboration and Adoption?

Open-source AI fosters innovation by enabling widespread collaboration and idea-sharing. By making AI technologies accessible to a broader audience, it allows researchers, developers, and enthusiasts to contribute to and build upon existing work, accelerating the pace of innovation. This is evident in the rapid development of generative adversarial networks (GANs) and transformer models like BERT and GPT.

TensorFlow and PyTorch have revolutionised AI research by providing powerful frameworks that simplify the development process. These frameworks allow researchers to focus on model architecture and data, rather than low-level infrastructure, leading to faster prototyping and experimentation.

Additionally, these frameworks enable the customization of models for different domains and use cases. In healthcare, TensorFlow has been used to develop AI models for diagnosing diabetic retinopathy from retinal images, improving early detection and treatment. In the financial sector, PyTorch has facilitated the development of sophisticated models for fraud detection and algorithmic trading, providing real-time insights and decision-making capabilities.

Some examples of innovation in AI through open source are given below.

Project	Year	Innovation	Impact
Stable Diffusion	2022	Breakthrough in image generation using diffusion models	Enabled widespread experimentation and creative applications in art generation and image enhancement.
OpenAI's CLIP	2021	Combined language and image models for multimodal applications	Facilitated diverse applications, including image search engines and accessibility tools, by making the model and code publicly available.
Hugging Face's Transformers	2019-Present	Democratized access to state-of-the-art NLP models	Became a cornerstone for NLP research and development, allowing rapid experimentation and deployment of solutions across various domains.
Open Catalyst Project	2020-Present	Accelerating the discovery of new catalysts for clean energy applications using AI	Invited global collaboration, speeding up research and fostering innovation in sustainable energy solutions.
Mozilla's Common Voice	2017-Present	Platform for collecting and open-sourcing diverse voice data	Enabled the development of better and more inclusive speech recognition systems, benefiting applications from virtual assistants to accessibility tools.

Does ‘Open’ AI Democratize Technology?

Open-source AI strives to democratize technology by providing broader access to advanced tools and models, benefiting researchers, developers, and startups. Platforms such as Hugging Face’s Transformers and TensorFlow facilitate innovation by offering powerful frameworks that allow users to experiment and build on state-of-the-art technologies. This increased accessibility lowers barriers to entry and promotes collaboration, enabling faster advancements and diverse applications.

Despite these benefits, significant challenges remain. Developing and deploying large-scale AI systems requires substantial computational resources and expertise, which can limit widespread access and participation. Moreover, the process of building and training AI models involves important editorial decisions that affect performance and ethical considerations. Open-source access does not always address these disparities or the complexity involved in making these choices.
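
To give a rough sense of the computational resources involved, the sketch below estimates the memory needed just to hold a model's weights. It assumes 16-bit weights (2 bytes per parameter) and uses illustrative model sizes that are not tied to any specific release; the function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate memory needed just to hold model weights, in gigabytes.

    Assumes 16-bit weights by default (2 bytes each) and ignores activations,
    optimizer state, and serving overhead, so real requirements are higher.
    """
    return num_parameters * bytes_per_parameter / 1e9


# Illustrative model sizes, not tied to any specific release.
for label, n in [("7B", 7_000_000_000), ("70B", 70_000_000_000)]:
    print(f"{label} parameters: ~{weight_memory_gb(n):.0f} GB for weights alone")
```

Under these assumptions, a 7-billion-parameter model needs roughly 14 GB just to load its weights, and a 70-billion-parameter model needs roughly 140 GB. Training typically requires considerably more, which is why freely available weights do not by themselves put large models within everyone's reach.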

Additionally, while open-source AI can enhance inclusivity and innovation, it does not inherently reduce the costs associated with deploying AI systems at scale. The practical challenges of managing infrastructure and operational expenses persist, making it difficult for many organizations to fully leverage open-source technologies in production environments.

Does ‘Open’ AI Enable ‘Safe’ and ‘Ethical’ AI?

Transparency is a key principle of open AI, fostering trust and accountability by allowing scrutiny of algorithms, data, and training processes. Open-source AI projects provide access to AI models, enabling public examination to ensure adherence to ethical standards and alignment with societal values. Additionally, open-source AI enhances safety through diverse contributions that identify and address risks, leading to more secure and robust systems. For example, initiatives like TrustyAI and InstructLab are working to improve AI alignment and safety through collaborative, open-source approaches.

However, transparency has its limits. Unlike with traditional software, access to an AI system's code and documentation does not always reveal how a model will perform in specific contexts, nor does it predict the model's emergent properties. Issues like "hallucinations," where AI generates inaccurate or misleading information, underscore that innovation without security is simply risk. Comprehensive understanding of these complexities often requires more than just access to code and documentation.

What is Not Open in Today’s Open Source AI Systems?

Despite the open-source nature of many AI projects, several critical elements often remain inaccessible:

Training Data

OpenAI's GPT-3 and GPT-4, while accessible through APIs, do not release their training datasets, which limits understanding of the models' knowledge base and raises concerns about the diversity and representativeness of the data used. Similarly, Meta’s Llama family of large language models has been released with open weights, yet the specifics of the training datasets for later versions such as Llama 2 remain undisclosed, restricting complete transparency and insight into the data that informed their development.

Model Weights and Parameters

Google’s BERT model was released with open code and pre-trained weights, but fine-tuned versions or specific variants used in Google’s applications often do not have their weights or parameters publicly available. This limitation hampers the ability to fully replicate or build upon Google’s proprietary enhancements. Similarly, while DeepMind’s AlphaFold model for protein folding is available, some fine-tuned variants and subsequent improvements from ongoing research are not fully disclosed, affecting the model’s replicability and the ability to leverage these advancements comprehensively.

Training Processes

While Hugging Face’s Transformers library offers pre-trained models and code and updates them regularly, the detailed training procedures, hyperparameters, and computational resources behind many of those models are often not fully disclosed, which can hinder replication and further experimentation. Similarly, while OpenAI’s CLIP model provides its core architecture and code, the exact training procedures and hyperparameters are not fully published, limiting insights into the model’s development process.

Proprietary Enhancements

Databricks' MLflow, an open-source platform for managing the machine learning lifecycle, offers extensive features, but Databricks' commercial versions may include proprietary enhancements and optimizations not available in the open-source release. Similarly, while EleutherAI’s GPT-Neo models are open-source, there may be additional proprietary modifications or enhancements developed for specific applications that are not fully disclosed. This disparity between open-source and proprietary versions limits the ability to fully understand and replicate the most advanced features and optimizations.

Model Performance in Different Contexts

OpenAI’s Codex, which originally powered GitHub Copilot, was made available through an API rather than as open source, and its performance details and specific behaviors across diverse coding environments or use cases are not fully transparent. Similarly, while Anthropic publishes general design principles for Claude, detailed performance metrics and evaluations in varied real-world scenarios are not always fully disclosed. This lack of comprehensive transparency can hinder the ability to fully assess and understand the models' real-world effectiveness and adaptability.

Ethical and Safety Considerations

Google DeepMind’s Gato model is described in a published paper, yet the model itself has not been openly released, and detailed ethical assessments and safety evaluations are often not fully available, leaving potential biases and risks less transparent. Similarly, while OpenAI’s DALL-E provides access to its core technology, comprehensive evaluations of its biases, safety protocols, and potential risks are not fully detailed publicly. This lack of thorough disclosure can obscure critical insights into the models' safety and ethical implications.

So, What’s the Conclusion?

Open-source AI is undeniably transforming the tech landscape, fueling innovation and expanding access like never before. However, for its full potential to be realised, we must confront its current limitations. While transparency in AI models fosters trust, many projects still keep key details about training data, performance, and safety under wraps. Additionally, true democratisation hinges on making both code and compute resources widely accessible—a challenge that remains. Moving forward, embracing greater openness and ensuring equitable access to computing power will be crucial in making AI not only groundbreaking but also fair, secure, and beneficial for everyone.
