FLOSS Research
Open Data and Open Source AI: Charting a course to get more of both
While working to define Open Source AI, we realized that data governance is an unresolved issue. The Open Source Initiative organized a workshop to discuss data sharing and governance for AI training. The critical question posed to attendees was “How can we best govern and share data to power Open Source AI?” The main objective of this workshop was to establish specific approaches and strategies for both Open Source AI developers and other stakeholders.
The Workshop: Building bridges across “Open” streams
Held on October 10-11, 2024, and hosted by Linagora’s Villa Good Tech, the OSI workshop brought together 20 experts from diverse fields and regions. Funded by the Alfred P. Sloan Foundation, the event focused on actionable steps to align open data practices with the goals of Open Source AI.
Participants, listed below, comprised academics, civil society leaders, technologists, and representatives from organizations like Mozilla Foundation, Creative Commons, EleutherAI Institute and others.
- Ignatius Ezeani University of Lancaster / Nigeria
- Masayuki Hatta Debian, Open Source Group Japan / Japan
- Aviya Skowron EleutherAI Institute / Poland
- Stefano Zacchiroli Software Heritage / Italy
- Ricardo Torres Digital Public Goods Alliance / Mexico
- Kristina Podnar Data and Trust Alliance / Croatia + USA
- Joana Varon Coding Rights / Brazil
- Renata Avila Open Knowledge Foundation / Guatemala
- Alek Tarkowski Open Future / Poland
- Maximilian Gantz Mozilla Foundation / Germany
- Stefaan Verhulst GovLab / USA/Belgium
- Paul Keller Open Future / Germany
- Thom Vaughan Common Crawl / UK
- Julie Hunter Linagora / USA
- Deshni Govender GIZ FAIR Forward – AI for All / South Africa
- Ramya Chandrasekhar CNRS / India
- Anna Tumadóttir Creative Commons / Iceland
- Stefano Maffulli Open Source Initiative / Italy
Over two days, the group worked to frame a cohesive approach to data governance. Alek Tarkowski and Paul Keller of the Open Future Foundation are working with OSI to complete the white paper summarizing the group’s work. In the meantime, here is a quick “tease”—just a few of the many topics that the group discussed:
The streams of “open” merge, creating waves
AI is where Open Source software, open data, open knowledge, and open science meet in a new way. Since OpenAI released ChatGPT, what once were largely parallel tracks with occasional junctures are now a turbulent merger of streams, creating ripples in all of these disciplines and forcing us to reassess our principles: How do we merge these streams without eroding the principles of transparency and access that define openness?
We discovered in the process of defining Open Source AI that the basic freedoms we’ve put in the Open Source Definition and its foundation, the Free Software Definition, are still good and relevant. Open Source software has had decades to mature into a structured ecosystem with clear rules, tools, and legal frameworks. Same with Open Knowledge and Open Science: While rooted in age-old traditions, open knowledge and science have seen modern rejuvenation through platforms like Wikipedia and the Open Knowledge Foundation. Open data, however, feels less solid: often serving as a one-way pipeline from public institutions to private profiteers, it is now being dragged into whole new territory.
How are these principles of “open” interacting with each other, how are we going to merge Open Data with Open Source with Open Science and Open Knowledge in Open Source AI?
The broken social contract of data
Data fuels AI. The sheer scale of data required to train models like ChatGPT reveals not just a technological challenge but also a societal dilemma. Much of this data comes from us—the blogs we write, the code we share, the information we give freely to platforms.
OpenAI, for example, “slurps” all the data it can find, and much of it is what we willingly give: the blogs we write; the code we share; the pictures, emails and address books we keep in “the cloud”; and all the other information we give freely to platforms.
We, the people, make the “data,” but what are we getting in exchange? OpenAI owns and controls the machine built with our data, and it grants us access via API, until it changes its mind. We are essentially being strip-mined for a proprietary system that grants access at a price—until the owner decides otherwise.
We need a different future, one where data empowers communities, not just corporations. That starts with revisiting the principles of openness that underpin the open source, open science, and open knowledge movements. The question is: How do we take back control?
Charting a path forward
We want the machine for ourselves. We want machines that the people can own and control. We need to find a way to swing the pendulum back to our meaning of Open. And it’s all about the “data.”
The OSI’s work on the Open Source AI Definition provides a starting point. An Open Source AI machine is one that the people can meaningfully fork without having to ask for permission. For AI to truly be open, developers need access to the same tools and data as the original creators. That means transparent training processes, open filtering code, and, critically, open datasets.
Group photo of the participants in the workshop on data governance, Paris, Oct 2024.
Next steps
The white paper, expected in December, will synthesize the workshop’s discussions and propose concrete strategies for data governance in Open Source AI. Its goal is to lay the groundwork for an ecosystem where innovation thrives without sacrificing openness or equity.
As the lines between “open” streams continue to blur, the choices we make now will define the future of AI. Will it be a tool controlled by a few, or a shared resource for all?
The answer lies in how we navigate the waves of data and openness. Let’s get it right.
The Open Source Initiative and the Eclipse Foundation to Collaborate on Shaping Open Source AI (OSAI) Public Policy
BRUSSELS and WEST HOLLYWOOD, Calif. – 14 November 2024 – The Eclipse Foundation, one of the world’s largest open source foundations, and the Open Source Initiative (OSI), the global non-profit educating about and advocating for the benefits of open source and steward of the Open Source Definition, have signed a Memorandum of Understanding (MOU) to collaborate on promoting the interests of the open source community in the implementation of regulatory initiatives on Open Source Artificial Intelligence (OSAI). This agreement underscores the two organisations’ shared commitment to ensuring that emerging AI regulations align with widely recognised OSI open source definitions and open source values and principles.
“AI is arguably the most transformative technology of our generation,” said Stefano Maffulli, executive director, Open Source Initiative. “The challenge now is to craft policies that not only foster growth of AI but ensure that Open Source AI thrives within this evolving landscape. Partnering with the Eclipse Foundation, with its experience in European open source development and regulatory compliance, is important to shaping the future of Open Source AI.”
“For decades, OSI has been the ‘gold standard’ the open source community has turned to for building consensus around important issues,” said Mike Milinkovich, executive director of the Eclipse Foundation. “As AI reshapes industries and societies, there is no more pressing issue for the open source community than the regulatory recognition of open source AI systems. Our combined expertise – OSI’s global leadership in open standards and open source licences and our extensive work with open source regulatory compliance – makes this partnership a powerful advocate for the design and implementation of sound AI policies worldwide.”
Addressing the Global Challenges of AI Regulation
With AI regulation on the horizon in multiple regions, including the EU, both organisations recognise the urgency of helping policymakers understand the unique challenges and opportunities of OSAI technologies. The rapid evolution of AI technologies, together with new and increasingly complex regulatory landscapes, demands clear, consistent, and aligned guidance rooted in open source principles.
Through this partnership, the Eclipse Foundation and OSI will endeavour to bring clarity to the language and terms that industry, community, civil society, and policymakers can rely upon as public policy is drafted and enforced. The organisations will collaborate by leveraging their respective public platforms and events to raise awareness and advocate on the topic. Additionally, they will work together on joint publications, presentations, and other promotional activities, while also assisting one another in educating government officials on policy considerations for OSAI and General Purpose AI (GPAI).
Key Areas of Collaboration
The MoU outlines several areas of cooperation, including:
- Information Exchange: OSI and the Eclipse Foundation will share relevant insights and information related to public policy-making and regulatory activities on artificial intelligence.
- Representation to Policymakers: OSI and the Eclipse Foundation will cooperate in representing the principles and values of open source licences to policymakers and civil society organisations.
- Promotion of Open Source Principles: Joint efforts will be made to raise awareness of the role of open source in AI, emphasising how it can foster innovation while mitigating risks.
A Partnership for the Future
As AI continues to revolutionise industries worldwide, the need for thoughtful, balanced regulation is critical. The OSI and Eclipse Foundation are committed to providing the open source community, industry leaders, and policymakers with the tools and knowledge they need to navigate this rapidly evolving field.
This MoU marks the very beginning of a long-term collaboration, with joint initiatives and activities to be announced throughout the remainder of 2024 and into 2025.
About the Eclipse Foundation
The Eclipse Foundation provides our global community of individuals and organisations with a business-friendly environment for open source software collaboration and innovation. We host the Eclipse IDE, Adoptium, Software Defined Vehicle, Jakarta EE, and over 420 open source projects, including runtimes, tools, specifications, and frameworks for cloud and edge applications, IoT, AI, automotive, systems engineering, open processor designs, and many others. Headquartered in Brussels, Belgium, the Eclipse Foundation is an international non-profit association supported by over 385 members. To learn more, follow us on social media @EclipseFdn, LinkedIn, or visit eclipse.org.
About the Open Source Initiative
Founded in 1998, the Open Source Initiative (OSI) is a non-profit corporation with global scope formed to educate about and advocate for the benefits of Open Source and to build bridges among different constituencies in the Open Source community. It is the steward of the Open Source Definition, setting the foundation for the global Open Source ecosystem. Join and support the OSI mission today at https://opensource.org/join.
Third-party trademarks mentioned are the property of their respective owners.
###
Media contacts:
Schwartz Public Relations (Germany)
Gloria Huppert/Marita Bäumer
Sendlinger Straße 42A
80331 Munich
EclipseFoundation@schwartzpr.de
+49 (89) 211 871 -70/ -62
514 Media Ltd (France, Italy, Spain)
Benoit Simoneau
benoit@514-media.com
M: +44 (0) 7891 920 370
Nichols Communications (Global Press Contact)
Jay Nichols
jay@nicholscomm.com
+1 408-772-1551
ClearlyDefined v2.0 adds support for LicenseRefs
One of the major focuses of the ClearlyDefined Technical Roadmap is improving the quality of license data. As such, we are excited to announce the release of ClearlyDefined v2.0, which adds support for identifying over 2,000 additional well-known licenses. You can see the complete list of new non-SPDX licenses in ScanCode LicenseDB.
A little historical background: when ClearlyDefined was first created, it was decided to limit the reported licenses to only those on the SPDX License List. As teams worked with the ClearlyDefined data, it became clear that additional license discovery is important to give users a fuller picture of the projects they depend on. In previous releases of ClearlyDefined, licenses not on the SPDX License List were represented in the definition as NOASSERTION or OTHER. (See the breakdown of licenses in The most popular licenses for each language in 2023.)

The v2.0 release of ClearlyDefined includes an update of ScanCode to v32 and support for LicenseRefs to identify non-SPDX licenses. If ScanCode identifies a non-SPDX license, the license in the definition will now be a LicenseRef with the prefix LicenseRef-scancode-. This improves the license coverage in the ClearlyDefined definitions and consumers’ ability to accurately construct license compliance policies.
ClearlyDefined identifies licenses in definitions using SPDX expressions. The SPDX specification has a way to include non-SPDX licenses in license expressions.
A license expression could be a single license identifier found on the SPDX License List; a user defined license reference denoted by the LicenseRef-[idString]; a license identifier combined with an SPDX exception; or some combination of license identifiers, license references and exceptions constructed using a small set of defined operators (e.g., AND, OR, WITH and +)
— excerpt from SPDX Annexes: SPDX license expressions
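To illustrate how LicenseRef tokens appear inside such expressions, here is a minimal Python sketch. This is not the official SPDX tooling: the tokenizer is deliberately simplified (it ignores `+` and operator precedence), and the example expression is hypothetical.

```python
# Simplified illustration, not a full SPDX expression parser:
# split an expression on the AND / OR / WITH operators and
# classify each remaining token as an SPDX ID or a LicenseRef.
OPERATORS = {"AND", "OR", "WITH"}

def expression_tokens(expression: str) -> list[str]:
    """Return the license identifiers in a simple SPDX expression."""
    cleaned = expression.replace("(", " ").replace(")", " ")
    return [t for t in cleaned.split() if t not in OPERATORS]

def is_license_ref(token: str) -> bool:
    """True if the token is a user-defined license reference."""
    return token.startswith("LicenseRef-")

expr = "MIT AND LicenseRef-scancode-proprietary-license"
tokens = expression_tokens(expr)
print([t for t in tokens if is_license_ref(t)])
# ['LicenseRef-scancode-proprietary-license']
```

A consumer walking a definition’s license expression this way can separate SPDX-listed licenses from the new ScanCode LicenseRefs before applying policy.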
Example change of a definition:
| Coordinates | License BEFORE | License AFTER |
| --- | --- | --- |
| npm/npmjs/@alexa-games/sfb-story-debugger/2.1.0 | NOASSERTION | LicenseRef-.amazon.com.-AmznSL-1.0 |
What does this mean for definitions?
This section includes a simplified description of what happens when you request a definition from ClearlyDefined. These examples only refer to the ScanCode tool. Other tools are run as well and are handled in similar ways.
When the definition already exists
Any request for a definition through the /definitions API makes a couple of checks before returning the definition:
If the definition exists, it checks whether the definition was created using the latest version of the ClearlyDefined service.
- If yes, it returns the definition as is.
- If not, it recomputes the definition using the existing raw results from the tools run during the previous harvest for the existing definition. In this case, the tool version will be earlier than ScanCode v32.
NOTE: ClearlyDefined does not support LicenseRefs from ScanCode prior to v32. For earlier versions of ScanCode, ClearlyDefined stores any LicenseRefs as NOASSERTION. In some cases, you may see OTHER when the definition was curated.
When the definition does not exist
If the definition does not exist:
- It will send a harvest request which will run the latest version of all the tools and produce raw results.
- From these raw results, it will compute a definition which might include a LicenseRef.
If you see NOASSERTION in the license expression, you can check the definition to determine the version of ScanCode in the “described”: “tools” section.
If ScanCode is a version earlier than v32, you can submit a harvest API request. This will run any tools for which ClearlyDefined now supports a later version. Once the tools complete, the definition will be recomputed based on the new results.
In some cases, even when the results are from ScanCode v32, you may still see NOASSERTION. Reharvesting when the ScanCode version is already v32 will not change the definition.
What does this mean for tools?
When adding ScanCode licenses to allow/deny lists, note the ScanCode LicenseDB lists licenses without the LicenseRef prefix. All LicenseRefs coming from ScanCode will start with LicenseRef-scancode-.
Tools using an Allow List
A recomputed definition may change the license to include a LicenseRef that you want to allow. All new LicenseRefs that are acceptable will need to be added to your allow list. We are taking the approach of adding them as they appear in flagged package-version licenses. An alternative is to review the ScanCode LicenseDB to proactively add LicenseRefs to your allow list.
Tools using a Deny List
Deny lists need to be exhaustive to prevent a new license from being allowed by default. It is recommended that you review the ScanCode LicenseDB to determine if there are LicenseRefs you want to add to the deny list.
Note: The SPDX License List also changes over time. A periodic review to maintain the Deny list is always a good idea.
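A minimal sketch of such an allow-list check, under the assumption stated above that the allow list stores ScanCode LicenseDB keys without the prefix (the license names used here are illustrative):

```python
# Hedged sketch of an allow-list check. ScanCode LicenseDB lists license
# keys without the prefix, so "LicenseRef-scancode-<key>" is normalized
# down to "<key>" before comparing; SPDX identifiers pass through as-is.
SCANCODE_PREFIX = "LicenseRef-scancode-"

def normalize(license_id: str) -> str:
    if license_id.startswith(SCANCODE_PREFIX):
        return license_id[len(SCANCODE_PREFIX):]
    return license_id

def is_allowed(license_id: str, allow_list: set[str]) -> bool:
    return normalize(license_id) in allow_list

allow = {"MIT", "Apache-2.0", "proprietary-license"}
print(is_allowed("MIT", allow))                                      # True
print(is_allowed("LicenseRef-scancode-proprietary-license", allow))  # True
print(is_allowed("LicenseRef-scancode-other-permissive", allow))     # False
```

A deny-list check would invert the logic but, as noted above, must be kept exhaustive: anything not matched falls through as allowed by default.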
Providing Feedback
As with any major version change, there can be unexpected behavior. You can reach out with questions, feedback, or requests. Find how to get in touch with us in the Get Involved doc.
If you have comments or questions on the actual LicenseRefs, you should reach out to ScanCode License DB maintainers.
Acknowledgements
A huge thank you to the contributing developers and their organizations for supporting the work of ClearlyDefined.
In alphabetical order, contributors were…
- ajhenry (GitHub)
- brifl (Microsoft)
- elrayle (GitHub)
- jeff-luszcz (GitHub)
- ljones140 (GitHub)
- lumaxis (GitHub)
- mpcen (Microsoft)
- nickvidal (Open Source Initiative)
- qtomlinson (SAP)
- RomanIakovlev (GitHub)
- yashkohli88 (SAP)
See something you’d like ClearlyDefined to do or could do better? If you have resources to help out, we have work to be done to further improve data quality, performance, and sustainability. We’d love to hear from you.
References

ClearlyDefined at SOSS Fusion 2024: a collaborative solution to Open Source license compliance
This past month, the Open Source Security Foundation (OpenSSF) hosted SOSS Fusion in Atlanta, an event that brought together a diverse community of leaders and innovators from across the digital security spectrum. The conference, held on October 22-23, explored themes central to today’s technological landscape: AI security, diversity in technology, and public policy for Open Source software. Industry thought leaders like Bruce Schneier, Marten Mickos, and Cory Doctorow delivered keynotes, setting the tone for a conference that emphasized collaboration and community in creating a secure digital future.
Amidst these pressing topics, the Open Source Initiative in collaboration with GitHub and SAP presented ClearlyDefined—an innovative project aimed at simplifying software license compliance and metadata management. Presented by Nick Vidal of the Open Source Initiative, along with E. Lynette Rayle from GitHub and Qing Tomlinson from SAP, the session highlighted how ClearlyDefined is transforming the way organizations handle licensing compliance for Open Source components.
What is ClearlyDefined?
ClearlyDefined is a project with a powerful vision: to create a global crowdsourced database of license metadata for every software component ever published. This ambitious mission seeks to help organizations of all sizes easily manage compliance by providing accurate, up-to-date metadata for Open Source components. By offering a single, reliable source for license information, ClearlyDefined enables organizations to work together rather than in isolation, collectively contributing to the metadata that keeps Open Source software compliant and accessible.
The problem: redundant and inconsistent license management
In today’s Open Source ecosystem, managing software licenses has become a significant challenge. Many organizations face the repetitive task of identifying, correcting, and maintaining accurate licensing data. When one component has missing or incorrect metadata, dozens—or even hundreds—of organizations using that component may duplicate efforts to resolve the same issue. ClearlyDefined aims to eliminate redundancy by enabling a collaborative approach.
The solution: crowdsourcing compliance with ClearlyDefined
ClearlyDefined provides an API and user-friendly interface that make it easy to access and contribute license metadata. By aggregating and standardizing licensing data, ClearlyDefined offers a powerful solution for organizations to enhance SBOMs (Software Bill of Materials) and license information without the need for extensive re-scanning and data correction. At the conference, Nick demonstrated how developers can quickly retrieve license data for popular libraries using a simple API call, making license compliance seamless and scalable.
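As a sketch of what such an API call looks like, the /definitions endpoint takes a component’s coordinates (type/provider/namespace/name/revision). The helper below builds that URL and, optionally, fetches the declared license; the `licensed.declared` field name is our reading of the definition format, and the lodash coordinates are just an example.

```python
import json
import urllib.request

API_BASE = "https://api.clearlydefined.io/definitions"

def definition_url(ctype, provider, namespace, name, revision):
    """Build the /definitions URL for a component's coordinates.
    ClearlyDefined uses '-' for an empty namespace (e.g. unscoped npm packages)."""
    return f"{API_BASE}/{ctype}/{provider}/{namespace or '-'}/{name}/{revision}"

def fetch_declared_license(ctype, provider, namespace, name, revision):
    """Fetch a definition over HTTP and return its declared license expression.
    (Requires network access; field layout assumed from the definition format.)"""
    url = definition_url(ctype, provider, namespace, name, revision)
    with urllib.request.urlopen(url) as response:
        definition = json.load(response)
    return definition.get("licensed", {}).get("declared")

# Example coordinates for lodash 4.17.21 on npm:
print(definition_url("npm", "npmjs", None, "lodash", "4.17.21"))
# https://api.clearlydefined.io/definitions/npm/npmjs/-/lodash/4.17.21
```

Dropping that URL into a browser or `curl` returns the full definition JSON, which includes both the described facts (source location, tools run) and the licensed facts (declared and discovered licenses).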
In addition, organizations that encounter incomplete or incorrect metadata can easily update it through ClearlyDefined’s platform, creating a feedback loop that benefits the entire Open Source community. This crowdsourcing approach means that once an organization fixes a licensing issue, that data becomes available to all, fostering efficiency and accuracy.
Key components of ClearlyDefined’s platform
1. API and User Interface: Users can access ClearlyDefined data through an API or the website, making it simple for developers to integrate license checks directly into their workflows.
2. Human curation and community collaboration: To ensure high data quality, ClearlyDefined employs a curation workflow. When metadata requires updates, community members can submit corrections that go through a human review process, ensuring accuracy and reliability.
3. Integration with popular package managers: ClearlyDefined supports various package managers, including npm and pypi, and has recently expanded to support Conda, a popular choice among data science and AI developers.
Real-world use cases: GitHub and SAP’s adoption of ClearlyDefined
During the presentation, representatives from GitHub and SAP shared how ClearlyDefined has impacted their organizations.
– GitHub: ClearlyDefined’s licensing data powers GitHub’s compliance solutions, allowing GitHub to manage millions of licenses with ease. Lynette shared how they initially onboarded over 17 million licenses through ClearlyDefined, a number that has since grown to over 40 million. This database enables GitHub to provide accurate compliance information to users, significantly reducing the resources required to maintain licensing accuracy. Lynette showcased the harvesting and curation processes. More details about how GitHub is using ClearlyDefined are available here.
– SAP: Qing discussed how ClearlyDefined’s approach has streamlined SAP’s Open Source compliance efforts. By using ClearlyDefined’s data, SAP reduced the time spent on license reviews and improved the quality of metadata available for compliance checks. SAP’s internal harvesting service integrates with ClearlyDefined, ensuring that critical license metadata is consistently available and accurate. SAP has contributed to the ClearlyDefined project and, most notably, together with Microsoft, has optimized the database schema and reduced the database operational cost by more than 90%. More details about how SAP is using ClearlyDefined are available here.
Why ClearlyDefined matters
ClearlyDefined is a community-driven initiative with a vision to address one of Open Source’s biggest challenges: ensuring accurate and accessible licensing metadata. By centralizing and standardizing this data, ClearlyDefined not only reduces redundant work but also fosters a collaborative approach to license compliance.
The platform’s Open Source nature and integration with existing package managers and APIs make it accessible and scalable for organizations of all sizes. As more contributors join the effort, ClearlyDefined continues to grow, strengthening the Open Source community’s commitment to compliance, security, and transparency.
Join the ClearlyDefined community
ClearlyDefined is always open to new contributors. With weekly developer meetings, an open governance model, and continuous collaboration with OpenSSF and other Open Source organizations, ClearlyDefined provides numerous ways to get involved. For anyone interested in shaping the future of license compliance and data quality in Open Source, ClearlyDefined offers an exciting opportunity to make a tangible impact.
At SOSS Fusion, ClearlyDefined’s presentation showcased how an open, collaborative approach to license compliance can benefit the entire digital ecosystem, embodying the very spirit of the conference: working together toward a secure, inclusive, and sustainable digital future.
Download slides and see summarized presentation transcript below.
ClearlyDefined presentation transcript
Hello, folks, good morning! Let’s start by introducing ClearlyDefined, an exciting project. My name is Nick Vidal, and I work with the Open Source Initiative. With me today are Lynette Rayle from GitHub and Qing Tomlinson from SAP, and we’re all very excited to be here.
Introduction to ClearlyDefined’s mission
So, what’s the mission of ClearlyDefined? Our mission is ambitious—we aim to crowdsource a global database of license metadata for every software component ever published. This would benefit everyone in the Open Source ecosystem.
The problem ClearlyDefined addresses
There’s a critical problem in the Open Source space: compliance and managing SBOMs (Software Bill of Materials) at scale. Many organizations struggle with missing or incorrect licensing metadata for software components. When multiple organizations use a component with incomplete or wrong license metadata, they each have to solve it individually. ClearlyDefined offers a solution where, instead of every organization doing redundant work, we can collectively work on fixing these issues once and make the corrected data available to all.
ClearlyDefined’s solution
ClearlyDefined enables organizations to access license metadata through a simple API. This reduces the need for repeated license scanning and helps with SBOM generation at scale. When issues arise with a component’s license metadata, organizations can contribute fixes that benefit the entire community.
Getting started with ClearlyDefined
To use ClearlyDefined, you can access its API directly from your terminal. For example, let’s say you’re working with a JavaScript library like Lodash. By calling the API, you can get all license metadata for a specific version of Lodash at your fingertips.
Once you incorporate this licensing metadata into your workflow, you may notice some metadata that needs updating. You can curate that data and contribute it back, so everyone benefits. ClearlyDefined also provides a user-friendly interface for this, making it easier to contribute.
Open Source and community contributions
ClearlyDefined is an Open Source initiative, hosted on GitHub, supporting various package managers (e.g., npm, pypi). We work to promote best practices and integrate with other tools. Recently, we’ve expanded our scope to support non-SPDX licenses and Conda, a package manager often used in data science projects.
Integration with other tools
ClearlyDefined integrates with GUAC, an OpenSSF project that consumes ClearlyDefined data. This integration broadens the reach and utility of ClearlyDefined’s licensing information.
Case studies and community impact
I’d like to hand it over to Lynette from GitHub, who will talk about how GitHub uses ClearlyDefined and why it’s critical for license compliance.
GitHub’s use of ClearlyDefined
Hello, I’m Lynette, a developer at GitHub working on license compliance solutions. ClearlyDefined has become a key part of our workflows. Knowing the licenses of our dependencies is crucial, as legal compliance requires correct attributions. By using ClearlyDefined, we’ve streamlined our process and now manage over 40 million licenses. We also run our own harvester to contribute back to ClearlyDefined and scale our operations.
SAP’s adoption of ClearlyDefined
Hi, my name is Qing. At SAP, we co-innovate and collaborate with Open Source, ensuring a clean, well-maintained software pool. ClearlyDefined has streamlined our license review process, reducing time spent on scanning and enhancing data quality. SAP’s journey with ClearlyDefined began in 2018, and since then, we’ve implemented large-scale automation for our Open Source compliance and continuously contribute curated data back to the community.
Community and governance
ClearlyDefined thrives on community involvement. We recently elected members to our Steering and Outreach Committees to support the platform and encourage new contributors. Our weekly developer meetings and active Discord channel provide opportunities to engage, share knowledge, and collaborate.
Q&A highlights
- PURLs as Package Identifiers: We’re exploring support for PURLs as an internal coordinate system.
- Data Quality Issues: Data quality is our top priority. We plan to implement routines to scan for common issues, ensuring accurate metadata across the platform.
Thank you all for joining us today. If you’re interested in contributing, please reach out and become part of this collaborative community.
Members Newsletter – November 2024
After more than two years of collaboration, information gathering, global workshopping, testing, and an in-depth co-design process, we have an Open Source AI Definition.
The purpose of version 1.0 is to establish a workable standard for developers, researchers, and educators to consider how they may design evaluations for AI systems’ openness. The meaningful ability to fork and control their AI will foster permissionless, global innovation. It was important to drive a stake in the ground so everyone has something to work with. It’s version 1.0, so going forward, the process allows for improvement, and that’s exactly what will happen.
Over 150 individuals were part of the OSAID forum; nearly 15K subscribers to the OSI newsletter were kept up to date with the latest news about the OSAID; and 2M unique visitors to the OSI website were exposed to the OSAID process. There were 50+ co-design working group volunteers representing 29 countries, including participants from Africa, Asia, Europe, and the Americas.
Future versions of the OSAID will continue to be informed by the feedback we receive from various stakeholder communities. The fundamental principles and aim will not change, but, as our (collective) understanding of the technology improves and the technology itself evolves, we might need to update the definition to clarify or even change certain requirements. To enable this, the OSI Board voted to establish an AI sub-committee that will develop appropriate mechanisms for updating the OSAID in consultation with stakeholders. It will be fully formed in the months ahead.
Please continue to stay involved, as diverse voices and experiences are required to ensure Open Source AI works for the good of us all.
Stefano Maffulli
Executive Director, OSI
I hold weekly office hours on Fridays with OSI members: book time if you want to chat about OSI’s activities, if you want to volunteer or have suggestions.
News from the OSI
The Open Source Initiative Announces the Release of the Industry’s First Open Source AI Definition
Open and public co-design process culminates in a stable version of the Open Source AI Definition, ensuring freedoms to use, study, share and modify AI systems.
Other highlights:
- How we passed the AI conundrums
- ClearlyDefined at SOSS Fusion 2024
- ClearlyDefined’s Steering and Outreach Committees Defined
- The Open Source Initiative Supports the Open Source Pledge
Article from ZDNet
For 25 years, OSI’s definition of open-source software has been widely accepted by developers who want to build on each other’s work without fear of lawsuits or licensing traps. Now, as AI reshapes the landscape, tech giants face a pivotal choice: embrace these established principles or reject them.
Other highlights:
- The Gap Between Open and Closed AI Models Might Be Shrinking. Here’s Why That Matters (Time)
- Meta’s military push is as much about the battle for open-source AI as it is about actual battles (Fortune)
- OSI unveils Open Source AI Definition 1.0 (InfoWorld)
- We finally have an ‘official’ definition for open source AI (TechCrunch)
- Read all press mentions from this past month
News from OSI affiliates:
- OpenSSF: SOSS Fusion 2024: Uniting Security Minds for the Future of Open Source (Security Boulevard)
- Mozilla Foundation: How Mozilla’s President Defines Open-Source AI (Forbes)
News from OpenSource.net:
- OpenSource.Net turns one with a redesign
- How to make reviewing pull requests a better experience
- Closing the Gap: Accelerating environmental Open Source
The State of Open Source Survey
In collaboration with the Eclipse Foundation and Open Source Initiative (OSI).
Jobs
Lead OSI’s public policy agenda and education.
Bloomberg is seeking a Technical Architect to join their OSPO team.
Events
Upcoming events:
- Nerdearla Mexico (November 7-9, 2024 – Mexico City)
- SeaGL (November 8-9, 2024 – Seattle)
- SFSCON (November 8-9, 2024 – Bolzano)
- KubeCon + CloudNativeCon North America (November 12-15, 2024 – Salt Lake City)
- OpenForum Academy Symposium (November 13-14, 2024 – Boston)
- The Linux Foundation Legal Summit (November 18-19, 2024 – Napa)
- The Linux Foundation Member Summit (November 19-21, 2024 – Napa)
- Open Source Experience (December 4-5, 2024 – Paris)
- KubeCon + CloudNativeCon India (December 11-12, 2024 – Delhi)
- EU Open Source Policy Summit (January 31, 2025 – Brussels)
- FOSDEM (February 1-2, 2025 – Brussels)
CFPs:
- FOSDEM 2025 EU-Policy Devroom – event being organized by the OSI, OpenForum Europe, Eclipse Foundation, The European Open Source Software Business Association, the European Commission Open Source Programme Office, and the European Commission.
- PyCon US 2025: the Python Software Foundation kicks off Website, CfP, and Sponsorship!
Interested in sponsoring, or partnering with, the OSI? Please see our Sponsorship Prospectus and our Annual Report. We also have a dedicated prospectus for the Deep Dive: Defining Open Source AI. Please contact the OSI to find out more about how your company can promote open source development, communities and software.
Get to vote for the OSI Board by becoming a member
Let’s build a world where knowledge is freely shared, ideas are nurtured, and innovation knows no bounds!
The Open Source Initiative Announces the Release of the Industry’s First Open Source AI Definition
RALEIGH, N.C., Oct. 28, 2024 — ALL THINGS OPEN 2024 — After a year-long, global, community design process, the Open Source AI Definition (OSAID) v.1.0 is available for public use.
The release of version 1.0 was announced today at All Things Open 2024, an industry conference focused on common issues of interest to the worldwide Open Source community. The OSAID offers a standard by which community-led, open and public evaluations will be conducted to validate whether or not an AI system can be deemed Open Source AI. This first stable version of the OSAID is the result of multiple years of research and collaboration, an international roadshow of workshops, and a year-long co-design process led by the Open Source Initiative (OSI), globally recognized by individuals, companies and public institutions as the authority that defines Open Source.
“The co-design process that led to version 1.0 of the Open Source AI Definition was well-developed, thorough, inclusive and fair,” said Carlo Piana, OSI board chair. “It adhered to the principles laid out by the board, and the OSI leadership and staff followed our directives faithfully. The board is confident that the process has resulted in a definition that meets the standards of Open Source as defined in the Open Source Definition and the Four Essential Freedoms, and we’re energized about how this definition positions OSI to facilitate meaningful and practical Open Source guidance for the entire industry.”
“The new definition requires Open Source models to provide enough information about their training data so that a ‘skilled person can recreate a substantially equivalent system using the same or similar data,’ which goes further than what many proprietary or ostensibly Open Source models do today,” said Ayah Bdeir, who leads AI strategy at Mozilla. “This is the starting point to addressing the complexities of how AI training data should be treated, acknowledging the challenges of sharing full datasets while working to make open datasets a more commonplace part of the AI ecosystem. This view of AI training data in Open Source AI may not be a perfect place to be, but insisting on an ideologically pristine kind of gold standard that will not actually be met by any model builder could end up backfiring.”
“We welcome OSI’s stewardship of the complex process of defining Open Source AI,” said Liv Marte Nordhaug, CEO of the Digital Public Goods Alliance (DPGA) secretariat. “The Digital Public Goods Alliance secretariat will build on this foundational work as we update the DPG Standard as it relates to AI as a category of DPGs.”
“Transparency is at the core of EleutherAI’s non-profit mission. The Open Source AI Definition is a necessary step towards promoting the benefits of Open Source principles in the field of AI,” said Stella Biderman, executive director at the EleutherAI Institute. “We believe that this definition supports the needs of independent machine learning researchers and promotes greater transparency among the largest AI developers.”
“Arriving at today’s OSAID version 1.0 was a difficult journey, filled with new challenges for the OSI community,” said OSI Executive Director, Stefano Maffulli. “Despite this delicate process, filled with differing opinions and uncharted technical frontiers—and the occasional heated exchange—the results are aligned with the expectations set out at the start of this two-year process. This is a starting point for a continued effort to engage with the communities to improve the definition over time as we develop with the broader Open Source community the knowledge to read and apply OSAID v.1.0.”
The text of the OSAID v.1.0 as well as a partial list of the many global stakeholders who endorse the definition can be found here: https://opensource.org/ai
About the Open Source Initiative
Founded in 1998, the Open Source Initiative (OSI) is a non-profit corporation with global scope formed to educate about and advocate for the benefits of Open Source and to build bridges among different constituencies in the Open Source community. It is the steward of the Open Source Definition and the Open Source AI Definition, setting the foundation for the global Open Source ecosystem. Join and support the OSI mission today at: https://opensource.org/join.
ClearlyDefined’s Steering and Outreach Committees Defined
We are excited to announce the newly elected leaders for the ClearlyDefined Steering and Outreach Committees!
What is ClearlyDefined?
ClearlyDefined is an Open Source project dedicated to improving the clarity and transparency of Open Source licensing and security data. By harvesting, curating, and sharing essential metadata, ClearlyDefined helps developers and organizations better understand their software components, ensuring responsible and compliant use of Open Source code.
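As a concrete illustration (not part of the announcement itself), ClearlyDefined publishes its harvested and curated data through a public REST API keyed by component "coordinates". The sketch below shows how coordinates might be assembled and a curated license read out of a definition; the coordinate pattern and the `licensed.declared` field follow ClearlyDefined's documented API, but treat the exact response shape as an assumption.

```python
# Minimal sketch of working with ClearlyDefined data (assumed API shapes).
# Coordinates follow the pattern type/provider/namespace/name/revision,
# with "-" standing in for an empty namespace.

def coordinates(pkg_type, provider, namespace, name, revision):
    """Build a ClearlyDefined coordinates path for a component."""
    return f"{pkg_type}/{provider}/{namespace or '-'}/{name}/{revision}"

def declared_license(definition):
    """Read the curated declared license from a definition response dict."""
    return definition.get("licensed", {}).get("declared", "NOASSERTION")

coords = coordinates("npm", "npmjs", None, "lodash", "4.17.21")
# A real lookup would GET https://api.clearlydefined.io/definitions/<coords>;
# here we parse a response-shaped dict to keep the example self-contained.
sample = {"licensed": {"declared": "MIT"}}
print(coords)                    # npm/npmjs/-/lodash/4.17.21
print(declared_license(sample))  # MIT
```

Tools like GUAC (mentioned below) consume this kind of data at scale to reason about license and security posture across a dependency graph.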
Steering Committee Election Results
Congratulations to E. Lynette Rayle (GitHub), Qing Tomlinson (SAP), and Jeff Mendoza (Kusari/GUAC) for being elected to the ClearlyDefined Steering Committee. These three community leaders will serve a one-year term starting on September 25, 2024. Following election recommendations, the Steering Committee is structured to have an odd number of members (three in this case) and a maximum of one member per company. Lynette Rayle was elected chair of the committee.
The Steering Committee is primarily responsible for setting the project’s technical direction. They oversee processes such as data harvesting, curation, and contribution, ensuring that the underlying architecture functions smoothly. Their focus is on empowering the community, supporting the contributors and maintainers, and fostering collaboration with related projects.
E. Lynette Rayle is a Senior Engineer at GitHub and has been working on ClearlyDefined as a maintainer for just over a year. GitHub is using ClearlyDefined data in several capacities and has a strong stake in ensuring successful outcomes in data quality, performance, and sustainability.
Qing Tomlinson is a Senior Developer at SAP and has been contributing to the ClearlyDefined project since November 2021. SAP has been actively engaged in the ClearlyDefined project since its inception, utilizing the data and actively contributing to its curation. The quality, performance, and long-term viability of the ClearlyDefined project are of utmost importance to SAP.
Jeff Mendoza is a Software Engineer at Kusari, a software supply chain security startup. He is a maintainer of the OpenSSF GUAC project, which consumes ClearlyDefined data. Formerly, Jeff was a full time developer on ClearlyDefined. Jeff brings experience from both the sides of the project, developer and consumer.
Outreach Committee Election Results
We are also thrilled to announce the election of Jeff Luszcz (GitHub), Alyssa Wright (Bloomberg), Brian Duran (SAP), and Nick Vidal (Open Source Initiative) to lead the ClearlyDefined Outreach Committee. They began their one-year term on October 7, 2024. Unlike the Steering Committee, the Outreach Committee has four members, following a consensus reached at the Community meeting that an even number of members is acceptable since tie-breaking votes are less likely. The elected members will select their Chair soon and may also invite other community members to participate.
The Outreach Committee focuses on promoting the project and growing its community. Their responsibilities include organizing events, creating educational materials, and managing communications across various channels, including blogs, social media, and webinars. They help ensure that more users and contributors engage with ClearlyDefined and understand its mission.
Jeff Luszcz is Staff Product Manager at GitHub. Since 2004, he has helped hundreds of software companies understand how to best use open source while complying with their license obligations and keeping on top of security issues.
Alyssa Wright helps lead Bloomberg’s Open Source Program Office in the Office of the CTO, which is the center of excellence for Bloomberg’s engagements with and consumption of open source software.
Brian Duran leads the implementation strategy for adoption of ClearlyDefined within SAP’s open source compliance teams. He has a combined 12 years of experience in open-source software compliance and data quality management.
Nick Vidal is Community Manager at the Open Source Initiative and former Outreach Chair at the Confidential Computing Consortium from the Linux Foundation. Previously, he was the Director of Community and Business Development at the Open Source Initiative and Director of Americas at the Open Invention Network.
Get Involved!
We encourage everyone in the ClearlyDefined community to get involved! Whether you’re a developer, data curator, or simply passionate about Open Source software, your contributions are invaluable. Join the conversation, attend meetings, and share your ideas on how to improve and grow the project. Reach out to the newly elected committee members or participate in our upcoming community events.
Let’s work together to drive the ClearlyDefined mission forward! Stay tuned for more updates and opportunities to participate as the committees continue their important work.
Rahmat Akintola: Voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Rahmat Akintola
What’s your background related to Open Source and AI?
Sure. I’ll start with Open Source. My journey began at PyCon Africa in 2019, where I participated in a hackathon on Cookiecutter. At the time, I had just transitioned into web development, and I was looking for ways to improve my skills beyond personal projects. So, I joined the Cookiecutter Academy at Python Africa in 2019. That’s how I got introduced to Open Source.
Since then, I’ve been contributing regularly, starting with one-off contributions to different projects. These days, I primarily focus on code and documentation contributions, mainly in web development.
As for AI, my journey started with data science. I had been working as a program manager and was part of the Women in Machine Learning and Data Science community in Accra, which was looking for volunteers. Coincidentally, I had lost my job at the time, so I applied for the program manager role and got it. That experience sparked my interest in AI. I started learning more about machine learning and AI, and I needed to build my domain knowledge to help with my role in the community.
I’ve worked on traditional models like linear and logistic regression through various courses. Recently, as part of our community, we organized a “Mathematics for Machine Learning” boot camp, where we worked on projects related to reinforcement learning and logistic regression. One dataset I worked with involved predicting BP (blood pressure) levels in the US. The task was to assess the risk of developing hypertension based on various factors.
What motivated you to join this co-design process to define Open Source AI?
The Open Source AI journey started when I was informed about a virtual co-design process that was reaching out to different communities, including mine. As the program lead, I saw it as an opportunity to merge my two passions—Open Source and AI.
I volunteered and worked on testing the OpenCV workbook, as I was using OpenCV at the time. I participated in the first phase, which focused on determining whether certain datasets needed to be open. Unfortunately, I couldn’t participate in the validation phase because I was involved in the mathematics boot camp, but I followed the discussions closely.
When the opportunity came up to participate in the co-design process, I saw it as a chance to bridge my work in Open Source web development and my growing interest in AI. It felt like the perfect moment. I was already using OpenCV, which happened to be part of the AI systems under review, so I jumped right in.
Through the process, I realized that defining Open Source AI goes beyond just using tools or making code contributions—it involves a deep understanding of data, legality, and the broader system.
How did you get invited to speak at the Deep Learning Indaba conference in Dakar? How was the conference experience? Did you make any meaningful connections?
As for speaking at Deep Learning Indaba, the opportunity came unexpectedly. One day, Mer Joyce (the OSAID co-design organizer) sent an email offering a chance to speak on Open Source AI at the conference. I had previously applied to attend but didn’t get in, so I jumped on this opportunity. We used a presentation similar to one Mer had given at Open Source Community Africa.
I made excellent connections. The conference itself was amazing—though the food and the Senegal experience also played a part! There were many AI and machine learning researchers, and I learned new concepts, like using JAX, which was introduced as an alternative to some common frameworks. The tutorials were well-targeted at beginners, which was perfect for me.
On a personal level, it was great to connect with academics. I’m considering applying for a master’s or Ph.D., and the conference provided an opportunity to ask questions and receive guidance.
Why do you think AI should be Open Source?
AI is becoming a significant part of our lives. I work with the Meltwater Entrepreneurial School of Technology (MEST) as a technical lead, and we use AI for various training purposes. Opening up parts of AI systems allows others to adapt and refine them to suit their needs, especially in localized contexts. For example, I saw someone on Twitter excited about building a GPT for dating, customizing it to ask specific questions.
This ability for people to tweak and refine AI models, even without building them from scratch, is important. Open-sourcing AI enables more innovation and helps tailor models for specific needs, which is why I believe it should be open to an extent.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
One new perspective I gained was on the legal and data availability aspects of AI. Before this, I had never really considered the legal side of things, but during the co-design process, it became clear that these elements are crucial in defining Open Source AI systems. It’s more than just contributing code—it’s about ensuring compliance with legal frameworks and making sure data is available and usable.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
A clear definition would help people understand that Open Source AI involves more than just attaching an MIT or Apache license to a project on GitHub. There’s more complexity around sharing models, data and parameters.
For instance, I was once asked whether using an “Open Source” large language model like LLaMA meant the data had to be open too. A well-defined standard would provide guidance for questions like these, ensuring people understand the legal and technical aspects of making their AI systems Open Source.
What do you think are the next steps for the community involved in Open Source AI?
In Africa, I think the next step is spreading awareness about the Open Source AI Definition. Many people are still unaware of the complexities, and there’s still a tendency to assume that adding an Open Source license to a project automatically makes it open. Building collaborations with local communities to share this information is important.
For women, especially in Africa, visibility is key. When women see others doing similar work, they feel encouraged to join. Representation and community engagement play significant roles in driving diversity in Open Source AI.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the forum: share your comment on the drafts.
- Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
- Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
How we passed the AI conundrums
Some people believe that full unfettered access to all training data is paramount. This group argues that anything less than all the data would compromise the Open Source principles, forever removing full reproducibility of AI systems, transparency, security and other outcomes. We’ve heard them and we’ve provided a solution rooted in decades of Open Source practice.
To have the chance for powerful Open Source AI systems to exist in any domain, the OSI community has incorporated in the Definition this principle:
An Open Source AI needs to make available three kinds of components: the software used to create the dataset and run the training, the model parameters and the code to run inference, and finally all the data that can be made available legally.
Recognizing that there are four kinds of “data”, each with its own legal framework allowing different freedoms of distribution, we bypass what Stephen O’Grady called the “AI conundrums” and give Open Source AI builders a chance to build freedom-respecting alternatives to pretty much any proprietary AI.
Limiting Open Source AI only to systems trainable on freely distributable data would relegate Open Source AI to a niche, and there are abundant reasons to reject this limitation. One is that the amount of freely and legally shareable data is a tiny fraction of what is necessary to train powerful systems. Another is that it would exclude Open Source AI from areas where data cannot be shared, such as medicine or anything dealing with personal or private data. What would remain for “Open Source AI” would be tiny.
The fact is, mixing openly distributable and non-distributable data is very similar to a reality we are very familiar with: Open Source software built with proprietary compilers and system libraries.
Is GNU Emacs Open Source software?
I’m sure you’d answer yes (and some of you will say “well, actually it’s free software”) and we’ll all agree. Below is a rough diagram of Emacs built for the GNOME desktop on a modern Linux distribution. Emacs depends on a few system libraries that GNOME provides with OSI-Approved Licenses. The whole stack is Open Source these days and one can distribute Emacs on a disk with all its dependencies without too much legal trouble. Imagine scientists who want to freeze the whole environment of an experiment they made; they could package all the pieces of a system like this without trouble and distribute it all with their paper. No problem here.
Now let’s go back to an age when Linux systems weren’t ready. When Stallman started writing Emacs, there was no GNOME and no Linux, no gcc and no glibc. He thought very early on that in order to have more freedom, he had to create a wedge to allow Emacs to run on proprietary software.
Emacs on the latest Solaris versions would look something like this: some pieces, like X11 and GStreamer, are Open Source. Others, like libc, aren’t. The hypothetical scientists from before couldn’t really freeze their full scientific environment. All they could say in their paper was: “We used Emacs from this CVS version, built with gcc version X and these makefiles; tar.gz attached,” along with a list of the operating system and library versions they used. That’s because they had the right to distribute only Emacs, X11, and some libraries, not the rest of Solaris.
Is Emacs on Solaris Open Source? Of course it is, even though the source code for the system libraries is not available.
One more question: Emacs on Mac OS can only be built with a proprietary compiler, on a proprietary GUI, and with other proprietary libraries.
Is Emacs on Mac Open Source? Of course it is. Can you fully study Emacs on Mac OS? For Emacs, yes. For the MacOS components, no. There are many programs that run only on MacOS or Windows: for OSI, those are Open Source. Would someone argue that they’re not “really Open Source” because you can’t see “everything”? Some people might, but we’ve learned to live with that, adding governance rules in addition to those of the Open Source Definition. Debian, for example, requires that programs be Open Source and support multiple hardware platforms; the ASF graduates only projects that are Open Source and have a diverse community of contributors. If you only want to use Open Source applications running on Open Source stacks, you can decide that! Just as you can decide that your company will only acquire Open Source software whose copyright is owned by multiple entities.
These are all additional requirements built on top of the base floor set by the Open Source Definition.
For AI, you can do the same: You can say “I will only use Open Source AI built with open data, because I don’t want to trust anything less than that.” A large organization could say “I will buy only Open Source AI that allows me to audit their full dataset, including unshareable data.” You can do all that. Open Source AI is the floor that you can build on, like the OSD.
Bypassing the conundrums
We’ve looked for a solution for almost three years and this is it: Require all the data that is legally shareable, and for the other data provide all the details. It’s exactly what we’ve been doing for Open Source software:
You developed a text editor for Mac OS but you can’t share the system libraries? Fine, we’ll fork it: give us all the code you can legally share under an OSI-Approved License and we’ll rip out the dependencies and “liberate” it to run on GNU. The editor will be slightly different, just as code that runs on some ARM+Linux systems behaves differently on Intel+Windows because of the different capabilities of the underlying hardware and OS, but it’s still Open Source.
For Open Source AI it’s a similar dance: You can’t legally give us all the data? Fine, we’ll fork it. For example, you made an AI that recognizes bone cancer in humans but the data can’t be shared. We’ll fork it! Tell us exactly how you built the system, how you trained it, share the code you used, and an anonymized sample of the data you used so we can train on our X-ray images. The system will be slightly different but it’s still Open Source AI.
If we want to have broad availability of powerful alternatives to proprietary AI systems that respect the freedoms of users and deployers, we must recognize conditions that make sense for the domain of AI. These examples of proprietary compilers and system libraries used to build Open Source software prove that there is room for similar conditions when talking about Code, Data and Parameters within the definition of Open Source AI.
The Open Source Initiative Supports the Open Source Pledge
As businesses rely more heavily on Open Source software (OSS), the strain on maintainers to provide timely updates and security patches continues to grow – often without fair compensation for their crucial work. Recent high-profile security incidents like XZ and Log4Shell have put a spotlight on the security challenges developers face against a backdrop of burnout that has reached an all-time high.
To help address this imbalance, the Open Source Initiative (OSI) supports the Open Source Pledge, launched today by Sentry and partners to support maintainers and inspire a shift toward a healthier work-life balance, and more robust software security practices. The Pledge is a commitment from member companies to pay Open Source maintainers and organizations meaningfully in support of a more sustainable maintainer ecosystem and a reduction of flare-ups of high-profile security incidents.
This Pledge is an attempt to address a problem that has long existed within the Open Source ecosystem. Many companies have built their businesses on top of Open Source software, benefiting from the contributions of maintainers while taking them for granted. While they’ve reaped the rewards, the burden has been placed on unpaid or underpaid developers.
It is essential that companies recognize their role in sustaining the ecosystem that powers their innovations. By taking the Pledge, companies have one more instrument to commit to supporting an ecosystem of maintainers and organizations, ensuring the long-term health of the Open Source projects they rely on.
In order to qualify, the projects that companies pledge to should meet the Open Source Definition. You can join the Open Source Pledge by donating to the Open Source Initiative or contacting us to become a sponsor.
Members Newsletter – October 2024
We’re pleased to announce that Release Candidate 1 of the Open Source AI Definition has been confirmed and published! If you’d like to add your name to the list of endorsers published online, please let us know.
We traveled to four continents, presenting to diverse audiences and soliciting feedback on the draft definition: Deep Learning Indaba in Dakar, Senegal; IndiaFOSS in Bangalore, India; Open Source Summit EU in Vienna, Austria; and Nerdearla in Buenos Aires, Argentina.
The work continues this month as we seek input at the Data in OSAI workshop in Paris, France; at OCX in Mainz, Germany; and during our weekly town hall meetings. And, finally, we’ll be in Raleigh, North Carolina, at the end of the month for All Things Open, where we plan to present the Open Source AI Definition version 1.0!
My thanks to everyone who is contributing to this community-led process. Please continue to let your voice be heard.
Stefano Maffulli
Executive Director, OSI
I hold weekly office hours on Fridays with OSI members: book time if you want to chat about OSI’s activities, if you want to volunteer or have suggestions.
News from the OSI
The Open Source AI Definition RC1 is available for comments
The first Release Candidate of the Open Source AI Definition has been published, and collaboration continues.
Other highlights:
- Co-designing the OSAID: a highlight from Nerdearla
- A Journey toward defining Open Source AI: presentation at Open Source Summit Europe
- Copyright law makes a case for requiring data information rather than open datasets for Open Source AI
- Data Transparency in Open Source AI: Protecting Sensitive Datasets
- Is “Open Source” ever hyphenated?
- Jordan Maris joins OSI
Article from Mark Surman at The New Stack
Other highlights:
- Elastic founder on returning to open source four years after going proprietary (TechCrunch)
- Europe’s Tech Future Hinges on Open Source AI (The New Stack)
- How big new AI regulatory pushes could affect open source (Tech Brew)
- Does Open Source Software Still Matter? (Datanami)
- AI2’s new model aims to be open and powerful yet cost effective (VentureBeat)
- Is that LLM Actually “Open Source”? We need to talk Open-Washing in AI Governance (HackerNoon)
- AI Models From Google, Meta, Others May Not Be Truly ‘Open Source’ (PCMag)
- What’s Behind Elastic’s Unexpected Return to Open Source? (The New Stack)
The Open Policy Alliance reaches 100 members on LinkedIn.
Other newsNews from OSI affiliates:
- The Eclipse Foundation Launches the Open Regulatory Compliance Working Group to Help Open Source Participants Navigate Global Regulations
- LPI and OSI Unite to Professionalize the Global Linux and Open Source Ecosystem
- Apache Software Foundation Initiatives to Fuel the Next 25 Years of Open Source Innovation
- Open source orgs strengthen alliance against patent trolls
- Open Source Foundations Considered Helpful
- Solving the Maker-Taker problem
News from OpenSource.net:
- OpenSource.Net turns one with a redesign
- Beyond the binary: The nuances of Open Source innovation
- Steady in a shifting Open Source world: FreeBSD’s enduring stability
Tidelift’s 2024 State of the Open Source Maintainer Report
More than 400 maintainers responded and shared details about their work.
The State of Open Source Survey
In collaboration with the Eclipse Foundation and Open Source Initiative (OSI).
Jobs
Lead OSI’s public policy agenda and education.
Bloomberg is seeking a Technical Architect to join their OSPO team.
Events
Upcoming events:
- Hacktoberfest (October – Online)
- SOSS Fusion (October 22-23, 2024 – Atlanta)
- Open Community Experience (October 22-24, 2024 – Mainz)
- All Things Open (October 27-29, 2024 – Raleigh)
- Nerdearla Mexico (November 7-9, 2024 – Mexico City)
- SeaGL (November 8-9, 2024 – Seattle)
- OpenForum Academy Symposium (November 13-14, 2024 – Boston)
CFPs:
- FOSDEM 2025 call for devrooms (February 1-2, 2025 – Brussels)
- Consul Conference (February 4-6, 2025 – Las Palmas de Gran Canaria)
- SCALE 22x (March 6-9, 2025 – Pasadena)
Interested in sponsoring, or partnering with, the OSI? Please see our Sponsorship Prospectus and our Annual Report. We also have a dedicated prospectus for the Deep Dive: Defining Open Source AI. Please contact the OSI to find out more about how your company can promote open source development, communities and software.
Support OSI by becoming a member!
Let’s build a world where knowledge is freely shared, ideas are nurtured, and innovation knows no bounds!
Co-designing the OSAID: a highlight from Nerdearla
At the 10th anniversary of Nerdearla, one of the largest Open Source conferences in Latin America, Mer Joyce, Co-Design Facilitator of the Open Source AI Definition (OSAID), delivered a key presentation titled “Defining Open Source AI”. Held in Buenos Aires from September 24-28, 2024, this major event brought together 12,000 in-person participants and over 30,000 virtual attendees, with more than 200 speakers from 20 countries. Organized as a free-to-attend event, Nerdearla 2024 exemplified the spirit of Open Source collaboration by providing a platform for developers, enthusiasts, and thought leaders to share knowledge and foster community engagement.
Why is a definition so important?
Mer Joyce took the stage at Nerdearla to present “Defining Open Source AI”. Mer’s presentation focused on the organization’s ongoing work to establish a global Open Source AI Definition (OSAID). She emphasized the importance of co-designing this definition through a collaborative, inclusive process that ensures input from stakeholders across industries and continents.
Her talk underscored the significance of defining Open Source AI in the context of increasing AI regulations from governments in the EU, the U.S., and beyond. In her view, defining OSAI is essential for combating “open-washing”—where companies falsely market their AI systems as Open Source while imposing restrictive licenses—and for promoting true openness, transparency, and innovation in the AI space.
A global and inclusive process
Mer Joyce highlighted the co-design process for the Open Source AI Definition, which has been truly global in scope. Workshops, talks, and activities were held on five continents: Africa, Europe, Asia, North America, and South America, with participants from over 35 countries. These in-person and virtual sessions ensured that voices from a wide range of backgrounds—especially those from underrepresented regions—contributed to shaping the OSAID.
The four freedoms
The core of the OSAID rests on the “Four Freedoms” of Open Source AI:
- Use the system for any purpose and without having to ask for permission.
- Study how the system works and inspect its components.
- Modify the system for any purpose, including to change its output.
- Share the system for others to use with or without modifications, for any purpose.
Four working groups were formed with the intention of identifying what components must be open in order for an AI system to be used, studied, modified, and shared. The working groups focused on Bloom, OpenCV, Llama 2, and Pythia, four systems with different approaches to OSAI.
Each working group voted on the required components and evaluated legal frameworks and legal documents for each component. Subsequently, each working group proceeded to publish a recommendation report.
The end result is the OSAID with a comprehensive definition checklist encompassing a total of 17 components. As part of the validation process, more working groups are being formed to evaluate how well other AI systems align with the definition.
Nerdearla: a platform for open innovation
Mer Joyce’s presentation at Nerdearla exemplified the broader theme of the conference—creating a more open and collaborative future for technology. As one of the largest Open Source conferences in Latin America, Nerdearla serves as a vital hub for fostering innovation across the Open Source community. By bringing together experts like Mer Joyce to discuss pivotal issues such as AI transparency and openness, the event highlights the importance of defining shared standards for emerging technologies.
Moving forward: the future of the OSAID
The OSAID is currently in its final stages of development, with version 1.0 expected to be launched at the All Things Open conference in October 2024. The OSI invites individuals and organizations to endorse the OSAID ahead of its official release. This endorsement signifies support for a global definition that aims to ensure AI systems are open, transparent, and aligned with the values of the Open Source movement.
To get involved, participants are encouraged to attend weekly town halls, contribute feedback, and participate in the public review process. Consider endorsing the OSAID to become a part of the movement to define and promote truly Open Source AI systems.
The Open Source AI Definition RC1 is available for comments
A little over a month after v.0.0.9, we have a Release Candidate version of the Open Source AI Definition. We reached this milestone with extensive community feedback: five town hall meetings, numerous comments on the forum and on the draft, and in-person conversations at events in Austria, China, India, Ghana, and Argentina.
There are three relevant changes to the part of the definition pertaining to the “preferred form to make modifications to a machine learning system.”
The feature that will draw the most attention is the new language of Data Information. It clarifies that all the training data needs to be shared and disclosed. The updated text comes from many conversations with individuals who engaged passionately with the design process, on the forum, in person and on hackmd. These conversations helped describe four types of data: open, public, obtainable and unshareable, each well described in the FAQ. The legal requirements are different for each, and each must be shared to the extent the law allows.
Two new features are equally important. RC1 clarifies that Code must be complete: enough for downstream recipients to understand how the training was done. This reinforces the importance of the training code, for transparency, security and other practical reasons. Training is where innovation is happening at the moment, which is why you don’t see corporations releasing their training and data processing code. We believe, given the current state of knowledge and practice, that this is required to meaningfully fork (study and modify) AI systems.
Lastly, there is new text that explicitly acknowledges it is admissible to require copyleft-like terms for any of the Code, Data Information and Parameters, individually or as bundled combinations. An illustrative scenario: a consortium owning rights to training code and a dataset decides to distribute the bundle of code and data under legal terms that tie the two together, with copyleft-like provisions. This sort of legal document doesn’t exist yet, but the scenario is plausible enough to deserve consideration. This is another area that OSI will monitor carefully as we start reviewing these legal terms with the community.
A note about science and reproducibility
The aim of Open Source is not and has never been to enable reproducible software. The same is true for Open Source AI: reproducibility of AI science is not the objective. Open Source’s role is merely not to be an impediment to reproducibility. In other words, one can always add more requirements on top of Open Source, just like the Reproducible Builds effort does.
Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone. This is why OSD #2 requires that the “source code” must be provided in the preferred form for making modifications. This way everyone has the same rights and ability to improve the system as the original developers, starting a virtuous cycle of innovation. Forking in the machine learning context has the same meaning as with software: having the ability and the rights to build a system that behaves differently from the original. A fork may, for example, fix security issues, improve behavior, or remove bias. All of this is possible thanks to the requirements of the Open Source AI Definition.
What’s coming next
With the release candidate cycle starting today, the drafting process will shift focus: no new features, only bug fixes. We’ll watch for newly raised issues, looking for major flaws that might require significant rewrites to the text. The main focus will be on the accompanying documentation, the Checklist and the FAQ. We also realized that in our zeal to solve the problem of data that needs to be provided but cannot be supplied by the model owner for good reasons, we had failed to make clear the basic requirement that “if you can share the data you must.” We have already made adjustments in RC1 and will be seeking views on how to better express this in an RC2.
In the next weeks until the 1.0 release of October 28, we’ll focus on:
- Getting more endorsers to the Definition
- Continuing to collect feedback on hackmd and the forum, focusing on new, previously unseen concerns
- Preparing the artifacts necessary for the launch at All Things Open
- Iterating on the Checklist and FAQ, preparing them for deployment
A Journey toward defining Open Source AI: presentation at Open Source Summit Europe
A few weeks ago I attended Open Source Summit Europe 2024, an event organized by the Linux Foundation, that brought together brilliant developers, technologists and leaders from all over the world, reinforcing what Open Source is truly about—collaboration, innovation and community.
I had the honor of leading a session that tackled one of the most critical challenges in the Open Source movement today—defining what it means for AI to be “Open Source.” Together with OSI Board Director Justin Colannino, I presented v.0.0.9 of the Open Source AI Definition. This session marked an important milestone for both the Open Source Initiative (OSI) and the broader community, a moment that encapsulated years of collaboration, learning and exploration.
The story behind the Open Source AI Definition
Our session, titled “The Open Source AI Definition Is (Almost) Ready,” was more than just a talk—it was an interactive dialogue. As Justin kicked off the session, he captured the essence of the journey we’ve been on. OSI has been grappling with what it means to call AI systems, models and weights “Open Source.” This challenge comes at a time when companies and even regulations are using the term without a clear, agreed-upon definition.
From the outset, we knew we had to get it right. The Open Source values that have fueled so much software innovation—transparency, collaboration, freedom—needed to be the foundation for AI as well. But AI isn’t like traditional software, and that’s where our challenge began.
The origins: a podcast and a vision
When I first became Executive Director of OSI, I pitched the idea of exploring how Open Source principles apply to AI. We spent months strategizing, and the more we dove in, the more we realized how complex the task would be. We didn’t know much about AI at the time, but we were eager to learn. We turned to experts from various fields—a copyright lawyer, an ethicist, AI pioneers from EleutherAI and Debian ML, and even an AI security expert from DARPA. Those conversations culminated in a podcast we created called Deep Dive AI, which I highly recommend to anyone interested in this topic.
Through those early discussions, it became clear that AI and machine learning are not software in the traditional sense. Concepts like “source code,” which had been well-defined in software thanks to people like Richard Stallman and the GNU GPL, didn’t apply 1:1 to AI. We didn’t even know what the “program” was in AI, nor could we easily determine the “preferred form for making modifications”—a cornerstone of Open Source licensing.
This realization sparked the need to adapt the Open Source principles we all know so well to the unique world of AI.
Co-designing the future of Open Source AI
Once we understood the scope of the challenge, we knew that creating this definition couldn’t be a solo endeavor. It had to be co-designed with the global community. At the start of 2023, we had limited resources—just two full-time staff members and a small budget. But that didn’t stop us from moving forward. We began fundraising to support a multi-stakeholder, global conversation about what Open Source AI should look like.
We brought on Mer Joyce, a co-design expert who introduced us to creative methods that ensure decisions are made with the community, not for it. With her help, we started breaking the problem into smaller pieces and gathering insights from volunteers, AI experts and other stakeholders. Over time, we began piecing together what would eventually become v.0.0.9 of the Open Source AI Definition.
By early 2024, we had outlined the core principles of Open Source AI, drawing inspiration from the free software movement. We relied heavily on foundational texts like the GNU Manifesto and the Four Freedoms of software. From there, we built a structure that mirrored the values of freedom, collaboration and openness, but tailored specifically to the complexities of AI.
Addressing the unique challenges of AI
Of course, defining the freedoms was only part of the battle. AI and machine learning systems posed new challenges that we hadn’t encountered in traditional software. One of the key questions we faced was: What is the preferred form for making modifications in AI? In traditional software, this might be source code. But in AI, it’s not so straightforward. We realized that the “weights” of machine learning models—those parameters fine-tuned by data—are crucial. However, data itself doesn’t fit neatly into the Open Source framework.
This was a major point of discussion during the session. Code and weights need to be covered by an OSI-approved license because they represent the modifiable core of AI systems. However, data doesn’t meet the same criteria. Instead, we concluded that while data is essential for understanding and studying the system, it’s not the “preferred form” for making modifications. Instead, the data information and code requirements allow Open Source AI systems to be forked by third-party AI builders downstream using the same information as the original developers. These forks could include removing non-public or non-open data from the training dataset, in order to retrain a new Open Source AI system on fully public or open data. This insight was shaped by input from the community and experts who joined our study groups and voted on various approaches.
The road ahead: a collaborative future
As we wrap up this phase, the next step is gathering even more feedback from the community. The definition isn’t final yet, and it will continue to evolve as we incorporate insights from events like this summit. I’m incredibly grateful for the thoughtful comments we’ve already received from people all over the world who have helped guide us along this journey.
At the core of this project is the belief that Open Source AI should reflect the same values that have made Open Source a force for good in software development. We’re not there yet, but together, we’re building something that will have a lasting impact—not just on AI, but on the future of technology as a whole.
I want to thank everyone who has contributed to this project so far. Your dedication and passion are what make Open Source so special. Let’s continue to shape the future of AI, together.
Is “Open Source” ever hyphenated?
No! Open Source is never hyphenated when referring to software. If you’re familiar with English grammar you may have more than an eyebrow raised: read on, we have an explanation. Actually, we have two.
We asked Joseph P. De Veaugh-Geiss, a linguist and KDE’s project manager, to provide us with an explanation. If that’s not enough, we have one more argument at the end of this post.
Why Open Source is not hyphenated
In summary:
- “open source” (no hyphen) is a lexicalized compound noun that is no longer transparent with respect to its meaning (i.e., open source is not just about being source-viewable, but also about defining user freedoms) and that can be further compounded (as in “open source license”);
- by contrast, “open-source” (with a hyphen) is a compound modifier modifying the head noun (e.g. “intelligence”) with open having a standard dictionary meaning (i.e., “transparent” or “open to or in view of all”).
“Open source” is a lexicalized compound noun. Although it originates with the phrase “open source software”, today “open source” is itself a unique lexeme. An example, in Red Hat’s article:
Open source has become a movement and a way of working that reaches beyond software production.
The word open in “open source” does not have the meaning of “open” as one would find in the dictionary. Instead, “open source” also entails user freedoms: users do not have to negotiate with the rights owners to enjoy (use/improve/share/monetise) the software for any purpose. That is, it is not only about transparency.
A natural example of this usage, in which the phrase open source license is clearly about more than just licensing transparency:
“Because Linux is released under an open source license, which prevents restrictions on the use of the software, anyone can run, study, modify, and redistribute the source code, or even sell copies of their modified code, as long as they do so under the same license.” (from the Red Hat website, https://www.redhat.com/en/topics/open-source/what-is-open-source)
Note that “open source license” is itself a compound noun phrase made up of the lexicalized compound noun “open source” + the noun “license”; same for “open source movement”, etc.
What is lexicalization?
According to the Lexicon of linguistics (Utrecht University), ‘lexicalization’ is a “phenomenon by which a morphologically complex word starts to behave like an underived word in some respect, which means that at least one feature (semantic, syntactic, or phonological) becomes unpredictable”.
Underived word here means the phrase has a specific, unique meaning not (necessarily) transparent from its component parts. For instance, a “black market” is not a market which is black but rather a specific kind of market: an illegal one. A “blackboard” can be green. In other words, the entire complex phrase can be treated as a single unit of meaning stored in the mental lexicon. The meaning of the phrase is not derived using grammatical rules.
Today, the meaning of open source is unpredictable, or semantically opaque, given its usage (at least by a subset of speakers): open source is about user freedoms, not just transparency.
Other examples of lexicalized compound nouns include “yellow journalism”, “purple prose”, “dirty bomb”, “fat chance”, “green card”, “blackbird”, “greenhouse”, “high school”, etc. I tried to think of examples which are composed of adjectives + nouns but with a specific meaning not derivable by the combination of the two. I am sure you can come up with many more!
In some cases, lexicalization results in writing the compound noun phrase together as a single word (‘blackboard’), in other cases not (‘green card’). One can also build larger phrases by combining the lexicalized compound noun with another noun (e.g., black market dealer, green card holder).
Hyphenated open-source is a compound modifier
By contrast, open in “open-source intelligence” has the dictionary meaning of “open”, i.e., “open to or in view of all” or “transparent”. In this case, open-source is a compound modifier/compound adjective with a meaning comparable to “source-viewable”, “source-available”, “source-transparent”.
For compound modifiers, the hyphenation, though not obligatory, is common and can be used to disambiguate. The presence of a head noun like “intelligence” or “journalism” is obligatory for the compound-modifier use of open-source, unlike in lexicalized compounds.
Examples of other compound modifiers + a head noun: “long-term contract”, “single-word modifier”, “high-volume printer”, etc.
Examples
There are some examples of the compound-modifier use on Wikipedia where I think the difference in meaning between the lexicalized compound noun and the compound modifier becomes clear:
“Open-source journalism, a close cousin to citizen journalism or participatory journalism, is a term coined in the title of a 1999 article by Andrew Leonard of Salon.com.” (from Wikipedia)
“Open-source intelligence” is intelligence “produced from publicly available information that is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement” (from Wikipedia)
In these examples, open-source clearly refers to transparent, viewable-to-all sources and not to something like ‘guaranteeing user freedoms’. Moreover, my intuition is that removing the hyphen from these examples would change the meaning, however subtly, and could make the original sentences incoherent (without implicit internal modification while reading):
- “open source journalism” would refer to journalism about open source software (in the lexicalized sense above), not transparent, participatory journalism;
- “open source intelligence” would refer to intelligence about open source software (in the lexicalized sense above, whatever that would mean!), not intelligence from publicly available information.
If that explanation still doesn’t convince you, we invoke the rules of branding and “pull a Twitter”, who vandalized English with their Who To Follow : we say no hyphen!
Luckily others have already adopted the “no hyphen” camp, like the CNCF style guide. Debate closed.
If you like debates, let’s talk about capitalization: OSI in its guidelines chose to always capitalize Open Source because it is a proper noun with a specific definition. Which camp are you on?
Data Transparency in Open Source AI: Protecting Sensitive Datasets
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Tarunima Prabhakar
I am the research lead and co-founder at Tattle, a civic tech organization that builds citizen-centric tools and datasets to respond to inaccurate and harmful content. My broad research interests are at the intersection of technology, policy and global development. Prior to starting Tattle, I worked as a research fellow at the Center for Long-Term Cybersecurity at UC Berkeley, studying the deployment of behavioral credit scoring algorithms towards financial inclusion goals in the global majority. I’ve also been fortunate to work on award-winning ICTD and data-driven development projects with stellar non-profits. My career working in low-resource environments has turned me into an ardent advocate for Open Source development and citizen science movements.
Protecting Sensitive Datasets
I recently gave a lightning talk at IndiaFOSS about Uli, a project to co-design solutions to online gendered abuse in Indian languages. As part of this project, we’re building and maintaining datasets that are useful for machine learning models that detect abuse. The talk highlighted the care that must go into choosing a license for sensitive data, and why open datasets in Open Source AI should be carefully considered.
With the Uli project, we created a dataset annotated by gender rights activists and researchers who speak Hindi, Tamil and Indian English. Then, we fine-tuned Twitter’s XLM-RoBERTa model to detect gender abuse, which we deployed as a browser plugin. When activated, the Uli plugin redacts abusive tweets from a person’s feed. Another dataset we created was of slur words in the three languages that might be used to target people. Such a list is useful not only for the Uli plugin (these words are redacted from web pages when the plugin is installed) but also for any platform needing to moderate conversations in these languages. At the time of the plugin’s launch, we chose to license the two datasets under an Open Data License (ODL). The model is hosted on Hugging Face and the code is available on GitHub.
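The redaction step described above can be sketched in a few lines. This is a minimal illustration only, not Uli’s actual implementation: the word list and mask below are invented placeholders, and the real tool works on tweets and web pages in Hindi, Tamil and Indian English.

```python
import re

def redact(text: str, slur_list: list[str], mask: str = "***") -> str:
    """Replace any word from slur_list with a mask, case-insensitively.

    Word boundaries prevent matches inside longer words.
    """
    if not slur_list:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in slur_list) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(mask, text)

# Placeholder entries stand in for the real, expert-annotated slur list.
slurs = ["badword", "meanword"]
print(redact("A badword and a MEANWORD, but not badwordish.", slurs))
# → A *** and a ***, but not badwordish.
```

Note that because the pattern is built directly from the list, anyone who holds the list can craft spellings that evade it, which is exactly the moderation loophole the author describes below.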
As we have continued to maintain and grow Uli, we have reconsidered how we license the data. When thinking about how to license this data, several factors come into play. First, annotating a dataset on abuse is labor-intensive and mentally exhausting, and the expert annotators should be fairly compensated for their expertise. Second, when these datasets are used by platforms for abuse detection, it creates a potential loophole—if abusive users realize the list of flagged words is public, they can change their language to evade moderation.
These concerns have led us to think carefully about how to license the data. On one end of the spectrum, we could continue to make everything open, regardless of commercial use. On the other end, we could keep all the data closed. We’ve historically operated as an Open Source organization, and every decision we make about data access impacts how we license our machine learning models as well. We are trying to find a happy medium that balances the numerous concerns: recognition of effort and effectiveness of the data on one hand, and transparency, adaptability and extensibility on the other.
As we’ve thought about different strategies for data licensing, we haven’t been sure what that would mean for the license of the machine learning models. And that’s partly because we don’t have a clear definition for what “Open Source AI” really means.
It is for this reason that we’ve closely followed the Open Source Initiative’s (OSI) process for converging on a definition for Open Source AI. OSI has been grappling with the definition of “Open Source AI” as it pertains to the four freedoms: the freedom to use, study, modify, and share. Over the past year, the OSI has been iterating on a definition for Open Source AI, and they’ve reached a point where they propose the following:
- Open weights: The model weights and parameters should be open.
- Open source code: The source code used to train the system should be open.
- Open data or transparent data: Either the dataset should be open, or there should be enough detailed information for someone to recreate the dataset.
It’s important to note that the dataset doesn’t necessarily have to be open. The departure from a stance of a maximally open dataset accounts for the complexity of collecting and managing the data that drives real-world ML applications. While frontier models need to deal with copyright and privacy concerns, many smaller projects like ours worry about the uneven power dynamics between those creating the data and the entities using it. In our specific case, opening the data also reduces its efficacy.
But having struggled with papers that describe research or data without sharing the dataset itself, I also recognize that ‘enough detailed information’ might not be enough information to repeat, adapt or extend another group’s work. In the end, the question becomes: how much information about the dataset is enough to consider the model “open”? It’s a fine line, and not everyone is comfortable with OSI’s stance on this issue. For our project in particular, we are considering the option of a staggered data release: older data is released under an open data license, while the newest data requires users to request access.
If you have strong opinions on this process, I encourage you to visit the OSI website and leave feedback. The OSI process is influential, and your input on open weights, open code, and their specifications around data openness could shape the future of Open Source AI.
You can learn more about the participatory process behind the Uli dataset here, and about Uli and Tattle on their respective websites.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the forum: share your comment on the drafts.
- Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
- Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Open Source AI Definition – Weekly update September 23
- @nemobis points out that the term “skilled person” in the Open Source AI Definition needs clarification, especially when considering different legal systems. The term could lead to misinterpretations and suggests adjusting the wording to focus on access to data. Additionally, the term “substantially equivalent system” also requires a more precise definition.
- @shujisado adds that in Japan, the term “skilled person” is linked to patent law, which could complicate its interpretation. He proposes using a simpler term, like “person skilled in technology,” to avoid unnecessary debate.
- @stefano asks for suggestions for a better alternative to “skilled person,” such as “practitioner” or “AI practitioner.”
- @kjetilk jokingly suggests lowering the bar to “any random person with a computer,” emphasizing the importance of accessibility in open source, allowing anyone to engage regardless of formal training.
- @samj highlights that byte-for-byte reproducibility is unrealistic, as randomness and hardware variability make exact replication unachievable, similar to how different binaries perform equivalently despite differing checksums.
- @samj notes the existence of models like StarCoder2 and OLMo as examples of Open Source AI, refuting the claim that no models meet the standard. He stresses the need for the definition to encourage the development of new models rather than settling for an inadequate status quo.
- @kjetilk reflects on Mark Zuckerberg’s blog post about Llama 3.1, where Zuckerberg claims that “Open Source AI Is the Path Forward.” He points out that while it’s easy to agree with Zuckerberg’s sentiment, Llama 3.1 isn’t truly Open Source and wouldn’t meet the criteria for compliance under the OSAID. This raises important questions about how to engage with Meta: should the Open Source community push them away, or guide them toward creating OSAID-compliant models? Furthermore, @kjetilk wonders how this affects perceptions of Open Source, especially in light of EU legislation and the broader governance issues around Open Source.
- @shujisado responds by noting that the Open Source Initiative (OSI) has already made it clear that Llama 2 (and by extension Llama 3.1) does not meet the Open Source definition, despite Zuckerberg’s claims. He suggests that Zuckerberg might be using a different definition of “open source,” particularly given the unclear legal landscape around AI training data and copyright. In his view, the creation of the Open Source AI Definition (OSAID) is the community’s formal response to Meta’s claims.
- The seventeenth edition of our town hall meetings was held on the 20th of September. If you missed it, the recording and slides can be found here.
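One technical point from the recap above, @samj’s observation that byte-for-byte reproducibility is the wrong bar, can be illustrated with a small sketch. The example uses JSON files as a stand-in for model artifacts (an assumption for illustration only): the two files carry identical content, yet their checksums differ.

```python
import hashlib
import json

# Two serializations of the same (hypothetical) model card:
# identical content, different byte layouts because key order differs.
a = json.dumps({"layers": 12, "hidden": 768})
b = json.dumps({"hidden": 768, "layers": 12})

digest_a = hashlib.sha256(a.encode()).hexdigest()
digest_b = hashlib.sha256(b.encode()).hexdigest()

# Byte-for-byte comparison fails...
assert digest_a != digest_b
# ...yet the artifacts are functionally equivalent.
assert json.loads(a) == json.loads(b)
```

The same effect shows up in compiled binaries (embedded timestamps and paths) and in ML training (nondeterministic GPU kernels, random initialization), which is why equivalence of behavior, not of bytes, is the practical test.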
David Manset: Voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet David Manset
What’s your background related to Open Source and AI?
My background in Open Source and AI is shaped by my ongoing experience as the senior coordinator of the Open Source Ecosystem Enabler (OSEE) project at the United Nations International Telecommunication Union (ITU). The project, developed in collaboration with the UN Development Programme and funded by the EU’s Directorate-General for International Partnerships, supports countries in developing digital public goods and services using Open Source. In this capacity, a significant part of my work involves driving Open Source initiatives for a variety of public-sector use cases.
Since witnessing the birth of an Open Source AI definition at the DPGA Annual Members meeting in 2023, I have been contributing to the Open Source AI agenda and, more recently, to various Open Source AI initiatives within the ITU Open Source Program Office (OSPO). Additionally, I co-lead the Open Source AI for Digital Public Goods (OSAI4DPG) track at AI for Good, focusing on creating AI-driven public goods that are both accessible and affordable.
One of my recent achievements includes co-organizing the AIntuition hackathon aimed at developing cost-effective Open Source AI solutions. This event focused on utilizing Retrieval Augmented Generation (RAG) and Large Language Models (LLMs) to create a basic yet understandable and adaptable prototype implementation for public administration. My efforts in this area highlight my commitment to practical and usable AI tools that meet public sector needs.
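The RAG pattern mentioned above can be reduced to two steps: retrieve a relevant document, then hand it to a generator as context. A minimal, illustrative sketch in Python, using word overlap in place of embeddings and a prompt template in place of an actual LLM call (the function names and sample documents here are hypothetical, not from the hackathon prototype):

```python
# Minimal RAG loop: retrieve the most relevant document for a query,
# then build a prompt that gives it to a generator as context.
# Word overlap stands in for embedding similarity; the template
# stands in for a real LLM call.

def retrieve(query: str, documents: list[str]) -> str:
    """Pick the document sharing the most words with the query."""
    q_words = set(query.lower().split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

def answer_with_context(query: str, context: str) -> str:
    """Stand-in for an LLM call: prepend the retrieved context to the prompt."""
    return f"Context: {context}\nQuestion: {query}"

documents = [
    "Office hours for the civil registry are 9:00 to 17:00 on weekdays.",
    "Passport renewals require form A-12 and a recent photograph.",
]
query = "When is the registry open?"
prompt = answer_with_context(query, retrieve(query, documents))
print(prompt)
```

A production system would swap `retrieve` for a vector search over embeddings and `answer_with_context` for a call to an actual model; the control flow stays the same, which is what makes such prototypes understandable and adaptable.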
Prior to my role at ITU, I worked in the private sector, where I developed AI services that enhanced healthcare services and protected patients/citizens. This experience gives me a well-rounded perspective on implementing and scaling Open Source AI technologies for public benefit.
What motivated you to join this co-design process to define Open Source AI?
My motivation to participate in this co-design process for defining Open Source AI is deeply rooted in my former experiences in software development and as the coordinator of the OSEE project, where my focus lies in enhancing digital public services and developing digital public goods. Open Source AI indeed presents a unique opportunity, especially for the public sector, to adopt cost-effective and scalable solutions that can significantly improve public services. However, to harness these benefits, it is imperative to establish a clear, standardized and consensual definition of Open Source AI. This definition will serve as a foundational guideline, ensuring transparency and understanding of the specific types of AI technologies being developed and implemented.
Moreover, my involvement is driven by the critical work of the ITU OSPO, particularly in developing Open Source AI solutions tailored for low- and middle-income countries (LMICs). These regions often face challenges such as scarce resources and limited representation in global AI training processes. By contributing to the development of Open Source AI, I aim to support these countries in accessing affordable and effective AI technologies, thereby promoting greater equity in AI development and utilization. This effort is not just about technology but also about fostering global inclusivity and ensuring that the benefits of AI are accessible to all.
Why do you think AI should be Open Source?
AI should be Open Source for several compelling reasons, especially when considering its potential impact on global development and governance. First, transparency, traceability and explainability are crucial, particularly in digital public services. Open Source AI allows public scrutiny of the algorithms and models used, ensuring that decision-making processes are transparent and accountable. This is vital for building trust in AI systems, especially in sectors like healthcare, education and public administration, where decisions can significantly impact individuals and communities.
Second, accessibility and affordability are key benefits for LMICs. Open Source AI lowers the barriers to entry, enabling these countries to access cutting-edge technologies without the prohibitive costs associated with proprietary systems. This democratization of AI technology ensures that even resource-constrained nations can harness AI’s transformative potential. Moreover, Open Source AI fosters greater representation and competition for LMICs in the global AI landscape. By contributing to and benefiting from Open Source projects, these countries can influence AI development and ensure that their specific needs and contexts are considered.
Finally, as AI increasingly becomes a foundational technology, Open Source serves as a universal resource that can be adapted and improved by anyone, promoting innovation and inclusivity across the globe.
What new perspectives or ideas did you encounter while participating in the co-design process?
Participating in the co-design process introduced me to several new perspectives and ideas that have deepened my understanding of the role of Open Source AI, particularly in supporting global development. One key insight is the realization that LMICs would significantly benefit from having access to an Open Source AI reference implementation. This concept, which we are actively working on, would provide these countries with a practical, ready-to-use model for AI development, helping them overcome resource constraints and accelerate their AI initiatives.
Another important perspective is that Open Source AI requires solid foundational elements—an Open Source mindset, adherence to best practices, and generalized policies must be embedded across all organizations involved. This is not just about technology; it’s about fostering a culture and infrastructure that supports Open Source principles at every level. Notably, ITU is now coordinating the definition of a common policy framework for United Nations Open Source initiatives, which will be crucial in guiding future Open Source AI developments. This framework will ensure that Open Source AI projects are supported by robust Open Source policies, promoting sustainable and equitable technological advancement worldwide.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
The primary benefit of a clear definition of Open Source AI will be the establishment of a unified framework that ensures transparency, accessibility, and ethical standards in AI development. This clarity will enable broader adoption across various sectors, particularly in LMICs, by providing a reliable foundation for building and implementing AI technologies. It will also foster global collaboration, ensuring that AI advancements are inclusive and equitable, while promoting innovation through open contributions, ultimately leading to more trustworthy and widely beneficial AI solutions.
What do you think are the next steps for the community involved in Open Source AI?
Once a global standard definition of Open Source AI is established, the Open Source AI community should focus on several key steps to ensure its widespread adoption and effective implementation. These include developing comprehensive guidelines and best practices, creating reference implementations to help organizations, particularly in LMICs, adopt the standard, and enhancing global collaboration through international networks and partnerships. Additionally, launching education and awareness campaigns will be crucial for informing stakeholders about the benefits and practices of Open Source AI. Establishing a governance and compliance framework will help maintain the integrity of AI projects, while supporting policy development and advocacy will ensure alignment with national and international regulations. Finally, fostering innovation and research through funding, hackathons, and collaborative platforms will drive ongoing advancements in Open Source AI. These steps will help build a robust, inclusive, and impactful Open Source AI ecosystem that benefits societies globally.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the forum: share your comments on the drafts.
- Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
- Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Open Source AI Definition – Weekly update September 16
- OSI invites individuals and organizations to endorse the Open Source AI Definition (OSAID). Endorsers will have their name and affiliation listed in the press release for Release Candidate 1 (RC1), which is expected to be finalized by the end of September. Those endorsing version 0.0.9 will be contacted again to confirm their support if there are any changes leading up to RC1.
- @mjbommar encourages reviewing the U.S. Copyright Office’s guidance on text and data mining (TDM) exceptions, which provides clear explanations and limitations, especially focusing on non-commercial, scholarly, and teaching uses. He emphasizes that the TDM guidance operates within narrow parameters that are often misunderstood or overlooked.
- @quaid proposes adding nuance to the Open Source AI (OSAI) Definition by introducing two designations: OSAI D+ (with open data) and OSAI D- (without open data, due to legitimate reasons beyond the creator’s control). He suggests using a dataset certificate of origin (dataset DCO) for self-verification to ensure compliance.
- @kjetilk agrees that verification is key but questions whether data information alone is sufficient for verification. He highlights that verifying rights to the data may not always be possible.
- @stefano appreciates the quadrant system’s clarity and confirms @quaid’s proposal for OSAI D- to be reserved for those with legitimate reasons for not sharing data.
- @thesteve0 expresses skepticism about broadening the “Open Source” label. He argues that without access to both data and code, AI models cannot truly be Open Source and suggests labeling such models as “open weights” instead.
- @shujisado notes the importance of data access in AI, pointing out that OSAID requires detailed information about how data is sourced, including provenance and selection criteria. He also discusses potential legal and ethical reasons for not sharing datasets.
- @Shamar raises concerns about “openwashing” in AI, where developers might distribute a model with a different dataset, undermining trust. He argues that distinguishing between OSAI D+ and D- risks legal complications for derivative works, suggesting that models without open data should not be considered truly open.
- @zack supports the idea of a tiered system (D+ and D-) as an improvement over the current situation, as it incentivizes progress from D- to D+. He is skeptical about verifiability but sees potential in the branding aspect of the proposal.
- @stefano asks @arandal about suggested edits, which include renaming data as “source data,” allowing open-source AI developers to require downstream modifications with open data, and permitting downstream developers to use open data to fine-tune models trained on non-public data. He further asks whether @arandal sees training data as being to model weights what source code is to binary code.
- @shujisado agrees with @stefano and points out that while many interpret OSD-compliant licenses to include CC4 and CC0, OSI has not officially evaluated Creative Commons licenses for compliance. He highlights concerns about CC0’s patent defense, which could be crucial for datasets.
- @mjbommar echoes the concerns about patent defense, noting it as a critical issue in both software and data licensing.
- @Shamar supports the first two suggestions but argues that models trained on non-public data cannot meet an “Open Source AI” definition, as they limit the freedom to study and modify, which are core principles of Open Source.
- @nick shares an article by Nathan Lambert, reviewed by key figures in the Open Source AI space, discussing the challenges of training data and the current Open Source AI definition. Percy Liang’s view (shared on X) is highlighted: he suggests that releasing an entire dataset is neither sufficient nor necessary for Open Source AI. He emphasizes the need for detailed code of the data processing pipeline for transparency, beyond just releasing the dataset.
- @shujisado discusses the legal nuances of using U.S. government documents in AI training, emphasizing that while they may be used in the U.S., legal complications arise in other jurisdictions.
- @Shamar stresses that Open Source AI should provide all the necessary data and processing information to recreate a system, otherwise, calling it Open Source is “open washing.”
- @Shamar proposes a clearer distinction between “source data” and “processing information” in the Open Source AI definition to ensure transparency and reproducibility. He suggests source data should be publicly available under the same terms that allowed its original use, while the process used to train the system should be shared under an Open Source license. His formulation aims to prevent loopholes that could lead to open-washing and emphasizes the importance of granting all four freedoms (study, modify, distribute, and use) to qualify as Open Source AI.
- @nick disagrees, arguing that @Shamar’s proposal misunderstands the difference between the rights to use data for training and the rights to distribute it. He also challenges the claim that exact replication of AI systems can be guaranteed, even with access to the same data.
- The sixteenth edition of our town hall meetings was held on the 13th of September. If you missed it, the recording and slides can be found here.
Copyright law makes a case for requiring data information rather than open datasets for Open Source AI
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Felix Reda
Photo credit: Volker Conradus, volkerconradus.com, CC BY 4.0 International.
Felix Reda (he/they) has been an active contributor to the Open Source AI Definition (OSAID) co-design process, bringing his personal interest and expertise in copyright reform to the online forums. Working in digital policy for over ten years, including serving as a member of the European Parliament from 2014 to 2019 and working with the strategic litigation NGO Gesellschaft für Freiheitsrechte (GFF), Felix is currently the director of developer policy at GitHub. He is also an affiliate of the Berkman Klein Center for Internet and Society at Harvard and serves on the board of the Open Knowledge Foundation Germany. He holds an M.A. in political science and communications science from the University of Mainz, Germany.
Data information as a viable alternative
Note: The original text was contributed by Felix Reda to the discussions happening on the Open Source AI forum as a response to Stefano Maffulli’s post on how the draft Open Source AI Definition arrived at its current state, the design principles behind the data information concept and the constraints (legal and technical) it operates under.
When we look at applying Open Source principles to the subject of AI, copyright law comes into play, especially for the topic of training data access. Open datasets have been a continuous discussion point in the collaborative process of writing the Open Source AI Definition. I would like to explain why the concept of data information is a viable alternative for the purposes of the OSAID.
The definition of Open Source software has an access element and a legal element – the access element being the availability of the source code and the legal element being a license rooted in the copyright-protection given to software. The underlying assumption is that the entity making software available as Open Source is the rights holder in the software and is therefore entitled to make the source code available without infringing the copyright of a third party, and to license it for re-use. To the extent that third-party copyright-protected material is incorporated into the Open Source software, it must itself be released under a compatible Open Source license that also allows the redistribution.
When it comes to AI, the situation is fundamentally different: The assumption that an Open Source AI model will only be trained on copyright-protected material that the developer is entitled to redistribute does not hold. Different copyright regimes around the world, including the EU, Japan and Singapore, have statutory exceptions that explicitly allow text and data mining for the purposes of AI training. The EU text and data mining exceptions, which I know best, were introduced with the objective of facilitating the development of AI and other automated analytical techniques. However, they only allow the reproduction of copyright-protected works (aka copying), but not the making available of those works (aka posting them on the internet).
That means that an Open Source AI definition requiring republication of the complete dataset for an AI model to qualify as Open Source would categorically exclude Open Source AI models from relying on the text and data mining exceptions in copyright. That is despite the fact that the legislator explicitly decided that, under certain circumstances (for example, allowing rights holders to declare a machine-readable opt-out from training outside the context of scientific research), the use of copyright-protected material for the purposes of training AI models should be legal. This result would be particularly counterproductive because it would render Open Source AI models illegal even in situations where the reproducibility of the dataset would be complete by the standards discussed on the OSAID forum.
Examples
Imagine an AI model that was trained on publicly accessible text on the internet that was version-controlled, for which the rights holder had not declared an opt-out, but which the rights holder had also not put under a permissive license (all rights reserved). Using this text as training data for an AI model would be legal under copyright law, but re-publishing the training dataset would be illegal. Publishing information about the training dataset that included the version of the data that was used, when and how it was retrieved from which website, and how it was tokenized would meet the requirements of the OSAID v 0.0.8 if (and only if) it put a skilled person in the position to build their own dataset to recreate an equivalent system.
Neither the developer of the original Open Source AI model nor the skilled person recreating it would violate copyright law in the process, unlike the scenario that required publication of the dataset. Including a requirement in the OSAID to publish the data, in which the AI developer typically does not hold the copyright, would have little added benefit but would drastically reduce the material that could be used for training, despite the existence of explicit legal permissions to use that content for AI training. I don’t think that would be wise.
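The kind of record described above can be made concrete as structured metadata: enough detail for a skilled person to rebuild an equivalent dataset, without republishing the copyrighted text itself. A hypothetical sketch of such a “data information” manifest, with field names that are illustrative assumptions rather than anything specified by the OSAID:

```python
import json

# Hypothetical "data information" record for one training source.
# It captures which version of the data was used, when and how it was
# retrieved, and how it was tokenized -- without including the text itself.
# All field names and values are illustrative, not an official schema.
data_information = {
    "source_url": "https://example.org/corpus",
    "version": "2024-06-01",                  # version of the text that was used
    "retrieved_at": "2024-06-03T12:00:00Z",   # when it was retrieved
    "retrieval_method": "HTTP crawl; robots.txt respected",
    "tokenizer": "byte-pair encoding, 32k vocabulary",
    "filtering": "deduplicated by document hash; opted-out domains excluded",
}

# Serialize so the record can be published alongside the model.
print(json.dumps(data_information, indent=2))
```

Publishing a record like this is legal everywhere the crawl itself was legal, because no third-party text is made available; only facts about the process are shared.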
The international concern of public domain
While I support the creation of public domain datasets that can be republished without restrictions, I would like to caution against pointing to these efforts as a solution to the problem of copyright in training datasets. Public domain status is not harmonized internationally: what is in the public domain in one jurisdiction is routinely protected by copyright in other parts of the world. For example, in US discourse it is often assumed that works generated by US government employees are in the public domain. They are not: they are in the public domain only in the US, while they are copyright-protected in other jurisdictions.
The same goes for works in which copyright has expired: Although the Berne Convention allows signatory countries to limit the copyright term on works until protection in the work’s country of origin has expired, exceptions to this rule are permitted. For example, although the first incarnation of Mickey Mouse has recently entered the public domain in the US, it is still protected by copyright in Germany due to an obscure bilateral copyright treaty between the US and Germany from 1892. Copyright protection is not conditional on registration of a work, and no even remotely comprehensive, reliable rights information on the copyright status of works exists. Good luck to an Open Source AI developer who tried to stay on top of all of these legal pitfalls.
Bottom line
There are solid legal permissions for using copyright-protected works for AI training (reproductions). There are no equivalent legal permissions for incorporating copyright-protected works into publishable datasets (making available). What an Open Source AI developer thinks is in the public domain and therefore publishable in an open dataset regularly turns out to be copyright-protected after all, at least in some jurisdictions.
Unlike reproductions, which only need to follow the copyright law of the country in which the reproduction takes place, making content available online needs to be legal in all jurisdictions from which the content can be accessed. If the OSAID required the publication of the dataset, this would routinely lead to situations where Open Source AI models could not be made accessible across national borders, thus impeding their collaborative improvement, one of the great strengths of Open Source. I doubt that with such a restrictive definition, Open Source AI would gain any practical significance. Tragically, the text and data mining exceptions that were designed to facilitate research collaboration and innovation across borders would then only support proprietary AI models, while excluding Open Source AI. The concept of data information will help us avoid that pitfall while staying true to Open Source principles.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the forum: share your comments on the drafts.
- Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
- Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.