FLOSS Research
Copyright law makes a case for requiring data information rather than open datasets for Open Source AI
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Felix Reda
Photo credit: Volker Conradus (volkerconradus.com), CC BY 4.0 International.
Felix Reda (he/they) has been an active contributor to the Open Source AI Definition (OSAID) co-design process, bringing his personal interest and expertise in copyright reform to the online forums. Working in digital policy for over ten years, including serving as a member of the European Parliament from 2014 to 2019 and working with the strategic litigation NGO Gesellschaft für Freiheitsrechte (GFF), Felix is currently the director of developer policy at GitHub. He is also an affiliate of the Berkman Klein Center for Internet and Society at Harvard and serves on the board of the Open Knowledge Foundation Germany. He holds an M.A. in political science and communications science from the University of Mainz, Germany.
Data information as a viable alternative
Note: The original text was contributed by Felix Reda to the discussions happening on the Open Source AI forum as a response to Stefano Maffulli’s post on how the draft Open Source AI Definition arrived at its current state, the design principles behind the data information concept and the constraints (legal and technical) it operates under.
When we look at applying Open Source principles to the subject of AI, copyright law comes into play, especially for the topic of training data access. Open datasets have been a continuous discussion point in the collaborative process of writing the Open Source AI Definition. I would like to explain why the concept of data information is a viable alternative for the purposes of the OSAID.
The definition of Open Source software has an access element and a legal element – the access element being the availability of the source code and the legal element being a license rooted in the copyright-protection given to software. The underlying assumption is that the entity making software available as Open Source is the rights holder in the software and is therefore entitled to make the source code available without infringing the copyright of a third party, and to license it for re-use. To the extent that third-party copyright-protected material is incorporated into the Open Source software, it must itself be released under a compatible Open Source license that also allows the redistribution.
When it comes to AI, the situation is fundamentally different: The assumption that an Open Source AI model will only be trained on copyright-protected material that the developer is entitled to redistribute does not hold. Different copyright regimes around the world, including the EU, Japan and Singapore, have statutory exceptions that explicitly allow text and data mining for the purposes of AI training. The EU text and data mining exceptions, which I know best, were introduced with the objective of facilitating the development of AI and other automated analytical techniques. However, they only allow the reproduction of copyright-protected works (aka copying), but not the making available of those works (aka posting them on the internet).
That means that an Open Source AI definition that would require the republication of the complete dataset in order for an AI model to qualify as Open Source would categorically exclude Open Source AI models from the ability to rely on the text and data mining exceptions in copyright. That is despite the fact that the legislator explicitly decided that, under certain circumstances (for example, allowing rights holders to declare a machine-readable opt-out from training outside of the context of scientific research), the use of copyright-protected material for the purposes of training AI models should be legal. This result would be particularly counterproductive because it would even render Open Source AI models illegal in situations where the reproducibility of the dataset would be complete by the standards discussed on the OSAID forum.
Examples
Imagine an AI model that was trained on publicly accessible text on the internet that was version-controlled, for which the rights holder had not declared an opt-out, but which the rights holder had also not put under a permissive license (all rights reserved). Using this text as training data for an AI model would be legal under copyright law, but re-publishing the training dataset would be illegal. Publishing information about the training dataset that included the version of the data that was used, when and how it was retrieved from which website, and how it was tokenized would meet the requirements of the OSAID v0.0.8 if (and only if) it put a skilled person in the position to build their own dataset to recreate an equivalent system.
Neither the developer of the original Open Source AI model nor the skilled person recreating it would violate copyright law in the process, unlike the scenario that required publication of the dataset. Including a requirement in the OSAID to publish the data, in which the AI developer typically does not hold the copyright, would have little added benefit but would drastically reduce the material that could be used for training, despite the existence of explicit legal permissions to use that content for AI training. I don’t think that would be wise.
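To make the example above more concrete, here is a minimal sketch of how such data information could be recorded in machine-readable form. The OSAID does not prescribe any format; the `DataSourceRecord` schema, its field names, and the sample values (including the `example.org` URL and the tokenizer name) are hypothetical and only mirror the items listed in the example: which version of the data was used, when and how it was retrieved, from which website, and how it was tokenized.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DataSourceRecord:
    """One entry of data information for a training source (hypothetical schema)."""

    source_url: str         # website the text was retrieved from
    content_version: str    # version-control revision of the text that was used
    retrieved_at: str       # ISO 8601 timestamp of the retrieval
    retrieval_method: str   # how the text was fetched
    tokenizer: str          # how the text was tokenized
    redistributable: bool   # False when the text is all rights reserved


def missing_fields(record: DataSourceRecord) -> list[str]:
    """Return the names of text fields that were left empty."""
    return [
        name
        for name, value in asdict(record).items()
        if isinstance(value, str) and not value.strip()
    ]


def to_manifest(records: list[DataSourceRecord]) -> str:
    """Serialize the records to JSON, refusing entries with empty fields."""
    for record in records:
        gaps = missing_fields(record)
        if gaps:
            source = record.source_url or "<unknown source>"
            raise ValueError(f"{source} is missing: {', '.join(gaps)}")
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


# Hypothetical values, for illustration only.
example = DataSourceRecord(
    source_url="https://example.org/articles",
    content_version="rev-2024-05-01",
    retrieved_at="2024-05-02T09:30:00Z",
    retrieval_method="HTTP GET of each page, honoring machine-readable opt-outs",
    tokenizer="hypothetical-bpe-v1, 32k vocabulary",
    redistributable=False,
)
manifest = to_manifest([example])
print(manifest)
```

The manifest describes the data without containing any of it, so publishing it involves no making available of the underlying works. Whether a given record is detailed enough still depends on the test stated above: it has to put a skilled person in the position to build their own dataset.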
The international concern of public domain
While I support the creation of public domain datasets that can be republished without restrictions, I would like to caution against pointing to these efforts as a solution to the problem of copyright in training datasets. Public domain status is not harmonized internationally – what is in the public domain in one jurisdiction is routinely protected by copyright in other parts of the world. For example, in US discourse it is often assumed that works generated by US government employees are in the public domain. They are not: they are only in the public domain in the US, while they are copyright-protected in other jurisdictions.
The same goes for works in which copyright has expired: Although the Berne Convention allows signatory countries to limit the copyright term on works until protection in the work’s country of origin has expired, exceptions to this rule are permitted. For example, although the first incarnation of Mickey Mouse has recently entered the public domain in the US, it is still protected by copyright in Germany due to an obscure bilateral copyright treaty between the US and Germany from 1892. Copyright protection is not conditional on registration of a work, and no even remotely comprehensive, reliable rights information on the copyright status of works exists. Good luck to an Open Source AI developer who tries to stay on top of all of these legal pitfalls.
Bottom line
There are solid legal permissions for using copyright-protected works for AI training (reproductions). There are no equivalent legal permissions for incorporating copyright-protected works into publishable datasets (making available). What an Open Source AI developer thinks is in the public domain and therefore publishable in an open dataset regularly turns out to be copyright-protected after all, at least in some jurisdictions.
Unlike reproductions, which only need to follow the copyright law of the country in which the reproduction takes place, making content available online needs to be legal in all jurisdictions from which the content can be accessed. If the OSAID required the publication of the dataset, this would routinely lead to situations where Open Source AI models could not be made accessible across national borders, thus impeding their collaborative improvement, one of the great strengths of Open Source. I doubt that with such a restrictive definition, Open Source AI would gain any practical significance. Tragically, the text and data mining exceptions that were designed to facilitate research collaboration and innovation across borders would only support proprietary AI models, while excluding Open Source AI. The concept of data information will help us avoid that pitfall while staying true to Open Source principles.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the forum: share your comment on the drafts.
- Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
- Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Jordan Maris joins OSI
Helen Keller said, “Alone we can do so little; together we can do so much.” Although she wouldn’t have understood this 2024 expression, we know “she nailed it.” It takes many of us working together to truly accomplish great things. That’s why the OSI staff is so excited to welcome Jordan Maris to our team.
As OSI’s European Policy Analyst, Jordan will work to build a bridge between European Union legislators, the OSI and the wider Open Source community. He will monitor upcoming EU policies and flag issues and opportunities, educate and inform EU lawmakers about Open Source and its benefits, represent the OSI at EU-level events and conferences, and provide analysis and support to the OSI’s board and members on EU policy issues. He will also work closely with other Open Source foundations and organizations to make sure the voice of the Open Source community is heard at an EU level.
Jordan comes well-equipped with the experience he needs to excel in this role. He worked for three years with members of the European Parliament. In his previous position as a senior parliamentary policy advisor, he fought for the Open Source community on laws such as the AI Act, European Digital Identity, Data Act, Product Liability Directive, and Cyber-Resilience Act. He is a strong advocate for the Public Money–Public Code principle and a long-time user of and occasional contributor to Open Source software. He speaks English, French and German.
When asked about his vision for the future of Open Source, Jordan replied, “A world where Open Source is the rule — not the exception, and where developers and communities are consistently supported, listened to and valued.”
Jordan says, “I’m looking forward to being able to devote more time to raising awareness about Open Source among lawmakers and to bringing together the Open Source community and EU lawmakers so that new laws better reflect the needs of the Open Source community.”
Please join me in welcoming Jordan to the team.
Open Source AI Definition – Weekly update September 9
Week 36 summary
Draft v.0.0.9 of the Open Source AI Definition is available for comments
- @Shamar agrees with @thesteve0 and emphasizes that AI systems consist of two parts: a virtual machine (architecture) and the weights (the executable software). He argues that while weights are important, they are not sufficient to study or fully understand an AI model. For a system to be truly Open Source, it must provide all the data used to recreate an exact copy of the model, including random values used during the process. Without this, the system should not be labeled Open Source, even if the weights are available under an open-source license. Shamar suggests calling such systems “freeware” instead and ensuring the Open Source AI Definition aligns with the Open Source Definition.
- @jberkus questions whether creating an exact copy of an AI system is truly possible, even with access to all the training data, or if slight differences would always exist.
- @shujisado explains that under Japan’s copyright law, AI training on publicly available copyrighted works is permissible, but sharing the datasets created during training requires explicit permission from copyright holders. He notes that while AI training within legal limits may be allowed in many jurisdictions, making all training data freely available is unlikely. He adds that the current Open Source AI Definition strikes a reasonable balance given global intellectual property rights but suggests that more specific language might help clarify this further.
- @marianataglio suggests including hardware specifications, training time, and carbon footprint in the Open Source AI Definition to improve transparency. She believes this would enhance reproducibility, accessibility, and collaboration, while helping practitioners estimate computational costs and optimize models for more efficient training.
- The fifteenth edition of our town hall meetings was held on the 6th of September. If you missed it, the recording and slides can be found here.
- @Alek_Tarkowski agrees with @arandal on the importance of situating Open Source AI within broader open movements like open data. He suggests cooperation with organizations like Creative Commons should go beyond licensing standards to include data governance, which remains an undeveloped area.
- @Alek_Tarkowski finds the idea of requiring source data to follow Open Source licenses conceptually interesting, likening it to “upstream copyleft,” but notes traditional copyleft frameworks may not suit AI development.
- @arandal clarifies that the proposal is an evolution of software freedom principles, not a direct extension of traditional copyleft, similar to how AGPL addressed gaps left by earlier licenses. They further mention that discussions on these approaches are ongoing across various organizations, though formal publications are limited.
- @Senficon highlights a concern from the open science community that, while EU copyright law allows reproductions of protected content for research, it restricts making the research corpus available to third parties. This limits research reproducibility and open access, as it aims to protect rights holders’ revenue.
- @kjetilk agrees with the observation but questions the assumption that making content publicly available would significantly harm rights holders’ revenue. He believes such policies should be based on solid evidence from extensive research.
Members Newsletter – September 2024
It’s been a busy couple of months, and things are going to stay that way as we approach All Things Open in October. Version 0.0.9 of the Open Source AI Definition has been released after collecting months of community feedback.
We’re continuing our march towards a stable release by the end of October 2024, at All Things Open. Get involved by joining the discussion on the forum, finding OSI staff around the world, and online at the weekly town halls. The community continues iterating through drafts after meeting diverse stakeholders at the worldwide roadshow, collecting feedback and carefully looking for new arguments in dissenting opinions. All thanks to a grant by the Alfred P. Sloan Foundation. We also need to decide how to best address the reviews of new licenses for datasets, documentation and the agreements governing model parameters.
The lively conversations will continue at conferences, town halls, and online. The first two stops were at AI_dev and Open Source Congress. Other events are planned to take place in Africa, South America, Europe and North America.
On a separate delightful note, the Open Source community got some welcome news on August 29, as Elastic returned to the community by adding the AGPL licensing option for Elasticsearch and Kibana. This decision is confirmation that shipping software with licenses that comply with the Open Source Definition is valuable—to the maker, to the customer, and to the user. Elastic’s choice of a strong copyleft license signals the continuing importance of that license and its dual effect: one, it’s designed to preserve the user’s freedoms downstream, and two, it also grants strong control over the project by the single-vendor developers.
We’re encouraged to see Elastic return to the Open Source community. And who knows… maybe others will follow suit!
Stefano Maffulli
Executive Director, OSI
I hold weekly office hours on Fridays with OSI members: book time if you want to chat about OSI’s activities, if you want to volunteer or have suggestions.
News from the OSI
Community input drives the new draft of the Open Source AI Definition
From the Research and Advocacy program
The Open Source AI Definition v0.0.9 has been released and collaboration continues at in-person events and in the online forums. Read what changes have been made, what to do next and how to get involved. Read more.
Three things I learned at KubeCon + AI_Dev China 2024
From the Research and Advocacy program
KubeCon China 2024 was a whirlwind of innovation, community and technical deep dives. As it often happens at these community events, I was blown away by the energy, enthusiasm and sheer amount of knowledge being shared. Read more.
Highlights from our participation at Open Source Congress
From the Research and Advocacy program
The Open Source Initiative (OSI) proudly participated in the Open Source Congress 2024, held from August 25-27 in Beijing, China. This event was a pivotal gathering for key individuals in the Open Source nonprofit community, aiming to foster collaboration, innovation, and strategic development within the ecosystem. Read more.
OSI in the news
Elasticsearch is open source, again
OSI at elastic.co
“Being able to call Elasticsearch and Kibana Open Source again is pure joy.” — Shay Banon, Elastic Founder and CTO. Read more.
Meta is accused of bullying the open source community
OSI at The Economist
Purists are pushing back against Meta’s efforts to set its own standard on the definition of open-source AI. Stefano Maffulli, head of the OSI, says Mr Zuckerberg “is really bullying the industry to follow his lead”. Read more.
Debate over “open source AI” term brings new push to formalize definition
OSI at Ars Technica
The Open Source Initiative (OSI) recently unveiled its latest draft definition for “open source AI,” aiming to clarify the ambiguous use of the term in the fast-moving field. The move comes as some companies like Meta release trained AI language model weights and code with usage restrictions while using the “open source” label. This has sparked intense debates among free-software advocates about what truly constitutes “open source” in the context of AI. Read more.
Other Highlights
- Open source AI now has a definition. This is what it means and why it’s still tricky (Euro News)
- We’re a big step closer to defining open source AI – but not everyone is happy (ZDNet)
- We finally have a definition for open-source AI (MIT Technology Review)
- We’re a long way from truly open-source AI (Financial Times)
- Like it or not, this open source AI definition takes a giant step forward (ZDNet)
- Mozilla Foundation: Celebrating An Important Step Forward For Open Source AI
- Python Software Foundation: Python Developers Survey 2023 Results
- OpenJS Foundation: OpenJS Foundation’s Leader Details the Threats to Open Source
- Linux Foundation: How open source is steering AI down the high road
- ClearlyDefined at SAP: enhancing Open Source license compliance through Open Source data
- Open Source visibility hacks — No icky marketing needed
- So, You Have Your 20-Page Open Source Strategy Doc. Now What?
- Pajamas to profit: Launch your Open Source empire
- Demystifying Open Source as a Business
The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process.
2024 Generative AI Survey
This survey aims to understand the deployment, use, and challenges of generative AI technologies in organizations and the role of open source in this domain. Take survey here.
Events
Upcoming events
- India FOSS (September 7-8, 2024 – Bengaluru)
- Open Source Summit Europe (September 16-18, 2024 – Vienna)
- Nerdearla Argentina (September 24-28, 2024 – Buenos Aires)
- Hacktoberfest (October – Online)
- SOSS Fusion (October 22-23, 2024 – Atlanta)
- Open Community Experience (October 22-24, 2024 – Mainz)
- All Things Open (October 27-29 – Raleigh)
- Nerdearla Mexico (November 7-9, 2024 – Mexico City)
- OpenForum Academy Symposium (November 13-14, 2024 – Boston)
- SCALE 22x (March 6-9, 2025 – Pasadena)
- Consul Conference (February 4-6, 2025 – Las Palmas de Gran Canaria)
- Mercado Libre
Interested in sponsoring, or partnering with, the OSI? Please see our Sponsorship Prospectus and our Annual Report. We also have a dedicated prospectus for the Deep Dive: Defining Open Source AI. Please contact the OSI to find out more about how your company can promote open source development, communities and software.
Support OSI by becoming a member!
Let’s build a world where knowledge is freely shared, ideas are nurtured, and innovation knows no bounds!
Highlights from our participation at Open Source Congress 2024
The Open Source Initiative (OSI) proudly participated in the Open Source Congress 2024, held from August 25-27 in Beijing, China. This event was a gathering for key individuals in the Open Source nonprofit community, aiming to foster collaboration, innovation, and strategic development within the ecosystem. Here are some highlights from OSI’s participation at the event.
Panel: Collaboration between Open Source Organizations
Stefano Maffulli, OSI’s Executive Director, played an important role in the panel on “Collaboration between Open Source Organizations.” This session, moderated by Daniel Goldscheider (Executive Director, OpenWallet Foundation) and Chris Xie (Board Advisor, Linux Foundation Research), brought together influential leaders, including Keith Bergelt (CEO, Open Invention Network), Bryan Che (Advisory Board Member, Software Heritage Foundation), Mike Milinkovich (Executive Director, Eclipse Foundation), Rebecca Rumbul (Executive Director, Rust Foundation), Xiaohua Xin (Deputy Secretary-General, OpenAtom Foundation), and Jim Zemlin (Executive Director, Linux Foundation). The panel discussed the importance of collaboration in addressing the challenges faced by the Open Source ecosystem and explored ways to strengthen inter-organizational ties.
Fireside Chat: Datasets, Privacy, and Copyright
Stefano Maffulli also led a fireside chat on “Datasets, Privacy, and Copyright” in the context of Open Source AI along with Donnie Dong (Steering Committee Member, Digital Asia Hub; Senior Partner, Hylands Law Firm). This session was particularly relevant given the growing concerns around AI and the legal implications of creating and distributing large datasets. The discussion provided valuable insights into how these issues intersect with Open Source principles and what steps the community can take to address them responsibly. Some questions addressed included the use of copyrighted material in training datasets; fair use in the context of AI training and content generation; and China’s AI regulatory framework.
Talk: The Open Source AI Definition
OSI’s involvement was further highlighted by Stefano Maffulli’s talk on “The Open Source AI Definition,” where he announced version 0.0.9 of the Open Source AI Definition (OSAID), a significant milestone resulting from a multi-year, global, and multi-stakeholder process. This version reflects the collective input of a diverse range of experts and community members who participated in extensive co-design workshops and public consultations, ensuring that the definition is robust, inclusive, and aligned with the principles of openness. Maffulli emphasized the importance of the “4 Freedoms of Open Source AI” (Use, Study, Modify, and Share) as foundational principles guiding the development of AI technologies. The session was particularly crucial for gathering feedback from the community in China, providing a platform for discussing the practical implications of the OSAID in different cultural and regulatory contexts.
Panel: The Future of Open Source Congress
Deborah Bryant, OSI’s US Policy Director, moderated a pivotal panel discussion on “The Future of Open Source Congress: Converting Ideas to Shared Action.” This session focused on how the community can transform discussions into actionable strategies, ensuring the continued growth and impact of Open Source globally.
Other highlights from the event
The “Unlocking Innovation: Open Strategies in Generative AI” panel led by Anni Lai (Chair of Generative AI Commons; Board member of LF AI & Data; Head of Open Source Operations, Futurewei) explored how openness is essential for advancing Generative AI innovation, democratizing access, and ensuring ethical AI practices. Panelists Richard Sikang Bian (Outreach Chair, LF AI & Data; Head of OSPO, Ant Group), Richard Lin (Member, OpenDigger Community; Head of Open Source, 01.ai), Ted Liu (Co-founder, KAIYUANSHE), and Zhenhua Sun (China Workgroup Chair, OpenChain; Open Source Legal Counsel, ByteDance) delved into the challenges of the Open Source generative AI landscape, such as “open washing,” inconsistent definitions, and the complexities of licensing. They highlighted the need for clear, standardized frameworks to define what truly constitutes Open Source AI, emphasizing that openness fosters transparency, accelerates learning, and mitigates biases. The panelists called for increased collaboration among stakeholders to address these challenges and further develop Open Source AI standards, ensuring that AI technologies are transparent, ethical, and widely adoptable.
In her closing keynote at the Open Source AI track, Amreen Taneja, Standards Lead at the Digital Public Goods Alliance (DPGA), emphasized the critical role of Open Source AI in advancing public good and supporting the Sustainable Development Goals (SDGs). She explained that Digital Public Goods (DPGs) are digital technologies made freely available to benefit society and highlighted the importance of OSAI in democratizing access to powerful AI technologies. Taneja outlined the DPGA’s efforts to align AI with public interests, including updating the DPG Standard to better accommodate AI, ensuring transparency in AI development, and promoting responsible AI practices that prioritize privacy and avoid harm. She stressed the need for rigorous evaluation, clear ownership, open licensing, and platform independence to drive the adoption of AI DPGs, ultimately aiming to create AI systems that are ethical, transparent, and beneficial for all.
Quotes from OSI Board and affiliates
Attending the Open Source Congress was really inspiring. Over two days, we participated in intensive discussions and exchanges with dozens of Open Source foundations and organizations worldwide, which was incredibly beneficial. I believe this will foster broader cross-community collaboration globally. I hope the conclusion of the second Open Source Congress marks the beginning of ongoing cooperation, allowing our “community of communities” to maintain regular communication and exchange.
Nadia Jiang – Board Chair of KAIYUANSHE
Open Source development experience is all about two words: consensus and antifragile decision-making process. The most valuable part of this event is seeing and listening to all the executive directors, open-source leaders in the room, and being very comfortable with the information density and the constructiveness of the discussions. Towards the end of the day, what people care about are not fundamentally different and there are indeed really difficult questions to resolve. I feel the world becomes slightly better after this OSC, and that means a lot to have an event like this.
Richard Bian – Head of Ant Group OSPO; Outreach Chair, Linux Foundation AI & Data
Open Source is the cornerstone of innovation, transparency, and collaboration, driving solutions that benefit everyone. The Open Source Congress 2024 represented a significant step forward in fostering alignment and building consensus within the open source community. By bringing together diverse voices and ideas, it amplified our collective efforts to create a more open, inclusive, and impactful digital ecosystem for the future.
Amreen Taneja – Standards Lead, Digital Public Goods Alliance
Stefano Maffulli with Board Directors of KAIYUANSHE: Emily Chen, Nadia Jiang (photo credits), and Ted Liu.
Conclusion
OSI’s active participation in the Open Source Congress 2024 reinforced its leadership role in the global Open Source community. By engaging in critical discussions, leading panels, and contributing to the future direction of Open Source initiatives, OSI continues to shape the landscape of Open Source development, ensuring that it remains inclusive, innovative, and aligned with the values of the global community.
This event marked another successful chapter in OSI’s ongoing efforts to drive collaboration and innovation in the Open Source world. We extend our sincere thanks to the organizers of OSC and the Open Source community in China for creating a platform that brought together a diverse and dynamic group of stakeholders, enabling meaningful discussions and progress. We look forward to continuing these conversations and turning ideas into action in the years to come.
Open Source AI Definition – Weekly update September 2nd
- @mkai added concerns about how OSI will address AI-generated content from both open and closed source models, given current legal rulings that such content cannot be copyrighted. He also suggests clarifying the difference between licenses for AI model parameters and the model itself within the Open Source AI Definition.
- @shujisado added that while media coverage of the OSAID v0.0.9 release is encouraging, he is not supportive of the idea of an enforcement mechanism to flag false open source AI. He believes this approach differs from OSI’s traditional stance and suggests it may be a misunderstanding.
- @jplorre added that while LINAGORA supports the proposed definition, they propose clarifying the term “equivalent system” to mean systems that produce the same outputs given identical inputs. They also suggest removing the specific reference to “tokenizers” in the definition, as it may not apply to all AI systems.
Draft v.0.0.9 of the Open Source AI Definition is available for comments
- @adafruit reconnects with @webmink and proposes updates to the Open Source AI Definition, including adding requirements for prompt transparency and data access during AI training. These updates aim to enhance the ability to audit, replicate, and modify AI models by providing detailed logs, documentation, and public access to prompts used during the training phase.
- @webmink appreciates the proposal but points out that it seems specific to a single approach, suggesting that it may need broader applicability.
- @thesteve0 criticizes the current definition, arguing that it does not grant true freedom to modify AI models because the weights, which are essential for using the model, cannot be reproduced without access to both the original data and code. He suggests that models sharing only their weights, especially when built on proprietary data, should be labeled as “open weights” rather than “open source.” He also expresses concern about the misuse of the “open source” label by some AI models, citing specific examples where the term is being abused.
- @pranesh added that it might be helpful to explicitly state that the governance of open-source AI is out of scope for OSAID, but also notes that neither the OSD nor the free software definition explicitly mention governance, so it may not be necessary.
- @kjetilk added that while governance issues have traditionally been unspoken, this unspoken nature is a key problem that needs addressing. He suggests that OSI should explicitly declare governance out of scope to allow others to take on this responsibility.
- @mjbommar added support for making an official statement that OSI does not intend to control governance, noting concerns that some might fear OSI is moving towards a walled governance approach. He references past regrets about not controlling the “open source” trademark as a means to combat open-washing.
- @nick added assurance that OSI has no intention of creating a walled governance garden, reaffirming the organization’s long-standing position against such control.
- @shujisado added that there seems to be a consensus within the OSAID process that governance is out of scope, and notes that related statements have already been moved to the FAQ section in recent versions.
- @pranesh mentions that, from a legal perspective, the percentage of infringement matters, citing the “de minimis” doctrine and defenses like “fair use” that consider the amount and purpose of infringement. He emphasizes that copyright laws in different jurisdictions vary, and not all recognize the same defenses as in the US.
- @mjbommar argues that the scale and nature of AI outputs make the “de minimis” defense irrelevant, especially when AI models generate significant amounts of copyrighted content. He stresses that the economic impact of AI-generated content is a key factor in determining whether it qualifies as transformative or infringes copyright.
- @shujisado highlights that in Japan, using copyrighted works for AI training is generally treated as an exception under copyright law, a stance that is also being adopted by neighboring East Asian countries. He suggests that approaches like the EU Directive are unlikely to become mainstream in Asia.
- @mjbommar acknowledges the global focus on US/EU laws but points out that many commonly used models are developed by Western organizations. He questions how Japan’s updated copyright laws align with international treaties like WCT/DMCA, expressing concern that they may allow practices that conflict with these agreements.
- @arandal emphasizes the importance of the Open Source Definition (OSD) as a unifying framework that accommodates diverse approaches within the open-source community. She argues that AI models, being a combination of source code and training data, should have their diversity in handling data explicitly recognized in the Open Source AI Definition. She proposes specific text changes to the draft to clarify that while some developers may be comfortable with proprietary data, others may not, and both approaches should be supported to ensure the long-term success of open-source AI.
- @mjbommar appreciates the spirit of Arandal’s proposal but adds that the OSI currently lacks specific licenses for data, which is why it is crucial for the OSI to collaborate with Creative Commons. Creative Commons maintains the ecosystem of “data licenses” that would be necessary under the proposed revisions to the Open Source AI Definition.
- @arandal agrees with the need for collaboration with organizations like Creative Commons, noting that this coordination is already reflected in checklist v. 0.0.9. She suggests that such collaboration is necessary even without the proposed revisions to ensure the definition accurately addresses data licensing in AI.
- @nick acknowledges the importance of working with organizations like Creative Commons and mentions that OSI is in ongoing communication with several relevant organizations, including MLCommons, the Open Future Foundation, and the Data and Trust Alliance. He highlights the recent publication of the Data Provenance Standards by the Data and Trust Alliance as an example of the kind of collaborative work that is being pursued.
- @mjbommar reiterates the need for explicit coordination with Creative Commons, arguing that the OSI cannot realistically finalize the Open Source AI Definition without such collaboration. He also suggests that the OSI should explore AI preference signaling and work with Creative Commons and SPDX/LF to establish shared standards, which should be part of the OSAID standard’s roadmap.
Join this week’s town hall to hear the latest developments, give your comments and ask questions.
Register for the town hall
Ezequiel Lanza: Voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Ezequiel Lanza
What’s your background related to Open Source and AI?
I’ve been working in AI for more than 10 years (Yes, before ChatGPT!). With a background in engineering, I’ve consistently focused on building and supporting AI applications, particularly in machine learning and data science. Over the years, I’ve contributed to and collaborated on various projects. A few years ago, I decided to pursue a master’s in data science to deepen my theoretical knowledge and further enhance my skills. Open Source has also been a significant part of my work; the frameworks, tools and community have continually drawn me in, making me an active participant in this evolving conversation for years.
What motivated you to join this co-design process to define Open Source AI?
AI owes much of its progress to Open Source, and it’s essential for continued innovation. My experience in both AI and Open Source spans many years, and I believe this co-design process offers a unique chance to contribute meaningfully. It’s not just about sharing my insights but also about learning from other professionals across AI and different disciplines. This collective knowledge and these diverse perspectives make the initiative truly powerful and enriching as we shape the future of Open Source AI together.
Can you describe your experience participating in this process? What did you most enjoy about it, and what were some of the challenges you faced?
Participating in this process has been both rewarding and challenging. I’ve particularly enjoyed engaging with diverse groups and hearing different perspectives. The in-person events, such as All Things Open in Raleigh in 2023, have been valuable for fostering direct collaboration and building relationships. However, balancing these meetings with my work duties has been challenging. Coordinating schedules and managing time effectively to attend all the relevant discussions can be demanding. Despite these challenges, the insights and progress have made the effort worthwhile.
Why do you think AI should be Open Source?
We often say AI is everywhere, and while that’s partially true, I believe AI will be everywhere, significantly impacting our lives. However, AI’s full potential can only be realized if it is open and accessible to everyone. Open Source AI should also foster innovation by enabling developers and researchers from all backgrounds to contribute to and improve existing models, frameworks and tools, allowing freedom of expression. Without open access, involvement in AI can be costly, limiting participation to only a few large companies. Open Source AI should aim to democratize access, allowing small businesses, startups and individuals to leverage powerful tools that might otherwise be out of reach due to cost or proprietary barriers.
What do you think is the role of data in Open Source AI?
Data is essential for any AI system. Initially, from my ML-biased perspective, open and accessible datasets were crucial for effective ML development. However, I’ve reevaluated this perspective, considering how to adapt the system while staying true to Open Source principles. As AI models, particularly GenAI like LLMs, become increasingly complex, I’ve come to value the models themselves. For example, Generative AI requires vast amounts of data, and gaining access to this data can be a significant challenge.
This insight has led me to consider what I—whether as a researcher, developer or user—truly need from a model to use/investigate it effectively. While understanding the data used in training is important, having access to specific datasets may not always be necessary. In approaches like federated learning, the model itself can be highly valuable while keeping data private, though understanding the nature of the data remains important. For LLMs, techniques such as fine-tuning, RAG and RAFT emphasize the benefits of accessing the model rather than the original dataset, providing substantial advantages to the community.
Sharing model architecture and weights is crucial, and data security can be maintained through methods like model introspection and fine-tuning, reducing the need for extensive dataset sharing.
Data is undoubtedly a critical component. However, if the essence of Open Source AI lies in ensuring transparency, then the focus should be on how data is used in training models. Documenting which datasets were used and the data handling processes is essential. This transparency helps the community understand the origins of the data, assess potential biases and ensure the responsible use of data in model development. While sharing the exact datasets may not always be necessary, providing clear information about data sources and usage practices is crucial for maintaining trust and integrity in Open Source AI.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
Of course, it changed and evolved – that’s what a thought process is about. I’d be stubborn if I never changed my perspective along the way. I’ve often questioned even the most fundamental concepts I’ve relied on for years, avoiding easy or lazy assumptions. This thorough process has been essential in refining my understanding of Open Source AI. Engaging in meaningful exchanges with others has shown me the importance of practical definitions that can be implemented in real-world scenarios. While striving for an ideal, flawless definition is tempting, I’ve found that embracing a pragmatic approach is ultimately more beneficial.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
As I see it, the Open Source AI Definition will support the growth, and it will be the first big step. The primary benefit of having a clear definition of Open Source AI will be increased clarity and consistency in the field. This will enhance collaboration by setting clear standards and expectations for researchers, developers and organizations. It will also improve transparency by ensuring that AI models and tools genuinely follow Open Source principles, fostering trust in their development and sharing.
A clear definition will create standardized practices and guidelines, making it easier to evaluate and compare different Open Source AI projects.
What do you think are the next steps for the community involved in Open Source AI?
The next steps for the community should start with setting up a certification process for AI models to ensure they meet certain standards. This could include tools to help automate the process. After that, it would be helpful to offer templates and best practice guides for AI models. This will support model designers in creating high-quality, compliant systems and make the development process smoother and more consistent.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the forum: share your comment on the drafts.
- Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
- Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Three things I learned at KubeCon + AI_Dev China 2024
KubeCon China 2024 was a whirlwind of innovation, community and technical deep dives. As it often happens at these community events, I was blown away by the energy, enthusiasm and sheer amount of knowledge being shared. Here are three key takeaways that stood out to me:
1. The focus on AI and machine learning
AI and machine learning are increasingly integrated into cloud-native applications. At KubeCon China, I saw numerous demonstrations of how these technologies are being used to automate tasks, optimize resource utilization and improve application performance. From AI-powered observability tools to machine learning-driven anomaly detection, the potential for AI and ML in the cloud-native space is astounding.
Mer Joyce and Anni Lai introduced the new draft of the Open Source AI Definition (v.0.0.9) and the Model Openness Framework.
We also saw a robot on stage demonstrating that teaching a robotic arm to use a spoon to help disabled people is not a programming issue but a data issue. This was probably my biggest learning moment: A robot can be “taught” to execute tasks by imitating humans. Follow Xavier Tao and the dora-rs project.
2. The growing maturity of cloud-native technologies
It’s clear that cloud-native technologies have come of age. From Kubernetes adoption to the rise of serverless platforms and edge computing, the ecosystem is thriving. In his keynote, Chris Aniszczyk announced that over 200 projects are hosted by the Cloud Native Computing Foundation and that half of the contributors are not in the US. The conference showcased a wide range of tools, frameworks and use cases that demonstrate the versatility and scalability of cloud-native architectures.
The presentation by Kevin Wang (Huawei) and Saint Jiang (NIO) showed how Containerd, Kubernetes and KubeEdge power the transition to electric vehicles. Modern cars are computers… no, cars are full datacenters on wheels, a collection of sensors feeding distributed applications to optimize battery usage, feeding into centralized programs to constantly improve the whole mobility system.
3. AI technology is removing the language barrier
I was absolutely amazed by being able to follow the keynote sessions delivered in Chinese. I don’t speak Chinese but I could read the automatic translation in real time superimposed on the slides behind the speakers. This technology is absolutely jaw-droppingly amazing! Within a few years, there won’t be a career for simultaneous translators or for live transcribers.
Final thoughts
KubeCon + AI_Dev China was a testament to the power of Open Source collaboration hosted in one of the most amazing regions of the world. The conference brought together developers, operators and end-users from around the world to share their experiences, best practices and contributions to Open Source projects. This collaborative spirit is essential for driving innovation and ensuring the long-term success of cloud-native technologies.
Open Source AI – Weekly update August 26
As we move toward the release of the first-ever Open Source AI Definition in October at All Things Open, the publication of the 0.0.9 draft brings us one step closer to realizing this goal.
- OSAID 0.0.9 draft definition is live!
- Changelog includes:
  - New Feature: Clarified Open Source Models and Weights
    - Added a new paragraph under “What is Open Source AI” to define “system” as including both models and weights.
    - Clarified that all components of a larger system must meet the standard.
    - Updated paragraph after the “share” bullet to emphasize this point.
  - New Section: Open Source Models and Open Source Weights
    - Added descriptions of components for both models and weights in machine learning systems.
    - Edited subsequent paragraphs to eliminate redundancy.
  - Training Data: Defined as a Benefit, Not a Requirement
    - Defined open, public, and unshareable non-public training data.
    - Explained the role of training data in studying AI systems and understanding biases.
    - Emphasized extra requirements for data to advance openness, especially in private-first areas like healthcare.
  - Separation of Checklist
    - The Checklist is now a separate document from the main Definition.
    - Fully aligned Checklist content with the Model Openness Framework (MOF).
  - Terminology Changes
    - Replaced “Model” with “Weights” under “Preferred form to make modifications” for consistency.
  - Explicit Reference to Recipients of the Four Freedoms
    - Added specific references to developers, deployers, and end users of AI systems.
  - Credits and References
    - Incorporated credit to the Free Software Definition.
    - Added references to conditions of availability of components, referencing the Open Source Definition.
- Initial reactions on the forum:
- @shujisado praises the updates in version 0.0.9, particularly the decision to separate the checklist from the main document, which clarifies the intent behind OSAID. He also supports the separation of “code” and “weights,” noting that in Japan, “code” clearly falls under copyright, making this distinction logical. He acknowledges revisions in the checklist that consider the importance of complete datasets, even though he disagrees with making datasets mandatory.
- Comments on the draft on HackMD
- @Joshua Gay adds that instead of narrowing the focus to machine-learning systems, the emphasis should be on “parameters” as a whole since weights are just one type of parameter. He suggests a rewrite that highlights making model parameters, such as weights and other settings, available under OSI-approved terms, with examples across various AI models.
- He further suggests using broader language that covers more AI systems instead of narrower terminology. Specifically, he proposes replacing “Open Source models and Open Source weights” with “Open Source models and Open Source parameters,” and using “AI systems” instead of “machine learning systems.” Additionally, he recommends redefining an AI model to include architecture, parameters like weights and decision boundaries, and inference code, while referring to AI parameters as configuration settings that produce outputs from inputs.
- Under “Open Source models and Open Source weights”, @shujisado adds that the last paragraph titled “Open Source models and Open Source weights” actually explains “AI model” and “AI weights,” leading to a mismatch between the title and content, and notes that these terms are not used elsewhere in the definition.
- Under “Preferred form to make modifications to machine-learning systems”, @shujisado suggests some grammatical corrections.
- Next steps
- The OSI has recently presented at the following events:
- Hong Kong for AI_dev, August 21-23
- Beijing for Open Source Congress, August 25-27
- Iterate Drafts: Continue refining drafts with feedback from the worldwide roadshow, considering new dissenting opinions.
- Review Licenses: Decide on the best approach for reviewing new licenses for datasets, documentation, and model parameters.
- Enhance FAQ: Continue improving the FAQ to address emerging questions.
- Post-Stable Release Plan: Establish a process for reviewing and updating future versions of the Open Source AI Definition.
- Get involved:
- Join the forum and share your opinion.
- Leave a comment on the draft v.0.0.9 with precise feedback.
- Follow the weekly recaps and subscribe to our monthly newsletter.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions, and share your thoughts. The next is on September 6.
- Join the workshops and scheduled conferences
- @Kjetilk points out the legal distinction between using copyrighted works for AI training (reproduction) and incorporating them into publishable datasets, questioning the fairness of allowing exploitative models without compensation while potentially banning those that benefit society.
- @Shujisado clarifies that compensation for copyrighted works used in AI training is possible for both open source and closed models, distinguishing it from “royalty,” and notes that Japan’s copyright law exempts such uses for machine learning.
- @Kjetilk reiterates the relevance of “royalty” for compensation in closed, non-published models, suggesting it makes sense under copyright law if required, but if not, it could benefit science and the arts.
Members Newsletter – August 2024
The lively conversation about the role of data in building and modifying AI systems will continue as the OSI travels to China this month for AI_dev (August 21-23 in Hong Kong) and Open Source Congress (August 25-27 in Beijing). The OSI has been able to chime in on news stories on the topic, several of which are linked here in the newsletter.
Last month the OSI was at the United Nations in New York City for OSPOs for Good, an event that covered key areas of open source policy, as well as emerging examples of ‘Open Source for good’ from across the globe. I participated in a panel on Open Source AI.
Creating an Open Source AI Definition has been an arduous task over the past couple of years, but we know the importance of creating this standard so the freedoms to use, study, share and modify AI systems can be guaranteed. Those are the core tenets of Open Source, and it warrants the dedicated work it has required. Please read about the people who have played key roles in bringing the Definition to life in our Voices of Open Source AI Definition on the blog.
Stefano Maffulli
Executive Director, OSI
I hold weekly office hours on Fridays with OSI members: book time if you want to chat about OSI’s activities, if you want to volunteer or have suggestions.
News from the OSI
OSI at the United Nations OSPOs for Good
From the Research and Advocacy program
Earlier this month the Open Source Initiative participated in the “OSPOs for Good” event promoted by the United Nations in NYC. Read more.
The Open Source Initiative joins CMU in launching Open Forum for AI: A human-centered approach to AI development
From the Research and Advocacy program
The Open Source Initiative (OSI) is pleased to share that we are joining the founding team of Open Forum for AI (OFAI), an initiative designed by Carnegie Mellon University (CMU). Read more
GUAC adopts license metadata from ClearlyDefined
From the License and Legal program
The software supply chain just gained some transparency thanks to an integration of the Open Source Initiative (OSI) project, ClearlyDefined, into GUAC (Graph for Understanding Artifact Composition), an OpenSSF project from the Linux Foundation. Read more.
Better identifying conda packages with ClearlyDefined
From the License and Legal program
ClearlyDefined now provides a new harvester implementation for conda, a popular package manager with a large collection of pre-built packages for various domains, including data science, machine learning, scientific computing and more. Read more.
OSI in the news
Can AI even be open source? It’s complicated
OSI at ZDNet
AI can’t exist without open source, but the top AI vendors are unwilling to commit to open-sourcing their programs and data sets. To complicate matters further, defining open-source AI is a messy issue that has yet to be settled. Read more.
Open Source AI: What About Data Transparency?
OSI at The New Stack
AI uses both code and data, and this combination continues to be a challenge for open source, said experts at the United Nations OSPOs for Good Conference. Read more.
A new White House report embraces open-source AI
OSI at ZDNet
The National Telecommunications and Information Administration (NTIA) issued a report supporting open-source and open models to promote innovation in AI, while emphasizing the need for vigilant risk monitoring. Read more.
With Open Source Artificial Intelligence, Don’t Forget the Lessons of Open Source Software
OSI at CISA
While there is not yet a consensus on the definition of what constitutes “open source AI”, the Open Source Initiative, which maintains the “Open Source Definition” and a list of approved OSS licenses, has been “driving a multi-stakeholder process to define an ‘Open Source AI’”. Read more.
Meta inches toward open source AI with new LLaMA 3.1
OSI at ZDNet
Is Meta’s 405 billion parameter model really open source? Depends on who you ask. Here’s how to try out the new engine for yourself. Read more.
Other news
News from OSI affiliates
- Mozilla Foundation: Mozilla’s Policy Vision for the new EU Mandate: Advancing Openness, Privacy, Fair Competition, and Choice for all
- OASIS Open: The biggest names in AI have teamed up to promote AI security
- Apache Software Foundation, Eclipse Foundation, Linux Foundation: How open source attracts some of the world’s top innovators
- Eclipse Foundation: The Eclipse Foundation Announces Agenda and Keynote Speakers for Open Community Experience (OCX 2024), Europe’s Premier Event for Open Source Innovation
- Open Source takes center stage at United Nations
- Open Source in Europe: Facing the regulatory challenge
- Open Source projects vs products: A strategic approach
The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process.
7th annual OSPO and Open Source Management Survey
The TODO Group and Linux Foundation Research, in partnership with Cisco, NGINX, Open Source Initiative, InnerSource Commons, and CHAOSS, are excited to be launching the 7th annual OSPO and Open Source Management survey! Take survey here.
2024 Open Source Software Funding Survey
This survey tries to better understand how organizations fund, contribute to, and support open source software projects. This survey is a collaboration between GitHub, Inc., the Linux Foundation, and researchers from Harvard University. Take survey here.
Events
Upcoming events
- AI_dev China (August 21-23, 2024 – Hong Kong)
- Open Source Congress (August 25-27, 2024 – Beijing)
- Open Source Summit Europe (September 16-18, 2024 – Vienna)
- Nerdearla Argentina (September 24-28, 2024 – Buenos Aires)
- SOSS Fusion (October 22-23, 2024 – Atlanta)
- Open Community Experience (October 22-24, 2024 – Mainz)
- All Things Open (October 27-29, 2024 – Raleigh)
- OpenForum Academy Symposium (November 13-14, 2024 – Boston)
- Cisco
- Microsoft
- Bloomberg
- SAS
- Intel
Interested in sponsoring, or partnering with, the OSI? Please see our Sponsorship Prospectus and our Annual Report. We also have a dedicated prospectus for the Deep Dive: Defining Open Source AI. Please contact the OSI to find out more about how your company can promote open source development, communities and software.
Support OSI by becoming a member!
Let’s build a world where knowledge is freely shared, ideas are nurtured, and innovation knows no bounds!
Community input drives the new draft of the Open Source AI Definition
A new version of the Open Source AI Definition has been released with one new feature and a cleaner text, based on comments received from public discussions and recommendations. We’re continuing our march towards having a stable release by the end of October 2024, at All Things Open. Get involved by joining the discussion on the forum, finding OSI staff around the world and online at the weekly town halls.
New feature: clarified Open Source model and Open Source weights
- Under “What is Open Source AI,” there is a new paragraph that (1) identifies both models and weights/parameters as encompassed by the word “system” and (2) makes it clear that all components of a larger system have to meet the standard. There is a new sentence in the paragraph after the “share” bullet making this point.
- Under the heading “Open Source models and Open Source weights,” there is a description of the components for both of those for machine learning systems. We also edited the paragraph below those additions to eliminate some redundancy.
The role of training data is one of the most hotly debated parts of the definition. After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.
Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.
Data can be hard to share. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information, such as decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.
- Open training data (data that can be reshared) provides the best way to enable users to study the system, along with the preferred form of making modifications.
- Public training data (data that others can inspect as long as it remains available) also enables users to study the work, along with the preferred form.
- Unshareable non-public training data (data that cannot be shared for explainable reasons) gives the ability to study some of the system’s biases and demands a detailed description of the data – what it is, how it was collected, its characteristics, and so on – so that users can understand the biases and categorization underlying the system.
OSI believes these extra requirements for data beyond the preferred form of making modifications to the AI system both advance openness in all the components of the preferred form of modifying the AI system and drive more Open Source AI in private-first areas such as healthcare.
Other changes
- The Checklist is separated into its own document. This is to separate the discussion about how to identify Open Source AI from the establishment of general principles in the Definition. The content of the Checklist has also been fully aligned with the Model Openness Framework (MOF), allowing for an easy overlay.
- Under “Preferred form to make modifications,” the word “Model” changed to “Weights.” The word “Model” was referring only to parameters, and was inconsistent with how the word “model” is used in the rest of the document.
- There is an explicit reference to the intended recipients of the four freedoms: developers, deployers and end users of AI systems.
- Incorporated credit to the Free Software Definition.
- Added references to conditions of availability of components, referencing the Open Source Definition.
Next steps
- Continue iterating through drafts after meeting diverse stakeholders at the worldwide roadshow, collect feedback and carefully look for new arguments in dissenting opinions.
- Decide how to best address the reviews of new licenses for datasets, documentation and the agreements governing model parameters.
- Keep improving the FAQ.
- Prepare for post-stable-release: Establish a process to review future versions of the Open Source AI Definition.
We will be taking draft v.0.0.9 on the road to collect input and endorsements, thanks to a grant by the Sloan Foundation. The lively conversation about the role of data in building and modifying AI systems will continue at multiple conferences around the world, at the weekly town halls and online throughout the Open Source community.
The first two stops are in Asia: Hong Kong for AI_dev August 21-23, then Beijing for Open Source Congress August 25-27. Other events are planned to take place in Africa, South America, Europe and North America. These are all steps toward the conclusion of the co-design process that will result in the release of the stable version of the Definition in October at All Things Open.
Creating an Open Source AI Definition has been an arduous task over the past two years, but we know the importance of creating this standard so the freedoms to use, study, share and modify AI systems can be guaranteed. Those are the core tenets of Open Source, and they warrant the dedicated work this has required. You can read about the people who have played key roles in bringing the Definition to life in our Voices of Open Source AI Definition on the blog.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the forum: share your comments on the drafts.
- Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
- Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Mark Collier: Voices of the Open Source AI Definition
Meet Mark Collier
What’s your background related to Open Source and AI?
I’ve worked in Open Source most of my career, over 20 years, and have found it to be one of the greatest, if not the greatest, drivers of economic opportunity. It creates new markets and gives people all over the world access to not only use but to influence the direction of technologies. I started the OpenStack project and then the OpenStack Foundation, and later the Open Infrastructure Foundation. With members of our foundation from over 180 countries, I’ve seen firsthand how Open Source is the most efficient way to drive innovation. You get to crowdsource ideas from people all over the world, that are not just in one company or just in one country. We’ve certainly seen that with infrastructure in the cloud computing/edge computing era. AI is the next generation wave, with people investing literally trillions of dollars in building infrastructure, both physical and the software being written around it. This is another opportunity to embrace Open Source as a path to innovation.
Open Source drives the fastest adoption of new technologies and gives the most people an opportunity to both influence it and benefit economically from it, all over the world. I want to see that pattern repeat in this next era of AI.
What motivated you to join this co-design process to define Open Source AI?
I’m concerned about the potential of there not being credible Open Source alternatives to the big proprietary players in this massive next wave of technology. It will be a bad thing for humanity if we can only get state-of-the-art AI from two or three massive players in one or two countries. In the same way we don’t want to see just one cloud provider or one software vendor, we don’t want any sort of monopoly or oligopoly in AI; that really slows innovation. I wanted to be part of this co-design process because it’s actually not trivial to apply the concept of Open Source to AI. We can carry over the principles and freedoms that underlie Open Source software, like the freedom to use it without restriction and the ability to modify it for different use cases, but an AI system is not just software. A whole debate has been stirred up about whether data needs to be released and published under an Open Source friendly license to be considered Open Source AI. That’s just one consideration of many that I wanted to contribute to.
We have a very impressive group of people with a lot of diverse backgrounds and opinions working on this. I wanted to be part of the process not because I have the answers, but because I have some perspective and I can learn from all the others. We need to reach a consensus on this because if we don’t, the meaning of Open Source in the AI era will get watered down or potentially just lost altogether, which affects all of Open Source and all of technology.
Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?
The process started as a mailing list of sorts and evolved to more of an online discussion forum. Although it hasn’t always been easy for me to follow along, the folks at OSI that have been shepherding the process have done an excellent job summarizing the threads and bringing key topics to the top. Discussions are happening rapidly in the forum, but also in the press. There are new models released nearly every day it seems, and the bar for what are called Open Source models is causing a lot of noise. It’s a challenge for anybody to keep up but overall I think it’s been a good process.
Why do you think AI should be Open Source?
The more important a technology is to the future of the economy and the more a technology impacts our daily lives, the more critical it is that it be Open Source, for economic and participation reasons but also for security. We have seen time and time again that transparency and openness breed better security. With more mysterious and complex technologies such as AI, Open Source offers the transparency to help us understand the decisions the technology is making. There have been a number of large players who have been lobbying for more regulation, making it more difficult to have Open Source AI, and I think that shows a very clear conflict of interest.
There is legislation out there that, if it gets passed, poses a real danger to not just Open Source AI, but Open Source in general. We have a real opportunity but also a real risk of there being conscious concentrations of power if state-of-the-art AI doesn’t fit a standard definition of Open Source. Open Source AI continues to be neck and neck with the proprietary models, which makes me optimistic.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
My personal definition of Open Source AI is not set in stone even having been through this process for over a year. Things are moving so quickly, I think we need to be careful that perfect doesn’t become the enemy of good. Time is of the essence as the mainstream media and the tech press report on models that are trained on billions of dollars worth of hardware, claiming to be Open Source when they clearly are not. I’ve become more willing to compromise on an imperfect definition so we can come to a consensus sooner.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
All the reasons people love Open Source are inherently the same reasons why people are very tempted to put an Open Source label on their AI: trust, transparency, the ability to modify it and build a business on it, and confidence that the license won’t be changed. Once we finalize and ratify the definition, we can start broadly using it in practice. This will bring some clarity to the market again. We need to be able to point to something very clear and documented if we’re going to challenge a technology that has been labeled Open Source AI. Legal departments of big companies working on massive AI tools and workloads want to know that their license isn’t going to be pulled out from under them. If the definition upholds the key freedoms people expect from Open Source, it will lead to faster adoption by all.
What do you think are the next steps for the community involved in Open Source AI?
I think Stefano from the OSI has done a wonderful job of trying to hit the conference circuit to share and collect feedback, and virtual participation in the process is still key to keeping it inclusive. I think the next step is building awareness in the press about the definition and market testing it. It’s an iterative process from there.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the working groups: be part of a team to evaluate various models against the OSAID.
- Join the forum: support and comment on the drafts, record your approval or concerns to new and existing threads.
- Comment on the latest draft: provide feedback on the latest draft document directly.
- Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
- Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Update from the board of directors
The Chair of the Board of the OSI has acknowledged the resignation offered by the Secretary of the Board, Aeva Black. The Chair and the entire Board would like to thank Black for their invaluable contribution to the success of OSI, as well as of the entire Open Source community, and for their service as board member and officer of the Initiative.
Jean-Pierre Lorre: Voices of the Open Source AI Definition
Meet Jean-Pierre Lorre
What’s your background related to Open Source and AI?
I’ve been using Open Source technologies since the very beginning of my career and have been directly involved in Open Source projects for around 20 years.
I graduated in artificial intelligence engineering in 1985. Since then I have worked in a number of applied AI research structures in fields such as medical image processing, industrial plant supervision, speech recognition and natural language processing. My knowledge covers both symbolic AI methods and techniques and deep learning.
I currently lead a team of around fifteen AI researchers at LINAGORA. LINAGORA is an Open Source company.
What motivated you to join this co-design process to define Open Source AI?
The team I lead is heavily involved in the development of LLM generative models, which we want to distribute under an open license. I realized that the term Open Source AI was not defined and that the definition we had at LINAGORA was not the same as the one adopted by our competitors.
As the OSI is the leading organization for defining Open Source and there was a project underway to define the term Open Source AI, I decided to join it.
Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?
I participated in two ways: firstly, to provide input for the definition currently being drafted; and secondly, to evaluate LLM models with regard to the definition (I contributed to Bloom, Falcon and Mistral).
For the first item, my main difficulty was keeping up with the meandering discussions, which were very active. I didn’t manage to do so completely, but I was able to appreciate the summaries provided from time to time, which enabled me to follow the overall thread.
The second difficulty concerns the evaluation of the models: the aim of the exercise was to evaluate the consistency of OSAID version 0.8 on models that currently claim to be “Open Source.” Implementing the definition involves looking for information that is sometimes non-existent and sometimes difficult to find.
Why do you think AI should be Open Source?
Artificial intelligence models are expected to play a very important role in our professional lives, but also in our everyday lives. In this respect, the need for transparency is essential to enable people to check the properties of the models. They must also be accessible to as many people as possible, to avoid widening the inequalities between those who have the means to develop them and those who will remain on the sidelines of this innovation. Similarly, it should be possible to adapt them for different uses without the need for authorization.
The Open Source approach makes it possible to create a community such as the one created by LINAGORA, OpenLLM-Europe. This is a way for small players to come together to build the critical mass needed not only to develop models but also to disseminate them. Such an approach, which may be compared to that associated with the digital commons, is a guarantee of sovereignty because it allows knowledge and governance to be shared.
In short, they are the fruit of work based on data collected from as many people as possible, so they must remain accessible to as wide an audience as possible.
What do you think is the role of data in Open Source AI?
Data provides the basis for training models. It is therefore the pool of information from which the knowledge displayed by the model and the applications deduced from it will be drawn. In the case of an open model, the dissemination of as many elements as possible to qualify this data is a means of transparency that facilitates the study of the model’s properties; indeed, this data is likely to include biases related to culture, gender, ethnic origin, skin color, etc. It also makes it easier to modify the model and its outputs.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
Yes, we initially thought that the provision of training data was a sine qua non condition for the design of truly Open Source models. Our basic assumption was that the model may be seen as a work derived from the data and that therefore the license assigned to the data, in particular the non-commercial nature, had an impact on the license of the model. As the discussions progressed, we realized that this condition was very restrictive and severely limited the possibility of developing models.
Our current analysis is that the condition defined in version 0.8 of the OSAID is sufficient to provide the necessary guarantees of transparency for the four freedoms and in particular the freedom to study the model underlying access to data. With regard to the data, it stipulates that “sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data” must be provided. Even if we can agree that this condition seems difficult to satisfy without providing the data sets, other avenues may be envisaged, in particular the provision of synthetic data. This information should make it possible to carry out almost all studies of the model.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
Having such a definition with clear, implementable rules will provide model suppliers with a concrete framework for producing models that comply with the ethics of the Open Source movement.
A collateral effect will be to help separate the “wheat from the chaff,” in particular by detecting attempts at “Open Source washing.” This definition is therefore a structuring element for a company such as LINAGORA, which wants to build a sustainable business model around the provision of value-added AI services.
It should also be noted that such a definition is necessary for regulations such as the European AI Act, which defines exceptions for Open Source generative models. Such legislative construction cannot be satisfied with a fuzzy basis.
What do you think are the next steps for the community involved in Open Source AI?
The next steps that need to be addressed by the community concern firstly the definition of a certification process that will formalize the conformity of a model; this process may be accompanied by tools to automate it.
In a second phase, it may also be useful to provide templates of AI models that comply with the definition, as well as best practice guides, which would help model designers.
GUAC adopts license metadata from ClearlyDefined
The software supply chain just gained some transparency thanks to an integration of the Open Source Initiative (OSI) project, ClearlyDefined, into GUAC (Graph for Understanding Artifact Composition), an OpenSSF project from the Linux Foundation. GUAC provides a comprehensive mapping of software packages, dependencies, vulnerabilities, attestations, and more, allowing organizations to achieve better compliance and security of their software supply chain.
GUAC offers the full view of the supply chain
Software supply chain attacks are on the rise. Many tools are available to help generate software bills of materials (SBOMs), signed attestations and vulnerability reports, but they stop there, leaving users to figure out how they all fit together. GUAC provides an aggregated, queryable view across the whole software supply chain, not just one SBOM at a time.
GUAC is for developers, operations and security practitioners who need to identify and address problems in their software supply chain, including proactively managing dependencies and responding to vulnerabilities. GUAC provides supply chain observability with a graph view of the software supply chain and tools for performing queries to gain actionable insights.
GUAC enhanced with ClearlyDefined integration
The latest version of GUAC (v0.8.0) now provides support for ClearlyDefined. GUAC will query the ClearlyDefined license metadata store to discover license information for packages, even when the SBOM does not include that information.
A ClearlyDefined certifier will listen on collector-subscriber for any pkg/src strings, convert them to ClearlyDefined coordinates, and then query the API service for the definition. The user agent will be the same as for existing outgoing GUAC requests: GUAC/<version> (e.g. GUAC/v0.1.0).
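To make the conversion step more concrete, here is a minimal Python sketch of turning a package URL (purl) into ClearlyDefined coordinates of the form type/provider/namespace/name/revision. It is not GUAC's actual code: the provider table, the API base URL and the function names are assumptions made for illustration, and only simple purls with a version are handled.

```python
from urllib.parse import unquote

# Assumed mapping from purl type to (ClearlyDefined type, provider).
# Illustrative subset only; GUAC's real table may differ.
PROVIDERS = {
    "npm": ("npm", "npmjs"),
    "pypi": ("pypi", "pypi"),
    "maven": ("maven", "mavencentral"),
}

# Assumed base URL of the ClearlyDefined definitions endpoint.
API_BASE = "https://api.clearlydefined.io/definitions"


def purl_to_coordinates(purl: str) -> str:
    """Convert a simple purl into type/provider/namespace/name/revision."""
    if not purl.startswith("pkg:"):
        raise ValueError("not a purl: " + purl)
    # Drop the scheme, then any subpath (#...) and qualifiers (?...).
    body = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0]
    path, _, version = body.rpartition("@")
    if not path or not version:
        raise ValueError("purl must include a version: " + purl)
    purl_type, _, rest = path.partition("/")
    if purl_type not in PROVIDERS:
        raise ValueError("unsupported purl type: " + purl_type)
    namespace, _, name = rest.rpartition("/")
    cd_type, provider = PROVIDERS[purl_type]
    # ClearlyDefined uses "-" as the placeholder for an empty namespace.
    return "/".join(
        [cd_type, provider, unquote(namespace) or "-", unquote(name), version]
    )


def definition_url(coordinates: str) -> str:
    """Build the URL that would be queried for a definition."""
    return API_BASE + "/" + coordinates


def user_agent(guac_version: str) -> str:
    """User agent string in the GUAC/<version> form described above."""
    return "GUAC/" + guac_version
```

For example, `purl_to_coordinates("pkg:npm/lodash@4.17.21")` yields `npm/npmjs/-/lodash/4.17.21`, and the request would carry the header value returned by `user_agent("v0.1.0")`.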
A CertifyLegal node will be created using the “licensed” “declared” field from the definition. The expression will be copied and any license identifiers found will result in linked License noun nodes, created if needed. Type will be “declared”. Justification will be “Retrieved from ClearlyDefined”. Time will be the current time the information was retrieved from the API.
Similarly a node will be created using the “licensed” “facets” “core” “discovered” “expressions” field. Multiple expressions will be “AND”ed together. Type will be “discovered”, and other fields the same (Time, Justification, License links, etc).
The “licensed” “facets” “core” “attribution” “parties” array will be concatenated and stored in the Attribution field on CertifyLegal.
Optionally, “described” “sourceLocation” can be used to create a HasSourceAt GUAC node.
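The field mapping described above can be sketched in Python as follows. This is a hedged illustration, not GUAC's implementation: the record layout, the function names, the newline separator for attribution parties and the parenthesization of compound expressions are all assumptions. The identifier extraction is deliberately naive and would also report exception names that follow `WITH`.

```python
import re
from datetime import datetime, timezone

SPDX_OPERATORS = {"AND", "OR", "WITH"}


def license_ids(expression):
    """Return distinct identifiers in an SPDX expression, in order of appearance."""
    ids = []
    for token in re.findall(r"[A-Za-z0-9.+-]+", expression):
        if token not in SPDX_OPERATORS and token not in ids:
            ids.append(token)
    return ids


def and_together(expressions):
    """Join expressions with AND; compound ones are parenthesized (our choice)."""
    if len(expressions) == 1:
        return expressions[0]
    return " AND ".join("(" + e + ")" if " " in e else e for e in expressions)


def certify_legal_records(definition, retrieved_at=None):
    """Build hypothetical CertifyLegal-like records from a definition dict."""
    retrieved_at = retrieved_at or datetime.now(timezone.utc)
    licensed = definition.get("licensed", {})
    core = licensed.get("facets", {}).get("core", {})
    # Separator is an assumption; the text only says "concatenated".
    attribution = "\n".join(core.get("attribution", {}).get("parties", []))

    def record(kind, expression):
        return {
            "type": kind,
            "expression": expression,
            "licenses": license_ids(expression),
            "attribution": attribution,
            "justification": "Retrieved from ClearlyDefined",
            "time": retrieved_at,
        }

    records = []
    declared = licensed.get("declared")
    if declared:
        records.append(record("declared", declared))
    discovered = core.get("discovered", {}).get("expressions", [])
    if discovered:
        records.append(record("discovered", and_together(discovered)))
    return records
```

The optional "described" "sourceLocation" step is left out of this sketch. Passing `retrieved_at` explicitly keeps the function deterministic for testing; in normal use it defaults to the current UTC time.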
Thanks to the community
Although licenses don’t directly impact security, they are an important part of understanding the software supply chain. We would like to thank Parth Patel (Kusari), Jeff Mendoza (Kusari), Ben Cotton (Kusari), and Qing Tomlinson (SAP) for their support to get this feature implemented in GUAC. The ClearlyDefined community looks forward to working together with the GUAC community to help organizations worldwide to better achieve compliance and security of their software supply chain.
Deshni Govender: Voices of the Open Source AI Definition
Meet Deshni Govender
What’s your background related to Open Source and AI?
I am the South Africa country focal point for the German Development Cooperation initiative “FAIR Forward – Artificial Intelligence for All,” a project that strives for a more open, inclusive and sustainable approach to AI on an international level. More significantly, we seek to democratize the field of AI, to enable more robust, inclusive and self-determined AI ecosystems. Having worked in the private sector and now being in international development, my attention has been drawn to the power imbalance between proprietary and open approaches and how it results in economic barriers for the global majority, but also creates further harms and challenges for vulnerable populations and marginalized sectors, especially women. This fuelled my journey of working towards bridging the digital divide and digital gender gap through democratizing technology.
Some projects I am working on in this space include developing data governance models for African NLP (with Masakhane Foundation) and piloting new community-centered, equitable license types for voice data collection for language communities (with Mozilla).
What motivated you to join this co-design process to define Open Source AI?
I have experienced first hand the power imbalances that exist in geo-politics, but also in the context of economics where global minority countries shape the ‘global trajectory’ of AI without global voices. The definition of open means different things to different people / ecosystems / communities, and all voices should be heard and considered. Defining open means the values and responsibilities attached to it should be considered in a diverse manner, else the context of ‘open’ is in and of itself a hypocrisy.
Why do you think AI should be Open Source?
An enabling ecosystem is one that benefits all the stakeholders and ecosystem components. Inclusive efforts must be made to find tangible actions or potential avenues to reconcile the tension between openness, democracy and representation in AI training data whilst preserving community agency, diverse values and stakeholder rights. However, the misuse, colonization and misinterpretation of data continues unabated. Much of African culture and knowledge is passed down through generations by storytelling, art, dance and poetry, verbally or through different forms of documentation, and in local manners and nuances of language. It is rarely digitized and certainly not in English. Language is culture and culture is context, yet somehow we find LLMs being used as an agent for language and context. Solutions and information are provided about and for communities but not with those communities, and the lack of transparency and post-colonial manipulation of data and culture are irresponsible and should be considered a human rights violation.
Additionally, Open Source and open systems enable nations to develop inclusive AI policy processes so that policymakers from Global South countries can draw from peer experience on tackling their AI policies and AI-related challenges to find their own approaches to AI policy. This will also challenge dependence on and domination by Western-centric / Global North countries on AI policies to push a narrative or agenda on ‘what’ and ‘how’; i.e. Africa / Asia / LATAM must learn from us how to do X (since we hold the power, we can determine the extent and cost – exploitative). We aim for government self-determination and to empower countries, so that they may collectively have a voice on the global stage.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
My personal definition has not changed but it has been refreshing to witness the diverse views on how open is defined. The idea that behavior (e.g. of tech oligopolies) could reshape the way we define an idea or concept was thought-provoking. It means therefore that as emerging technology evolves, the idea of ‘open’ could change still in the future, depending on the trajectory of emerging technology and the values that society holds and attributes.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
A clear and more inclusive definition of Open Source AI would commence a wave towards making data injustice, data invisibility, data extractivism, and data colonialism more visible, with repercussions attached. It would spur open, inclusive and responsible repositories of data, data use, and more importantly accuracy of use and interpretation. I am hoping that this would also spur innovative ways to track and monitor / evaluate use of Open Source data, so that local and small businesses are encouraged to develop in Open Source while still being able to track and monitor players who extract and commercialize without giving back.
Ideally it would begin the process (albeit transitional) of bridging the digital divide between source and resource countries (i.e. global majority where data is collected from versus those who receive and process data for commercial benefit).
What do you think are the next steps for the community involved in Open Source AI?
If we make everything Open Source, it encourages sharing and use in developing and deploying, offers transparency and shared learning, but enables freeriding. However, the corollary is that closed models such as copyright prioritize proprietary information and commercialisation but can limit shared innovation, and do not uphold the concept of communal efforts, community agency and development. How do we quell this tension? I would like to see the Open Source community working to find practical and actionable ways in which we can make this work (open, responsible and innovative but enabling community benefit / remuneration).