Open Source Initiative

The steward of the Open Source Definition, setting the foundation for the Open Source Software ecosystem.

The Open Source AI Definition RC1 is available for comments

Wed, 2024-10-02 13:04

A little over a month after v.0.0.9, we have a Release Candidate version of the Open Source AI Definition. It was shaped by extensive community feedback: 5 town hall meetings, many comments on the forum and on the draft, and in-person conversations at events in Austria, China, India, Ghana, and Argentina.

There are three relevant changes to the part of the definition pertaining to the “preferred form to make modifications to a machine learning system.”

The feature that will draw the most attention is the new language on Data Information. It clarifies that all the training data must be disclosed and shared to the extent the law allows. The updated text comes from many conversations with individuals who engaged passionately with the design process, on the forum, in person and on hackmd. These conversations helped identify four types of data: open, public, obtainable and unshareable, each described in detail in the FAQ. The legal requirements differ for each type, but all must be shared in whatever form the law allows.

Two new features are equally important. RC1 clarifies that Code must be complete: enough for downstream recipients to understand how the training was done. This reinforces the importance of training code, for transparency, security and other practical reasons. Training is where innovation is happening at the moment, which is why you don't see corporations releasing their training and data processing code. We believe that, given the current state of knowledge and practice, this code is required to meaningfully fork (study and modify) AI systems.

Lastly, new text explicitly acknowledges that it is admissible to require copyleft-like terms for any of the Code, Data Information and Parameters, individually or as bundled combinations. An illustrative scenario: a consortium holding rights to training code and a dataset could distribute the code+data bundle under legal terms that tie the two together with copyleft-like provisions. No such legal document exists yet, but the scenario is plausible enough to deserve consideration. This is another area OSI will monitor carefully as we begin reviewing these legal terms with the community.

A note about science and reproducibility

The aim of Open Source is not and has never been to enable reproducible software. The same is true for Open Source AI: reproducibility of AI science is not the objective. Open Source’s role is merely not to be an impediment to reproducibility. In other words, one can always add more requirements on top of Open Source, just like the Reproducible Builds effort does.

Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and for everyone else. This is why OSD #2 requires that the “source code” be provided in the preferred form for making modifications. This way everyone has the same rights and ability to improve the system as the original developers, starting a virtuous cycle of innovation. Forking in the machine learning context has the same meaning as in software: having the ability and the rights to build a system that behaves differently from the original. A fork might fix security issues, improve behavior or remove bias. All of this is possible thanks to the requirements of the Open Source AI Definition.

What’s coming next

With the release candidate cycle starting today, the drafting process shifts focus: no new features, only bug fixes. We'll monitor newly raised issues, watching for major flaws that might require significant rewrites to the text. The main focus will be on the accompanying documentation, the Checklist and the FAQ. We also realized that, in our zeal to solve the problem of data that needs to be provided but cannot be supplied by the model owner for good reasons, we had failed to make clear the basic requirement that “if you can share the data you must.” We have already made adjustments in RC1 and will be seeking views on how to better express this in an RC2.

In the weeks leading up to the 1.0 release on October 28, we'll focus on:

  • Getting more endorsers to the Definition
  • Continuing to collect feedback on hackmd and forum, focusing on new, unseen-before concerns
  • Preparing the artifacts necessary for the launch at All Things Open
  • Iterating on the Checklist and FAQ, preparing them for deployment

Link to the Open Source AI Definition Release Candidate 1

Categories: FLOSS Research

A Journey toward defining Open Source AI: presentation at Open Source Summit Europe

Tue, 2024-10-01 13:05

A few weeks ago I attended Open Source Summit Europe 2024, an event organized by the Linux Foundation that brought together brilliant developers, technologists and leaders from all over the world, reinforcing what Open Source is truly about—collaboration, innovation and community.

I had the honor of leading a session that tackled one of the most critical challenges in the Open Source movement today—defining what it means for AI to be “Open Source.” Together with OSI Board Director Justin Colannino, I presented v.0.0.9 of the Open Source AI Definition. This session marked an important milestone for both the Open Source Initiative (OSI) and the broader community, a moment that encapsulated years of collaboration, learning and exploration.

The story behind the Open Source AI Definition

Our session, titled “The Open Source AI Definition Is (Almost) Ready,” was more than just a talk—it was an interactive dialogue. As Justin kicked off the session, he captured the essence of the journey we’ve been on. OSI has been grappling with what it means to call AI systems, models and weights “Open Source.” This challenge comes at a time when companies and even regulations are using the term without a clear, agreed-upon definition.

From the outset, we knew we had to get it right. The Open Source values that have fueled so much software innovation—transparency, collaboration, freedom—needed to be the foundation for AI as well. But AI isn’t like traditional software, and that’s where our challenge began.

The origins: a podcast and a vision

When I first became Executive Director of OSI, I pitched the idea of exploring how Open Source principles apply to AI. We spent months strategizing, and the more we dove in, the more we realized how complex the task would be. We didn’t know much about AI at the time, but we were eager to learn. We turned to experts from various fields—a copyright lawyer, an ethicist, AI pioneers from Eleuther AI and Debian ML, and even an AI security expert from DARPA. Those conversations culminated in a podcast we created called Deep Dive AI, which I highly recommend to anyone interested in this topic.

Through those early discussions, it became clear that AI and machine learning are not software in the traditional sense. Concepts like “source code,” which had been well-defined in software thanks to people like Richard Stallman and the GNU GPL, didn’t apply 1:1 to AI. We didn’t even know what the “program” was in AI, nor could we easily determine the “preferred form for making modifications”—a cornerstone of Open Source licensing. 

This realization sparked the need to adapt the Open Source principles we all know so well to the unique world of AI.

Co-designing the future of Open Source AI

Once we understood the scope of the challenge, we knew that creating this definition couldn’t be a solo endeavor. It had to be co-designed with the global community. At the start of 2023, we had limited resources—just two full-time staff members and a small budget. But that didn’t stop us from moving forward. We began fundraising to support a multi-stakeholder, global conversation about what Open Source AI should look like.

We brought on Mer Joyce, a co-design expert who introduced us to creative methods that ensure decisions are made with the community, not for it. With her help, we started breaking the problem into smaller pieces and gathering insights from volunteers, AI experts and other stakeholders. Over time, we began piecing together what would eventually become v.0.0.9 of the Open Source AI Definition.

By early 2024, we had outlined the core principles of Open Source AI, drawing inspiration from the free software movement. We relied heavily on foundational texts like the GNU Manifesto and the Four Freedoms of software. From there, we built a structure that mirrored the values of freedom, collaboration and openness, but tailored specifically to the complexities of AI.

Addressing the unique challenges of AI

Of course, defining the freedoms was only part of the battle. AI and machine learning systems posed new challenges that we hadn’t encountered in traditional software. One of the key questions we faced was: What is the preferred form for making modifications in AI? In traditional software, this might be source code. But in AI, it’s not so straightforward. We realized that the “weights” of machine learning models—those parameters fine-tuned by data—are crucial. However, data itself doesn’t fit neatly into the Open Source framework.

This was a major point of discussion during the session. Code and weights need to be covered by an OSI-approved license because they represent the modifiable core of AI systems. However, data doesn’t meet the same criteria. We concluded that while data is essential for understanding and studying the system, it’s not the “preferred form” for making modifications. Rather, the data information and code requirements allow Open Source AI systems to be forked by third-party AI builders downstream using the same information as the original developers. These forks could include removing non-public or non-open data from the training dataset, in order to retrain a new Open Source AI system on fully public or open data. This insight was shaped by input from the community and experts who joined our study groups and voted on various approaches.

The road ahead: a collaborative future

As we wrap up this phase, the next step is gathering even more feedback from the community. The definition isn’t final yet, and it will continue to evolve as we incorporate insights from events like this summit. I’m incredibly grateful for the thoughtful comments we’ve already received from people all over the world who have helped guide us along this journey.

At the core of this project is the belief that Open Source AI should reflect the same values that have made Open Source a force for good in software development. We’re not there yet, but together, we’re building something that will have a lasting impact—not just on AI, but on the future of technology as a whole.

I want to thank everyone who has contributed to this project so far. Your dedication and passion are what make Open Source so special. Let’s continue to shape the future of AI, together.

Categories: FLOSS Research

Is “Open Source” ever hyphenated?

Thu, 2024-09-26 14:25

No! Open Source is never hyphenated when referring to software. If you’re familiar with English grammar you may have more than an eyebrow raised: read on, we have an explanation. Actually, we have two. 

We asked Joseph P. De Veaugh-Geiss, a linguist and KDE’s project manager, to provide us with an explanation. If that’s not enough, we have one more argument at the end of this post. 

Why Open Source is not hyphenated

In summary:

  • “open source” (no hyphen) is a lexicalized compound noun which is no longer transparent with respect to its meaning (i.e., open source is not just about being source-viewable, but also about defining user freedoms) and which can be further compounded (as in, for example, “open source license”);
  • by contrast, “open-source” (with a hyphen) is a compound modifier modifying the head noun (e.g. “intelligence”) with open having a standard dictionary meaning (i.e., “transparent” or “open to or in view of all”).
Open Source as a lexicalized compound noun

“Open source” is a lexicalized compound noun. Although it originates with the phrase “open source software”, today “open source” is itself a unique lexeme. An example from a Red Hat article:

Open source has become a movement and a way of working that reaches beyond software production.

The word open in “open source” does not have the meaning of “open” as one would find in the dictionary. Instead, “open source” also entails user freedoms: users of the software, for any purpose, do not have to negotiate with the rights holders to enjoy (use, improve, share, monetize) the software. That is, it is not only about transparency.

A natural example of this usage, in which the phrase open source license is clearly about more than just licensing transparency:

“Because Linux is released under an open source license, which prevents restrictions on the use of the software, anyone can run, study, modify, and redistribute the source code, or even sell copies of their modified code, as long as they do so under the same license.” (from the Red Hat website, https://www.redhat.com/en/topics/open-source/what-is-open-source)

Note that “open source license” is itself a compound noun phrase made up of the lexicalized compound noun “open source” + the noun “license”; same for “open source movement”, etc.

What is lexicalization?

According to the Lexicon of linguistics (Utrecht University), ‘lexicalization’ is a “phenomenon by which a morphologically complex word starts to behave like an underived word in some respect, which means that at least one feature (semantic, syntactic, or phonological) becomes unpredictable”.

Underived word here means the phrase has a specific, unique meaning not (necessarily) transparent from its component parts. For instance, a “black market” is not a market which is black but rather a specific kind of market: an illegal one. A “blackboard” can be green. In other words, the entire complex phrase can be treated as a single unit of meaning stored in the mental lexicon. The meaning of the phrase is not derived using grammatical rules.

Today, the meaning of open source is unpredictable, or semantically opaque, given its usage (at least by a subset of speakers): open source is about user freedoms, not just transparency.

Other examples of lexicalized compound nouns include “yellow journalism”, “purple prose”, “dirty bomb”, “fat chance”, “green card”, “blackbird”, “greenhouse”, “high school”, etc. I tried to think of examples which are composed of adjectives + nouns but with a specific meaning not derivable by the combination of the two. I am sure you can come up with many more!

In some cases, lexicalization results in writing the compound noun phrase together as a single word (‘blackboard’), in other cases not (‘green card’). One can also build larger phrases by combining the lexicalized compound noun with another noun (e.g., black market dealer, green card holder).

Hyphenated open-source is a compound modifier

By contrast, open in “open-source intelligence” is the dictionary meaning of “open”, i.e., “open to or in view of all” or “transparent”. In this case, open-source is a compound modifier/compound adjective with a meaning comparable to “source-viewable”, “source-available”, “source-transparent”.

For compound modifiers, the hyphenation, though not obligatory, is common and can be used to disambiguate.  The presence of a head noun like “intelligence” or “journalism” is obligatory for the compound-modifier use of open-source, unlike in lexicalized compounds.

Examples of other compound modifiers + a head noun: “long-term contract”, “single-word modifier”, “high-volume printer”, etc.

Examples

There are some examples of the compound-modifier use on Wikipedia where I think the difference in meaning between the lexicalized compound noun and the compound modifier becomes clear:

“Open-source journalism, a close cousin to citizen journalism or participatory journalism, is a term coined in the title of a 1999 article by Andrew Leonard of Salon.com.” (from Wikipedia)

“Open-source intelligence” is intelligence “produced from publicly available information that is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement” (from Wikipedia)

In these examples open-source clearly refers to transparent, viewable-to-all sources and not to something like ‘guaranteeing user freedoms’. Moreover, my intuition is that removing the hyphen from these examples would change the meaning, however subtly, and could make the original sentences incoherent (unless the reader silently reinterprets the phrase):

  • “open source journalism” would refer to journalism about open source software (in the lexicalized sense above), not to transparent, participatory journalism;
  • “open source intelligence” would refer to intelligence about open source software (in the lexicalized sense above, whatever that would mean!), not to intelligence from publicly available information.
The Open Source Initiative says: No hyphen!

If that explanation still doesn’t convince you, we invoke the rules of branding and “pull a Twitter” (who vandalized English with their “Who To Follow”): we say no hyphen!

Luckily others have already adopted the “no hyphen” camp, like the CNCF style guide. Debate closed.

If you like debates, let’s talk about capitalization: OSI in its guidelines chose to always capitalize Open Source because it is a proper noun with a specific definition. Which camp are you on?

Categories: FLOSS Research

Data Transparency in Open Source AI: Protecting Sensitive Datasets

Tue, 2024-09-24 06:49

The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Tarunima Prabhakar

I am the research lead and co-founder at Tattle, a civic tech organization that builds citizen-centric tools and datasets to respond to inaccurate and harmful content. My broad research interests lie at the intersection of technology, policy and global development. Prior to starting Tattle, I worked as a research fellow at the Center for Long-Term Cybersecurity at UC Berkeley, studying the deployment of behavioral credit scoring algorithms towards financial inclusion goals in the global majority. I’ve also been fortunate to work on award-winning ICTD and data-driven development projects with stellar non-profits. My career working in low-resource environments has turned me into an ardent advocate for Open Source development and citizen science movements.

Protecting Sensitive Datasets

I recently gave a lightning talk at IndiaFOSS about Uli, a project to co-design solutions to online gendered abuse in Indian languages. As part of this project, we’re building and maintaining datasets that are useful for machine learning models that detect abuse. The talk highlighted the importance of, and the care required in, choosing a license for sensitive data, and why open datasets in Open Source AI should be carefully considered.

With the Uli project, we created a dataset annotated by gender rights activists and researchers who speak Hindi, Tamil and Indian English. Then, we fine-tuned Twitter’s XLM-RoBERTa model to detect gender abuse, which we deployed as a browser plugin. When activated, the Uli plugin would redact abusive tweets from a person’s feed. Another dataset we created was of slur words in the three languages that might be used to target people. Such a list is not only useful for the Uli plugin (these words are redacted from web pages when the plugin is installed) but also for any platform needing to moderate conversations in these languages. At the time of the plugin’s launch, we chose to license the two datasets under an Open Data License (ODL). The model is hosted on Hugging Face and the code is available on GitHub.
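Word-list redaction of the kind the Uli plugin performs can be sketched in a few lines. The following is a hypothetical illustration, not Uli’s actual code: the function name, the mask character and the word-boundary matching strategy are all assumptions.

```python
import re

def redact(text: str, flagged: list[str], mask: str = "*") -> str:
    """Replace each whole-word occurrence of a flagged term with a mask.

    Word boundaries keep substrings inside harmless words intact,
    and re.IGNORECASE catches case variants.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in flagged) + r")\b",
        re.IGNORECASE,
    )
    # Mask length mirrors the matched word so sentence shape is preserved.
    return pattern.sub(lambda m: mask * len(m.group()), text)

print(redact("you absolute fool", ["fool"]))  # → you absolute ****
```

A real deployment for Hindi and Tamil would also need Unicode-aware tokenization and normalization, which this sketch glosses over.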

As we have continued to maintain and grow Uli, we have reconsidered how we license the data. Several factors come into play. First, annotating a dataset on abuse is labor-intensive and mentally exhausting, and the expert annotators should be fairly compensated for their expertise. Second, when these datasets are used by platforms for abuse detection, a potential loophole appears: if abusive users realize the list of flagged words is public, they can change their language to evade moderation.

These concerns have led us to think carefully about how to license the data. On one end of the spectrum, we could continue to make everything open, regardless of commercial use. On the other end, we could keep all the data closed. We’ve historically operated as an Open Source organization, and every decision we make about data access impacts how we license our machine learning models as well. We are trying to find a happy medium that balances the numerous concerns: recognition of effort and the effectiveness of the data on one hand, and transparency, adaptability and extensibility on the other.

As we’ve thought about different strategies for data licensing, we haven’t been sure what that would mean for the license of the machine learning models. And that’s partly because we don’t have a clear definition for what “Open Source AI” really means. 

It is for this reason that we’ve closely followed the Open Source Initiative’s (OSI) process for converging on a definition for Open Source AI. OSI has been grappling with the definition of “Open Source AI” as it pertains to the four freedoms: the freedom to use, study, modify, and share. Over the past year, the OSI has been iterating on a definition for Open Source AI, and they’ve reached a point where they propose the following:

  • Open weights: The model weights and parameters should be open.
  • Open source code: The source code used to train the system should be open.
  • Open data or transparent data: Either the dataset should be open, or there should be enough detailed information for someone to recreate the dataset.

It’s important to note that the dataset doesn’t necessarily have to be open. This departure from requiring a maximally open dataset accounts for the complexity in the collection and management of data driving real-world ML applications. While frontier models must deal with copyright and privacy concerns, many smaller projects like ours worry about the uneven power dynamics between those creating the data and the entities using it. In our specific case, opening the data also reduces its efficacy.

But having struggled with papers that describe research or data without sharing the dataset itself, I also recognize that ‘enough detailed information’ might not be enough information to repeat, adapt or extend another group’s work. In the end, the question becomes: how much information about the dataset is enough to consider the model “open”? It’s a fine line, and not everyone is comfortable with OSI’s stance on this issue. For our project in particular, we are considering the option of a staggered data release: older data is released under an open data license, while the newest data requires users to request access.

If you have strong opinions on this process, I encourage you to visit the OSI website and leave feedback. The OSI process is influential, and your input on open weights, open code, and their specifications around data openness could shape the future of Open Source AI.

You can learn more about the participatory process behind the Uli dataset here, and about Uli and Tattle on their respective websites. 

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the forum: share your comment on the drafts.
  • Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
  • Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Categories: FLOSS Research

Open Source AI Definition – Weekly update September 23

Mon, 2024-09-23 09:31
Draft v.0.0.9 of the Open Source AI Definition is available for comments
  • @nemobis points out that the term “skilled person” in the Open Source AI Definition needs clarification, especially when considering different legal systems. The term could lead to misinterpretations, and he suggests adjusting the wording to focus on access to data. Additionally, the term “substantially equivalent system” also requires a more precise definition.
  • @shujisado adds that in Japan, the term “skilled person” is linked to patent law, which could complicate its interpretation. He proposes using a simpler term, like “person skilled in technology,” to avoid unnecessary debate.
  • @stefano asks for suggestions for a better alternative to “skilled person,” such as “practitioner” or “AI practitioner.”
  • @kjetilk jokingly suggests lowering the bar to “any random person with a computer,” emphasizing the importance of accessibility in open source, allowing anyone to engage regardless of formal training.
  • @samj highlights that byte-for-byte reproducibility is unrealistic, as randomness and hardware variability make exact replication unachievable, similar to how different binaries perform equivalently despite differing checksums.
  • @samj notes the existence of models like StarCoder2 and OLMo as examples of Open Source AI, refuting the claim that no models meet the standard. He stresses the need for the definition to encourage the development of new models rather than settling for an inadequate status quo.
Case-in-Point: Zuckerberg’s blog on Open Source
  • @kjetilk reflects on Mark Zuckerberg’s blog post about Llama 3.1, where Zuckerberg claims that “Open Source AI Is the Path Forward.” He points out that while it’s easy to agree with Zuckerberg’s sentiment, Llama 3.1 isn’t truly Open Source and wouldn’t meet the criteria for compliance under the OSAID. This raises important questions about how to engage with Meta: should the Open Source community push them away, or guide them toward creating OSAID-compliant models? Furthermore, @kjetilk wonders how this affects perceptions of Open Source, especially in light of EU legislation and the broader governance issues around Open Source.
  • @shujisado responds by noting that the Open Source Initiative (OSI) has already made it clear that Llama 2 (and by extension Llama 3.1) does not meet the Open Source definition, despite Zuckerberg’s claims. He suggests that Zuckerberg might be using a different definition of “open source,” particularly given the unclear legal landscape around AI training data and copyright. In his view, the creation of the Open Source AI Definition (OSAID) is the community’s formal response to Meta’s claims.
Open Source AI Definition Town Hall – September 20, 2024
Categories: FLOSS Research

David Manset: Voices of the Open Source AI Definition

Tue, 2024-09-17 14:34


Meet David Manset

What’s your background related to Open Source and AI?

My background in Open Source and AI is shaped by my ongoing experience as the senior coordinator of the Open Source Ecosystem Enabler (OSEE) project at the United Nations International Telecommunication Union (ITU). The project, developed in collaboration with the UN Development Programme and funded by the EU’s Directorate-General for International Partnerships, supports countries in developing digital public goods and services using Open Source. In this capacity, a significant part of my work involves driving Open Source initiatives for a variety of use cases in the public sector.

Witnessing the birth of an Open Source AI definition during the DPGA Annual Members meeting in 2023, I have since then been contributing to the Open Source AI agenda, and more recently to various Open Source AI initiatives within the ITU Open Source Program Office (OSPO). Additionally, I co-lead the Open Source AI for Digital Public Goods (OSAI4DPG) track at AI for Good, focusing on creating AI-driven public goods that are both accessible and affordable.

One of my recent achievements includes co-organizing the AIntuition hackathon aimed at developing cost-effective Open Source AI solutions. This event focused on utilizing Retrieval Augmented Generation (RAG) and Large Language Models (LLMs) to create a basic yet understandable and adaptable prototype implementation for public administration. My efforts in this area highlight my commitment to practical and usable AI tools that meet public sector needs.

Prior to my role at ITU, I worked in the private sector, where I developed AI services that enhanced healthcare services and protected patients/citizens. This experience gives me a well-rounded perspective on implementing and scaling Open Source AI technologies for public benefit.

What motivated you to join this co-design process to define Open Source AI?

My motivation to participate in this co-design process for defining Open Source AI is deeply rooted in my former experiences in software development and as the coordinator of the OSEE project, where my focus lies in enhancing digital public services and developing digital public goods. Open Source AI indeed presents a unique opportunity, especially for the public sector, to adopt cost-effective and scalable solutions that can significantly improve public services. However, to harness these benefits, it is imperative to establish a clear, standardized and consensual definition of Open Source AI. This definition will serve as a foundational guideline, ensuring transparency and understanding of the specific types of AI technologies being developed and implemented.

Moreover, my involvement is driven by the critical work of the ITU OSPO, particularly in developing Open Source AI solutions tailored for low- and middle-income countries (LMICs). These regions often face challenges such as scarce resources and limited representation in global AI training processes. By contributing to the development of Open Source AI, I aim to support these countries in accessing affordable and effective AI technologies, thereby promoting greater equity in AI development and utilization. This effort is not just about technology but also about fostering global inclusivity and ensuring that the benefits of AI are accessible to all.

Why do you think AI should be Open Source?

AI should be Open Source for several compelling reasons, especially when considering its potential impact on global development and governance. First, transparency, traceability and explainability are crucial, particularly in digital public services. Open Source AI allows public scrutiny of the algorithms and models used, ensuring that decision-making processes are transparent and accountable. This is vital for building trust in AI systems, especially in sectors like healthcare, education and public administration, where decisions can significantly impact individuals and communities.

Second, accessibility and affordability are key benefits for LMICs. Open Source AI lowers the barriers to entry, enabling these countries to access cutting-edge technologies without the prohibitive costs associated with proprietary systems. This democratization of AI technology ensures that even resource-constrained nations can harness AI’s transformative potential. Moreover, Open Source AI fosters greater representation and competition for LMICs in the global AI landscape. By contributing to and benefiting from Open Source projects, these countries can influence AI development and ensure that their specific needs and contexts are considered.

Finally, as AI increasingly becomes a foundational technology, Open Source serves as a universal resource that can be adapted and improved by anyone, promoting innovation and inclusivity across the globe.

What new perspectives or ideas did you encounter while participating in the co-design process?

Participating in the co-design process introduced me to several new perspectives and ideas that have deepened my understanding of the role of Open Source AI, particularly in supporting global development. One key insight is the realization that LMICs would significantly benefit from having access to an Open Source AI reference implementation. This concept, which we are actively working on, would provide these countries with a practical, ready-to-use model for AI development, helping them overcome resource constraints and accelerate their AI initiatives.

Another important perspective is that Open Source AI requires solid foundational elements—an Open Source mindset, adherence to best practices, and generalized policies must be embedded across all organizations involved. This is not just about technology; it’s about fostering a culture and infrastructure that supports Open Source principles at every level. Notably, ITU is now coordinating the definition of a common policy framework for United Nations Open Source initiatives, which will be crucial in guiding future Open Source AI developments. This framework will ensure that Open Source AI projects are supported by robust Open Source policies, promoting sustainable and equitable technological advancement worldwide.

What do you think the primary benefit will be once there is a clear definition of Open Source AI?

The primary benefit of a clear definition of Open Source AI will be the establishment of a unified framework that ensures transparency, accessibility, and ethical standards in AI development. This clarity will enable broader adoption across various sectors, particularly in LMICs, by providing a reliable foundation for building and implementing AI technologies. It will also foster global collaboration, ensuring that AI advancements are inclusive and equitable, while promoting innovation through open contributions, ultimately leading to more trustworthy and widely beneficial AI solutions.

What do you think are the next steps for the community involved in Open Source AI?

Once a global standard definition of Open Source AI is established, the Open Source AI community should focus on several key steps to ensure its widespread adoption and effective implementation. These include developing comprehensive guidelines and best practices, creating reference implementations to help organizations, particularly in LMICs, adopt the standard, and enhancing global collaboration through international networks and partnerships. Additionally, launching education and awareness campaigns will be crucial for informing stakeholders about the benefits and practices of Open Source AI. Establishing a governance and compliance framework will help maintain the integrity of AI projects, while supporting policy development and advocacy will ensure alignment with national and international regulations. Finally, fostering innovation and research through funding, hackathons, and collaborative platforms will drive ongoing advancements in Open Source AI. These steps will help build a robust, inclusive, and impactful Open Source AI ecosystem that benefits societies globally.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the forum: share your comment on the drafts.
  • Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
  • Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Categories: FLOSS Research

Open Source AI Definition – Weekly update September 16

Mon, 2024-09-16 19:38
Week 37 summary

Endorse the Open Source AI Definition

Recommended Resources: US Copyright Office Guidance on TDM
  • @mjbommar encourages reviewing the U.S. Copyright Office’s guidance on text and data mining (TDM) exceptions, which provides clear explanations and limitations, especially focusing on non-commercial, scholarly, and teaching uses. He emphasizes that the TDM guidance operates within narrow parameters that are often misunderstood or overlooked.
Proposal to handle Data Openness in the Open Source AI definition [RFC]
  • @quaid proposes adding nuance to the Open Source AI (OSAI) Definition by introducing two designations: OSAI D+ (with open data) and OSAI D- (without open data, due to legitimate reasons beyond the creator’s control). He suggests using a dataset certificate of origin (dataset DCO) for self-verification to ensure compliance.
  • @kjetilk agrees that verification is key but questions whether data information alone is sufficient for verification. He highlights that verifying rights to the data may not always be possible.
  • @stefano appreciates the quadrant system’s clarity and confirms @quaid’s proposal for OSAI D- to be reserved for those with legitimate reasons for not sharing data.
  • @thesteve0 expresses skepticism about broadening the “Open Source” label. He argues that without access to both data and code, AI models cannot truly be Open Source and suggests labeling such models as “open weights” instead.
  • @shujisado notes the importance of data access in AI, pointing out that OSAID requires detailed information about how data is sourced, including provenance and selection criteria. He also discusses potential legal and ethical reasons for not sharing datasets.
  • @Shamar raises concerns about “openwashing” in AI, where developers might distribute a model with a different dataset, undermining trust. He argues that distinguishing between OSAI D+ and D- risks legal complications for derivative works, suggesting that models without open data should not be considered truly open.
  • @zack supports the idea of a tiered system (D+ and D-) as an improvement over the current situation, as it incentivizes progress from D- to D+. He is skeptical about verifiability but sees potential in the branding aspect of the proposal.
Welcome diverse approaches to training data within a unified Open Source AI Definition
  • @stefano asks @arandal about suggested edits, which include renaming data as “source data,” allowing Open Source AI developers to require downstream modifications with open data, and permitting downstream developers to use open data to fine-tune models trained on non-public data. He further asks whether @arandal sees training data as relating to model weights the way source code relates to binary code.
  • @shujisado agrees with @stefano and points out that while many interpret OSD-compliant licenses to include CC4 and CC0, OSI has not officially evaluated Creative Commons licenses for compliance. He highlights concerns about CC0’s patent defense, which could be crucial for datasets.
  • @mjbommar echoes the concerns about patent defense, noting it as a critical issue in both software and data licensing.
  • @Shamar supports the first two suggestions but argues that models trained on non-public data cannot meet an “Open Source AI” definition, as they limit the freedom to study and modify, which are core principles of Open Source.
On the current definition of Open Source AI and the state of the data commons
  • @nick shares an article by Nathan Lambert, reviewed by key figures in the Open Source AI space, discussing the challenges of training data and the current Open Source AI definition. Percy Liang’s view (shared on X) is highlighted: he suggests that releasing an entire dataset is neither sufficient nor necessary for Open Source AI, and emphasizes the need for detailed code of the data processing pipeline for transparency, beyond just releasing the dataset.
  • @shujisado discusses the legal nuances of using U.S. government documents in AI training, emphasizing that while they may be used in the U.S., legal complications arise in other jurisdictions.
  • @Shamar stresses that Open Source AI should provide all the necessary data and processing information to recreate a system, otherwise, calling it Open Source is “open washing.”
[RFC] Separating concerns between Source Data and Processing Information
  • @Shamar proposes a clearer distinction between “source data” and “processing information” in the Open Source AI definition to ensure transparency and reproducibility. He suggests source data should be publicly available under the same terms that allowed its original use, while the process used to train the system should be shared under an Open Source license. His formulation aims to prevent loopholes that could lead to open-washing and emphasizes the importance of granting all four freedoms (study, modify, distribute, and use) to qualify as Open Source AI.
  • @nick disagrees, arguing that @Shamar’s proposal misunderstands the difference between the rights to use data for training and the rights to distribute it. He also challenges the claim that exact replication of AI systems can be guaranteed, even with access to the same data.
Open Source AI Definition Town Hall – September 13, 2024

Categories: FLOSS Research

Copyright law makes a case for requiring data information rather than open datasets for Open Source AI

Wed, 2024-09-11 16:04

The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Felix Reda

Photo credit: Volker Conradus, volkerconradus.com (CC BY 4.0 International).

Felix Reda (he/they) has been an active contributor to the Open Source AI Definition (OSAID) co-design process, bringing his personal interest and expertise in copyright reform to the online forums. Working in digital policy for over ten years, including serving as a member of the European Parliament from 2014 to 2019 and working with the strategic litigation NGO Gesellschaft für Freiheitsrechte (GFF), Felix is currently the director of developer policy at GitHub. He is also an affiliate of the Berkman Klein Center for Internet and Society at Harvard and serves on the board of the Open Knowledge Foundation Germany. He holds an M.A. in political science and communications science from the University of Mainz, Germany.

Data information as a viable alternative

Note: The original text was contributed by Felix Reda to the discussions happening on the Open Source AI forum as a response to Stefano Maffulli’s post on how the draft Open Source AI Definition arrived at its current state, the design principles behind the data information concept and the constraints (legal and technical) it operates under.

When we look at applying Open Source principles to the subject of AI, copyright law comes into play, especially for the topic of training data access. Open datasets have been a continuous discussion point in the collaborative process of writing the Open Source AI Definition. I would like to explain why the concept of data information is a viable alternative for the purposes of the OSAID.

The definition of Open Source software has an access element and a legal element – the access element being the availability of the source code and the legal element being a license rooted in the copyright-protection given to software. The underlying assumption is that the entity making software available as Open Source is the rights holder in the software and is therefore entitled to make the source code available without infringing the copyright of a third party, and to license it for re-use. To the extent that third-party copyright-protected material is incorporated into the Open Source software, it must itself be released under a compatible Open Source license that also allows the redistribution.

When it comes to AI, the situation is fundamentally different: The assumption that an Open Source AI model will only be trained on copyright-protected material that the developer is entitled to redistribute does not hold. Different copyright regimes around the world, including the EU, Japan and Singapore, have statutory exceptions that explicitly allow text and data mining for the purposes of AI training. The EU text and data mining exceptions, which I know best, were introduced with the objective of facilitating the development of AI and other automated analytical techniques. However, they only allow the reproduction of copyright-protected works (aka copying), but not the making available of those works (aka posting them on the internet).

That means that an Open Source AI definition that would require the republication of the complete dataset in order for an AI model to qualify as Open Source would categorically exclude Open Source AI models from the ability to rely on the text and data mining exceptions in copyright – that is despite the fact that the legislator explicitly decided that under certain circumstances (for example allowing rights holders to declare a machine-readable opt-out from training outside of the context of scientific research) the use of copyright-protected material for the purposes of training AI models should be legal. This result would be particularly counterproductive because it would even render Open Source AI models illegal in situations where the reproducibility of the dataset would be complete by the standards discussed on the OSAID forum.

Examples

Imagine an AI model that was trained on publicly accessible text on the internet that was version-controlled, for which the rights holder had not declared an opt-out, but which the rights holder had also not put under a permissive license (all rights reserved). Using this text as training data for an AI model would be legal under copyright law, but re-publishing the training dataset would be illegal. Publishing information about the training dataset that included the version of the data that was used, when and how it was retrieved from which website, and how it was tokenized would meet the requirements of the OSAID v 0.0.8 if (and only if) it put a skilled person in the position to build their own dataset to recreate an equivalent system. 

Neither the developer of the original Open Source AI model nor the skilled person recreating it would violate copyright law in the process, unlike the scenario that required publication of the dataset. Including a requirement in the OSAID to publish the data, in which the AI developer typically does not hold the copyright, would have little added benefit but would drastically reduce the material that could be used for training, despite the existence of explicit legal permissions to use that content for AI training. I don’t think that would be wise.

The international concern of public domain

While I support the creation of public domain datasets that can be republished without restrictions, I would like to caution against pointing to these efforts as a solution to the problem of copyright in training datasets. Public domain status is not harmonized internationally – what is in the public domain in one jurisdiction is routinely protected by copyright in other parts of the world. For example, in US discourse it is often assumed that works generated by US government employees are in the public domain. They are not: they are in the public domain only in the US, while they are copyright-protected in other jurisdictions.

The same goes for works in which copyright has expired: Although the Berne Convention allows signatory countries to limit the copyright term on works until protection in the work’s country of origin has expired, exceptions to this rule are permitted. For example, although the first incarnation of Mickey Mouse has recently entered the public domain in the US, it is still protected by copyright in Germany due to an obscure bilateral copyright treaty between the US and Germany from 1892. Copyright protection is not conditional on registration of a work, and no even remotely comprehensive, reliable rights information on the copyright status of works exists. Good luck to an Open Source AI developer who tried to stay on top of all of these legal pitfalls.

Bottom line

There are solid legal permissions for using copyright-protected works for AI training (reproductions). There are no equivalent legal permissions for incorporating copyright-protected works into publishable datasets (making available). What an Open Source AI developer thinks is in the public domain and therefore publishable in an open dataset regularly turns out to be copyright-protected after all, at least in some jurisdictions. 

Unlike reproductions, which only need to follow the copyright law of the country in which the reproduction takes place, making content available online needs to be legal in all jurisdictions from which the content can be accessed. If the OSAID required the publication of the dataset, this would routinely lead to situations where Open Source AI models could not be made accessible across national borders, thus impeding their collaborative improvement, one of the great strengths of Open Source. I doubt that with such a restrictive definition, Open Source AI would gain any practical significance. Tragically, the text and data mining exceptions that were designed to facilitate research collaboration and innovation across borders, would only support proprietary AI models, while excluding Open Source AI. The concept of data information will help us avoid that pitfall while staying true to Open Source principles.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the forum: share your comment on the drafts.
  • Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
  • Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Categories: FLOSS Research

Jordan Maris joins OSI

Tue, 2024-09-10 09:32

Helen Keller said, “Alone we can do so little; together we can do so much.” Although she wouldn’t have understood this 2024 expression, we know “she nailed it.” It takes many of us working together to truly accomplish great things. That’s why the OSI staff is so excited to welcome Jordan Maris to our team.

As OSI’s European Policy Analyst, Jordan will work to build a bridge between European Union legislators, the OSI and the wider Open Source community. He will monitor upcoming EU policies and flag issues and opportunities, educate and inform EU lawmakers about Open Source and its benefits, represent the OSI at EU-level events and conferences, and provide analysis and support to the OSI’s board and members on EU policy issues. He will also work closely with other Open Source foundations and organizations to make sure the voice of the Open Source community is heard at an EU level.

Jordan comes well-equipped with the experience he needs to excel in this role. He worked for three years with members of the European Parliament. In his previous position as a senior parliamentary policy advisor, he fought for the Open Source community on laws such as the AI Act, European Digital Identity, Data Act, Product Liability Directive, and Cyber-Resilience Act. He is a strong advocate for the Public Money–Public Code principle and a long-time user of and occasional contributor to Open Source software. He speaks English, French and German.

When asked about his vision for the future of Open Source, Jordan replied, “A world where Open Source is the rule — not the exception, and where developers and communities are consistently supported, listened to and valued.” 

Jordan says, “I’m looking forward to being able to devote more time to raising awareness about Open Source among lawmakers and to bringing together the Open Source community and EU lawmakers so that new laws better reflect the needs of the Open Source community.”

Please join me in welcoming Jordan to the team.

Categories: FLOSS Research

Open Source AI Definition – Weekly update September 9

Mon, 2024-09-09 13:02

Week 36 summary 

Draft v.0.0.9 of the Open Source AI Definition is available for comments
  • @Shamar agrees with @thesteve0 and emphasizes that AI systems consist of two parts: a virtual machine (architecture) and the weights (the executable software). He argues that while weights are important, they are not sufficient to study or fully understand an AI model. For a system to be truly Open Source, it must provide all the data used to recreate an exact copy of the model, including random values used during the process. Without this, the system should not be labeled Open Source, even if the weights are available under an Open Source license. @Shamar suggests calling such systems “freeware” instead and ensuring the Open Source AI Definition aligns with the Open Source Definition.
  • @jberkus questions whether creating an exact copy of an AI system is truly possible, even with access to all the training data, or if slight differences would always exist.
  • @shujisado explains that under Japan’s copyright law, AI training on publicly available copyrighted works is permissible, but sharing the datasets created during training requires explicit permission from copyright holders. He notes that while AI training within legal limits may be allowed in many jurisdictions, making all training data freely available is unlikely. He adds that the current Open Source AI Definition strikes a reasonable balance given global intellectual property rights but suggests that more specific language might help clarify this further.
Share your thoughts about draft v0.0.9
  • @marianataglio suggests including hardware specifications, training time, and carbon footprint in the Open Source AI Definition to improve transparency. She believes this would enhance reproducibility, accessibility, and collaboration, while helping practitioners estimate computational costs and optimize models for more efficient training.
Open Source AI Definition Town Hall – September 6, 2024

Welcome diverse approaches to training data within a unified Open Source AI Definition

Explaining the concept of Data information
  • @Senficon highlights a concern from the open science community that, while EU copyright law allows reproductions of protected content for research, it restricts making the research corpus available to third parties. This limits research reproducibility and open access, as it aims to protect rights holders’ revenue.
  • @kjetilk agrees with the observation but questions the assumption that making content publicly available would significantly harm rights holders’ revenue. He believes such policies should be based on solid evidence from extensive research.
Categories: FLOSS Research

Members Newsletter – September 2024

Wed, 2024-09-04 13:35
September 2024 Members Newsletter

It’s been a busy couple of months, and things are going to stay that way as we approach All Things Open in October. Version 0.0.9 of the Open Source AI Definition has been released after collecting months of community feedback.

We’re continuing our march towards a stable release by the end of October 2024, at All Things Open. Get involved by joining the discussion on the forum, finding OSI staff around the world, and online at the weekly town halls. The community continues iterating through drafts after meeting diverse stakeholders at the worldwide roadshow, collecting feedback and carefully looking for new arguments in dissenting opinions. All thanks to a grant by the Alfred P. Sloan Foundation. We also need to decide how to best address the reviews of new licenses for datasets, documentation and the agreements governing model parameters. 

The lively conversations will continue at conferences, town halls, and online. The first two stops were at AI_dev and Open Source Congress. Other events are planned to take place in Africa, South America, Europe and North America.

On a separate delightful note, the Open Source community got some welcome news on August 29, as Elastic returned to the community by adding the AGPL licensing option for Elasticsearch and Kibana. This decision is confirmation that shipping software with licenses that comply with the Open Source Definition is valuable—to the maker, to the customer, and to the user. Elastic’s choice of a strong copyleft license signals the continuing importance of that license and its dual effect: one, it’s designed to preserve the user’s freedoms downstream, and two, it also grants strong control over the project by the single-vendor developers.

We’re encouraged to see Elastic return to the Open Source community. And who knows… maybe others will follow suit!

Stefano Maffulli

Executive Director, OSI 

I hold weekly office hours on Fridays with OSI members: book time if you want to chat about OSI’s activities, if you want to volunteer or have suggestions.

News from the OSI

Community input drives the new draft of the Open Source AI Definition

From the Research and Advocacy program

The Open Source AI Definition v0.0.9 has been released and collaboration continues at in-person events and in the online forums. Read what changes have been made, what to do next and how to get involved. Read more.

Three things I learned at KubeCon + AI_Dev China 2024

From the Research and Advocacy program

KubeCon China 2024 was a whirlwind of innovation, community and technical deep dives. As it often happens at these community events, I was blown away by the energy, enthusiasm and sheer amount of knowledge being shared. Read more.

Highlights from our participation at Open Source Congress

From the Research and Advocacy program

The Open Source Initiative (OSI) proudly participated in the Open Source Congress 2024, held from August 25-27 in Beijing, China. This event was a pivotal gathering for key individuals in the Open Source nonprofit community, aiming to foster collaboration, innovation, and strategic development within the ecosystem. Read more.

OSI in the news

Elasticsearch is open source, again

OSI at elastic.co

“Being able to call Elasticsearch and Kibana Open Source again is pure joy.” — Shay Banon, Elastic Founder and CTO. Read more.

Meta is accused of bullying the open source community

OSI at The Economist

Purists are pushing back against Meta’s efforts to set its own standard on the definition of open-source AI. Stefano Maffulli, head of the OSI, says Mr Zuckerberg “is really bullying the industry to follow his lead”. Read more.

Debate over “open source AI” term brings new push to formalize definition

OSI at Ars Technica

The Open Source Initiative (OSI) recently unveiled its latest draft definition for “open source AI,” aiming to clarify the ambiguous use of the term in the fast-moving field. The move comes as some companies like Meta release trained AI language model weights and code with usage restrictions while using the “open source” label. This has sparked intense debates among free-software advocates about what truly constitutes “open source” in the context of AI. Read more.

Other Highlights

Other news

News from OSI affiliates

News from OpenSource.net

Voices of the Open Source AI Definition

The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process.

2024 Generative AI Survey

This survey aims to understand the deployment, use, and challenges of generative AI technologies in organizations and the role of open source in this domain. Take the survey here.

Events

Upcoming events

CFPs

Thanks to our sponsors

New members and renewals
  • Mercado Libre

Interested in sponsoring, or partnering with, the OSI? Please see our Sponsorship Prospectus and our Annual Report. We also have a dedicated prospectus for the Deep Dive: Defining Open Source AI. Please contact the OSI to find out more about how your company can promote open source development, communities and software.

Support OSI by becoming a member!

Let’s build a world where knowledge is freely shared, ideas are nurtured, and innovation knows no bounds! 

Join the Open Source Initiative!

Categories: FLOSS Research

Highlights from our participation at Open Source Congress 2024

Wed, 2024-09-04 13:30

The Open Source Initiative (OSI) proudly participated in the Open Source Congress 2024, held from August 25-27 in Beijing, China. This event was a gathering for key individuals in the Open Source nonprofit community, aiming to foster collaboration, innovation, and strategic development within the ecosystem. Here are some highlights from OSI’s participation at the event.

Panel: Collaboration between Open Source Organizations

Stefano Maffulli, OSI’s Executive Director, played an important role in the panel on “Collaboration between Open Source Organizations.” This session, moderated by Daniel Goldscheider (Executive Director, OpenWallet Foundation) and Chris Xie (Board Advisor, Linux Foundation Research), brought together influential leaders, including Keith Bergelt (CEO, Open Invention Network), Bryan Che (Advisory Board Member, Software Heritage Foundation), Mike Milinkovich (Executive Director, Eclipse Foundation), Rebecca Rumbul (Executive Director, Rust Foundation), Xiaohua Xin (Deputy Secretary-General, OpenAtom Foundation), and Jim Zemlin (Executive Director, Linux Foundation). The panel discussed the importance of collaboration in addressing the challenges faced by the Open Source ecosystem and explored ways to strengthen inter-organizational ties.

Fireside Chat: Datasets, Privacy, and Copyright

Stefano Maffulli also led a fireside chat on “Datasets, Privacy, and Copyright” in the context of Open Source AI along with Donnie Dong (Steering Committee Member, Digital Asia Hub; Senior Partner, Hylands Law Firm). This session was particularly relevant given the growing concerns around AI and the legal implications of creating and distributing large datasets. The discussion provided valuable insights into how these issues intersect with Open Source principles and what steps the community can take to address them responsibly. Some questions addressed included the use of copyrighted material in training datasets; fair use in the context of AI training and content generation; and China’s AI regulatory framework.

Talk: The Open Source AI Definition

OSI’s involvement was further highlighted by Stefano Maffulli’s talk on “The Open Source AI Definition,” where he announced version 0.0.9 of the Open Source AI Definition (OSAID), a significant milestone resulting from a multi-year, global, and multi-stakeholder process. This version reflects the collective input of a diverse range of experts and community members who participated in extensive co-design workshops and public consultations, ensuring that the definition is robust, inclusive, and aligned with the principles of openness. Maffulli emphasized the importance of the “4 Freedoms of Open Source AI”—Use, Study, Modify, and Share—as foundational principles guiding the development of AI technologies. The session was particularly crucial for gathering feedback from the community in China, providing a platform for discussing the practical implications of the OSAID in different cultural and regulatory contexts.

Panel: The Future of Open Source Congress

Deborah Bryant, OSI’s US Policy Director, moderated a pivotal panel discussion on “The Future of Open Source Congress: Converting Ideas to Shared Action.” This session focused on how the community can transform discussions into actionable strategies, ensuring the continued growth and impact of Open Source globally.

Other highlights from the event

The “Unlocking Innovation: Open Strategies in Generative AI” panel led by Anni Lai (Chair of Generative AI Commons; Board member of LF AI & Data; Head of Open Source Operations, Futurewei) explored how openness is essential for advancing Generative AI innovation, democratizing access, and ensuring ethical AI practices. Panelists Richard Sikang Bian (Outreach Chair, LF AI & Data; Head of OSPO, Ant Group), Richard Lin (Member, OpenDigger Community; Head of Open Source, 01.ai), Ted Liu (Co-founder, KAIYUANSHE), and Zhenhua Sun (China Workgroup Chair, OpenChain; Open Source Legal Counsel, ByteDance) delved into the challenges of the Open Source generative AI landscape, such as “open washing,” inconsistent definitions, and the complexities of licensing. They highlighted the need for clear, standardized frameworks to define what truly constitutes Open Source AI, emphasizing that openness fosters transparency, accelerates learning, and mitigates biases. The panelists called for increased collaboration among stakeholders to address these challenges and further develop Open Source AI standards, ensuring that AI technologies are transparent, ethical, and widely adoptable.

In her closing keynote at the Open Source AI track, Amreen Taneja, Standards Lead at the Digital Public Goods Alliance (DPGA), emphasized the critical role of Open Source AI in advancing public good and supporting the Sustainable Development Goals (SDGs). She explained that Digital Public Goods (DPGs) are digital technologies made freely available to benefit society and highlighted the importance of OSAI in democratizing access to powerful AI technologies. Taneja outlined the DPGA’s efforts to align AI with public interests, including updating the DPG Standard to better accommodate AI, ensuring transparency in AI development, and promoting responsible AI practices that prioritize privacy and avoid harm. She stressed the need for rigorous evaluation, clear ownership, open licensing, and platform independence to drive the adoption of AI DPGs, ultimately aiming to create AI systems that are ethical, transparent, and beneficial for all.

Quotes from OSI Board and affiliates

Attending the Open Source Congress was really inspiring. Over two days, we participated in intensive discussions and exchanges with dozens of Open Source foundations and organizations worldwide, which was incredibly beneficial. I believe this will foster broader cross-community collaboration globally. I hope the conclusion of the second Open Source Congress marks the beginning of ongoing cooperation, allowing our “community of communities” to maintain regular communication and exchange. 

Nadia Jiang, Board Chair of KAIYUANSHE

Open Source development experience is all about two words: consensus and antifragile decision-making process. The most valuable part of this event is seeing and listening to all the executive directors, open-source leaders in the room, and being very comfortable with the information density and the constructiveness of the discussions. Towards the end of the day, what people care about are not fundamentally different and there are indeed really difficult questions to resolve. I feel the world becomes slightly better after this OSC, and that means a lot to have an event like this.

Richard Bian, Head of Ant Group OSPO; Outreach Chair, Linux Foundation AI & Data

Open Source is the cornerstone of innovation, transparency, and collaboration, driving solutions that benefit everyone. The Open Source Congress 2024 represented a significant step forward in fostering alignment and building consensus within the open source community. By bringing together diverse voices and ideas, it amplified our collective efforts to create a more open, inclusive, and impactful digital ecosystem for the future.

Amreen Taneja, Standards Lead, Digital Public Goods Alliance

Stefano Maffulli with Board Directors of KAIYUANSHE: Emily Chen, Nadia Jiang (photo credits), and Ted Liu.

Conclusion

OSI’s active participation in the Open Source Congress 2024 reinforced its leadership role in the global Open Source community. By engaging in critical discussions, leading panels, and contributing to the future direction of Open Source initiatives, OSI continues to shape the landscape of Open Source development, ensuring that it remains inclusive, innovative, and aligned with the values of the global community.

This event marked another successful chapter in OSI’s ongoing efforts to drive collaboration and innovation in the Open Source world. We extend our sincere thanks to the organizers of OSC and the Open Source community in China for creating a platform that brought together a diverse and dynamic group of stakeholders, enabling meaningful discussions and progress. We look forward to continuing these conversations and turning ideas into action in the years to come.

Categories: FLOSS Research

Open Source AI Definition – Weekly update September 2nd

Mon, 2024-09-02 10:17
Share your thoughts about draft v0.0.9
  • @mkai added concerns about how OSI will address AI-generated content from both open and closed source models, given current legal rulings that such content cannot be copyrighted. He also suggests clarifying the difference between licenses for AI model parameters and the model itself within the Open Source AI Definition.
  • @shujisado added that while media coverage of the OSAID v0.0.9 release is encouraging, he is not supportive of the idea of an enforcement mechanism to flag false open source AI. He believes this approach differs from OSI’s traditional stance and suggests it may be a misunderstanding.
  • @jplorre added that while LINAGORA supports the proposed definition, they propose clarifying the term “equivalent system” to mean systems that produce the same outputs given identical inputs. They also suggest removing the specific reference to “tokenizers” in the definition, as it may not apply to all AI systems.
    • @shujisado agreed with the need for clarification on “equivalent system” but noted that identical outputs cannot always be guaranteed in general LLMs. He suggests that this clarification might be better suited for the checklist rather than the OSAID itself.

Draft v.0.0.9 of the Open Source AI Definition is available for comments

  • @adafruit reconnects with @webmink and proposes updates to the Open Source AI Definition, including adding requirements for prompt transparency and data access during AI training. These updates aim to enhance the ability to audit, replicate, and modify AI models by providing detailed logs, documentation, and public access to prompts used during the training phase.
    • @webmink appreciates the proposal but points out that it seems specific to a single approach, suggesting that it may need broader applicability.
  • @thesteve0 criticizes the current definition, arguing that it does not grant true freedom to modify AI models because the weights, which are essential for using the model, cannot be reproduced without access to both the original data and code. He suggests that models sharing only their weights, especially when built on proprietary data, should be labeled as “open weights” rather than “open source.” He also expresses concern about the misuse of the “open source” label by some AI models, citing specific examples where the term is being abused.
Open-washing and unspoken assumptions of OSS
  • @pranesh added that it might be helpful to explicitly state that the governance of open-source AI is out of scope for OSAID, but also notes that neither the OSD nor the free software definition explicitly mention governance, so it may not be necessary.
  • @kjetilk added that while governance issues have traditionally been unspoken, this unspoken nature is a key problem that needs addressing. He suggests that OSI should explicitly declare governance out of scope to allow others to take on this responsibility.
  • @mjbommar added support for making an official statement that OSI does not intend to control governance, noting concerns that some might fear OSI is moving towards a walled governance approach. He references past regrets about not controlling the “open source” trademark as a means to combat open-washing.
  • @nick added assurance that OSI has no intention of creating a walled governance garden, reaffirming the organization’s long-standing position against such control.
  • @shujisado added that there seems to be a consensus within the OSAID process that governance is out of scope, and notes that related statements have already been moved to the FAQ section in recent versions.
Explaining the concept of Data information
  • @pranesh mentions that, from a legal perspective, the percentage of infringement matters, citing the “de minimis” doctrine and defenses like “fair use” that consider the amount and purpose of infringement. He emphasizes that copyright laws in different jurisdictions vary, and not all recognize the same defenses as in the US.
  • @mjbommar argues that the scale and nature of AI outputs make the “de minimis” defense irrelevant, especially when AI models generate significant amounts of copyrighted content. He stresses that the economic impact of AI-generated content is a key factor in determining whether it qualifies as transformative or infringes copyright.
  • @shujisado highlights that in Japan, using copyrighted works for AI training is generally treated as an exception under copyright law, a stance that is also being adopted by neighboring East Asian countries. He suggests that approaches like the EU Directive are unlikely to become mainstream in Asia.
  • @mjbommar acknowledges the global focus on US/EU laws but points out that many commonly used models are developed by Western organizations. He questions how Japan’s updated copyright laws align with international treaties like WCT/DMCA, expressing concern that they may allow practices that conflict with these agreements.
    • @shujisado responds by stating that Japan’s copyright laws, including Article 30-4, were carefully crafted to comply with international standards, such as the Berne Convention and the WIPO Copyright Treaty, ensuring that they meet the required legal frameworks.
Welcome diverse approaches to training data within a unified Open Source AI Definition
  • @arandal emphasizes the importance of the Open Source Definition (OSD) as a unifying framework that accommodates diverse approaches within the open-source community. She argues that AI models, being a combination of source code and training data, should have their diversity in handling data explicitly recognized in the Open Source AI Definition. She proposes specific text changes to the draft to clarify that while some developers may be comfortable with proprietary data, others may not, and both approaches should be supported to ensure the long-term success of open-source AI.
  • @mjbommar appreciates the spirit of @arandal’s proposal but adds that the OSI currently lacks specific licenses for data, which is why it is crucial for the OSI to collaborate with Creative Commons. Creative Commons maintains the ecosystem of “data licenses” that would be necessary under the proposed revisions to the Open Source AI Definition.
  • @arandal agrees with the need for collaboration with organizations like Creative Commons, noting that this coordination is already reflected in checklist v. 0.0.9. She suggests that such collaboration is necessary even without the proposed revisions to ensure the definition accurately addresses data licensing in AI.
  • @nick acknowledges the importance of working with organizations like Creative Commons and mentions that OSI is in ongoing communication with several relevant organizations, including MLCommons, the Open Future Foundation, and the Data and Trust Alliance. He highlights the recent publication of the Data Provenance Standards by the Data and Trust Alliance as an example of the kind of collaborative work that is being pursued.
  • @mjbommar reiterates the need for explicit coordination with Creative Commons, arguing that the OSI cannot realistically finalize the Open Source AI Definition without such collaboration. He also suggests that the OSI should explore AI preference signaling and work with Creative Commons and SPDX/LF to establish shared standards, which should be part of the OSAID standard’s roadmap.

Join this week’s town hall to hear the latest developments, give your comments and ask questions.

Register for the town hall
Categories: FLOSS Research

Ezequiel Lanza: Voices of the Open Source AI Definition

Wed, 2024-08-28 09:38

The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Ezequiel Lanza

What’s your background related to Open Source and AI?

I’ve been working in AI for more than 10 years (Yes, before ChatGPT!). With a background in engineering, I’ve consistently focused on building and supporting AI applications, particularly in machine learning and data science. Over the years, I’ve contributed to and collaborated on various projects. A few years ago, I decided to pursue a master’s in data science to deepen my theoretical knowledge and further enhance my skills. Open Source has also been a significant part of my work; the frameworks, tools and community have continually drawn me in, making me an active participant in this evolving conversation for years.

What motivated you to join this co-design process to define Open Source AI?

AI owes much of its progress to Open Source, and it’s essential for continued innovation. My experience in both AI and Open Source spans many years, and I believe this co-design process offers a unique chance to contribute meaningfully. It’s not just about sharing my insights but also about learning from other professionals across AI and different disciplines. This collective knowledge and these diverse perspectives make the initiative truly powerful and enriching, as we shape the future of Open Source AI together.

Can you describe your experience participating in this process? What did you most enjoy about it, and what were some of the challenges you faced?

Participating in this process has been both rewarding and challenging. I’ve particularly enjoyed engaging with diverse groups and hearing different perspectives. The in-person events, such as All Things Open in Raleigh in 2023, have been valuable for fostering direct collaboration and building relationships. However, balancing these meetings with my work duties has been challenging. Coordinating schedules and managing time effectively to attend all the relevant discussions can be demanding. Despite these challenges, the insights and progress have made the effort worthwhile.

Why do you think AI should be Open Source?

We often say AI is everywhere, and while that’s partially true, I believe AI will be everywhere, significantly impacting our lives. However, AI’s full potential can only be realized if it is open and accessible to everyone. Open Source AI should also foster innovation by enabling developers and researchers from all backgrounds to contribute to and improve existing models, frameworks and tools, allowing freedom of expression. Without open access, involvement in AI can be costly, limiting participation to only a few large companies. Open Source AI should aim to democratize access, allowing small businesses, startups and individuals to leverage powerful tools that might otherwise be out of reach due to cost or proprietary barriers.

What do you think is the role of data in Open Source AI?

Data is essential for any AI system. Initially, from my ML-centric perspective, open and accessible datasets seemed crucial for effective ML development. However, I’ve reevaluated this perspective, considering how to adapt the system while staying true to Open Source principles. As AI models, particularly GenAI like LLMs, become increasingly complex, I’ve come to value the models themselves. For example, Generative AI requires vast amounts of data, and gaining access to this data can be a significant challenge.

This insight has led me to consider what I—whether as a researcher, developer or user—truly need from a model to use/investigate it effectively. While understanding the data used in training is important, having access to specific datasets may not always be necessary. In approaches like federated learning, the model itself can be highly valuable while keeping data private, though understanding the nature of the data remains important. For LLMs, techniques such as fine-tuning, RAG and RAFT emphasize the benefits of accessing the model rather than the original dataset, providing substantial advantages to the community.

Sharing model architecture and weights is crucial, and data security can be maintained through methods like model introspection and fine-tuning, reducing the need for extensive dataset sharing.

Data is undoubtedly a critical component. However, the essence of Open Source AI lies in ensuring transparency, then the focus should be on how data is used in training models. Documenting which datasets were used and the data handling processes is essential. This transparency helps the community understand the origins of the data, assess potential biases and ensure the responsible use of data in model development. While sharing the exact datasets may not always be necessary, providing clear information about data sources and usage practices is crucial for maintaining trust and integrity in Open Source AI.

Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?

Of course, it changed and evolved – that’s what a thought process is about. I’d be stubborn if I never changed my perspective along the way. I’ve often questioned even the most fundamental concepts I’ve relied on for years, avoiding easy or lazy assumptions. This thorough process has been essential in refining my understanding of Open Source AI. Engaging in meaningful exchanges with others has shown me the importance of practical definitions that can be implemented in real-world scenarios. While striving for an ideal, flawless definition is tempting, I’ve found that embracing a pragmatic approach is ultimately more beneficial.

What do you think the primary benefit will be once there is a clear definition of Open Source AI?

As I see it, the Open Source AI Definition will support the growth of the ecosystem, and it will be the first big step. The primary benefit of having a clear definition of Open Source AI will be increased clarity and consistency in the field. This will enhance collaboration by setting clear standards and expectations for researchers, developers and organizations. It will also improve transparency by ensuring that AI models and tools genuinely follow Open Source principles, fostering trust in their development and sharing.

A clear definition will create standardized practices and guidelines, making it easier to evaluate and compare different Open Source AI projects.

What do you think are the next steps for the community involved in Open Source AI?

The next steps for the community should start with setting up a certification process for AI models to ensure they meet certain standards. This could include tools to help automate the process. After that, it would be helpful to offer templates and best practice guides for AI models. This will support model designers in creating high-quality, compliant systems and make the development process smoother and more consistent.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the forum: share your comment on the drafts.
  • Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
  • Follow the weekly recaps: subscribe to our monthly newsletter and blog to stay up to date.
  • Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Categories: FLOSS Research

Three things I learned at KubeCon + AI_Dev China 2024

Tue, 2024-08-27 14:10

KubeCon China 2024 was a whirlwind of innovation, community and technical deep dives. As it often happens at these community events, I was blown away by the energy, enthusiasm and sheer amount of knowledge being shared. Here are three key takeaways that stood out to me:

1. The focus on AI and machine learning

AI and machine learning are increasingly integrated into cloud-native applications. At KubeCon China, I saw numerous demonstrations of how these technologies are being used to automate tasks, optimize resource utilization and improve application performance. From AI-powered observability tools to machine learning-driven anomaly detection, the potential for AI and ML in the cloud-native space is astounding.

Mer Joyce and Anni Lai introduced the new draft of the Open Source AI Definition (v.0.0.9) and the Model Openness Framework. 

We also saw a robot on stage demonstrating that teaching a robotic arm to use a spoon to help disabled people is not a programming issue but a data issue. This was probably my biggest learning moment: A robot can be “taught” to execute tasks by imitating humans. Follow Xavier Tao and the dora-rs project.

2. The growing maturity of cloud-native technologies

It’s clear that cloud-native technologies have come of age. From Kubernetes adoption to the rise of serverless platforms and edge computing, the ecosystem is thriving. In his keynote, Chris Aniszczyk announced that the Cloud Native Computing Foundation now hosts over 200 projects and that half of the contributors are based outside the US. The conference showcased a wide range of tools, frameworks and use cases that demonstrate the versatility and scalability of cloud-native architectures.

The presentation by Kevin Wang (Huawei) and Saint Jiang (NIO) showed how Containerd, Kubernetes and KubeEdge power the transition to electric vehicles. Modern cars are computers… no, full datacenters on wheels: collections of sensors feeding distributed applications that optimize battery usage, which in turn feed centralized programs that continuously improve the whole mobility system.

3. AI technology is removing the language barrier

I was amazed that I could follow the keynote sessions delivered in Chinese. I don’t speak the language, but I could read the automatic translation in real time, superimposed on the slides behind the speakers. The technology is jaw-dropping. Within a few years, there may be little work left for simultaneous interpreters or live transcribers.

Final thoughts

KubeCon + AI_Dev China was a testament to the power of Open Source collaboration hosted in one of the most amazing regions of the world. The conference brought together developers, operators and end-users from around the world to share their experiences, best practices and contributions to Open Source projects. This collaborative spirit is essential for driving innovation and ensuring the long-term success of cloud-native technologies.

Categories: FLOSS Research

Open Source AI – Weekly update August 26

Mon, 2024-08-26 15:33
Week 34 summary

Share your thoughts about draft v0.0.9

As we move toward the release of the first-ever Open Source AI Definition in October at All Things Open, the publication of the 0.0.9 draft brings us one step closer to realizing this goal.

  • OSAID 0.0.9 draft definition is live
  • Changelog includes:
    • New Feature: Clarified Open Source Models and Weights
      • Added a new paragraph under “What is Open Source AI” to define “system” as including both models and weights.
      • Clarified that all components of a larger system must meet the standard.
      • Updated paragraph after the “share” bullet to emphasize this point.  
    • New Section: Open Source Models and Open Source Weights
      • Added descriptions of components for both models and weights in machine learning systems.
      • Edited subsequent paragraphs to eliminate redundancy.
    • Training Data: Defined as a Benefit, Not a Requirement
      • Defined open, public, and unshareable non-public training data.
      • Explained the role of training data in studying AI systems and understanding biases.
      • Emphasized extra requirements for data to advance openness, especially in private-first areas like healthcare.
    • Separation of Checklist
      • The Checklist is now a separate document from the main Definition.
      • Fully aligned Checklist content with the Model Openness Framework (MOF).
    • Terminology Changes
      • Replaced “Model” with “Weights” under “Preferred form to make modifications” for consistency.
    • Explicit Reference to Recipients of the Four Freedoms
      • Added specific references to developers, deployers, and end users of AI systems.
    • Credits and References
      • Incorporated credit to the Free Software Definition.
      • Added references to conditions of availability of components, referencing the Open Source Definition.
  • Initial reactions on the forum: 
    • @shujisado praises the updates in version 0.0.9, particularly the decision to separate the checklist from the main document, which clarifies the intent behind OSAID. He also supports the separation of “code” and “weights,” noting that in Japan, “code” clearly falls under copyright, making this distinction logical. He acknowledges revisions in the checklist that consider the importance of complete datasets, even though he disagrees with making datasets mandatory. 
  • Comments on the draft on HackMD
    • @Joshua Gay adds that instead of narrowing the focus to machine-learning systems, the emphasis should be on “parameters” as a whole since weights are just one type of parameter. He suggests a rewrite that highlights making model parameters, such as weights and other settings, available under OSI-approved terms, with examples across various AI models.
      • He further suggests using broader language that covers more AI systems instead of narrower terminology. Specifically, he proposes replacing “Open Source models and Open Source weights” with “Open Source models and Open Source parameters,” and using “AI systems” instead of “machine learning systems.” Additionally, he recommends redefining an AI model to include architecture, parameters like weights and decision boundaries, and inference code, while referring to AI parameters as configuration settings that produce outputs from inputs.
    • Under “Open Source models and Open Source weights”, @shujisado adds that the last paragraph titled “Open Source models and Open Source weights” actually explains “AI model” and “AI weights,” leading to a mismatch between the title and content, and notes that these terms are not used elsewhere in the definition.
    • Under “Preferred form to make modifications to machine-learning systems”, @shujisado suggests some grammatical corrections.
  • Next steps
    • The OSI has recently presented at the following events: 
    • Iterate Drafts: Continue refining drafts with feedback from the worldwide roadshow, considering new dissenting opinions.
    • Review Licenses: Decide on the best approach for reviewing new licenses for datasets, documentation, and model parameters.
    • Enhance FAQ: Continue improving the FAQ to address emerging questions.
    • Post-Stable Release Plan: Establish a process for reviewing and updating future versions of the Open Source AI Definition.
 Explaining the concept of Data information
  •  @Kjetilk points out the legal distinction between using copyrighted works for AI training (reproduction) and incorporating them into publishable datasets, questioning the fairness of allowing exploitative models without compensation while potentially banning those that benefit society.
  • @shujisado clarifies that compensation for copyrighted works used in AI training is possible for both open source and closed models, distinguishing it from “royalty,” and notes that Japan’s copyright law exempts such uses for machine learning.
    • @Kjetilk reiterates the relevance of “royalty” for compensation in closed, non-published models, suggesting it makes sense under copyright law if required, but if not, it could benefit science and the arts.
Open Source AI Definition Town Hall
  • The slides and recording from the town hall meeting held on August 23, 2024 are available here.
  • The next town hall meeting will be held on September 6th. Sign up for the event here.
Categories: FLOSS Research

Members Newsletter – August 2024

Mon, 2024-08-26 11:27
August 2024 Members Newsletter

The lively conversation about the role of data in building and modifying AI systems will continue as the OSI travels to China this month for AI_dev (August 21-23 in Hong Kong) and Open Source Congress (August 25-27 in Beijing). The OSI has been able to chime in on news stories on the topic, several of which are linked here in the newsletter.

Last month the OSI was at the United Nations in New York City for OSPOs for Good, an event that covered key areas of open source policy, as well as emerging examples of ‘Open Source for good’ from across the globe. I participated in a panel on Open Source AI.

Creating an Open Source AI Definition has been an arduous task over the past couple of years, but we know the importance of creating this standard so the freedoms to use, study, share and modify AI systems can be guaranteed. Those are the core tenets of Open Source, and it warrants the dedicated work it has required. Please read about the people who have played key roles in bringing the Definition to life in our Voices of Open Source AI Definition on the blog.

Stefano Maffulli

Executive Director, OSI 

I hold weekly office hours on Fridays with OSI members: book time if you want to chat about OSI’s activities, if you want to volunteer or have suggestions.

News from the OSI

OSI at the United Nations OSPOs for Good

From the Research and Advocacy program

Earlier this month the Open Source Initiative participated in the “OSPOs for Good” event promoted by the United Nations in NYC. Read more.

The Open Source Initiative joins CMU in launching Open Forum for AI: A human-centered approach to AI development

From the Research and Advocacy program

The Open Source Initiative (OSI) is pleased to share that we are joining the founding team of Open Forum for AI (OFAI), an initiative designed by Carnegie Mellon University (CMU). Read more

GUAC adopts license metadata from ClearlyDefined

From the License and Legal program

The software supply chain just gained some transparency thanks to an integration of the Open Source Initiative (OSI) project, ClearlyDefined, into GUAC (Graph for Understanding Artifact Composition), an OpenSSF project from the Linux Foundation. Read more.

Better identifying conda packages with ClearlyDefined

From the License and Legal program

ClearlyDefined now provides a new harvester implementation for conda, a popular package manager with a large collection of pre-built packages for various domains, including data science, machine learning, scientific computing and more. Read more.

OSI in the news

Can AI even be open source? It’s complicated

OSI at ZDNet

AI can’t exist without open source, but the top AI vendors are unwilling to commit to open-sourcing their programs and data sets. To complicate matters further, defining open-source AI is a messy issue that has yet to be settled. Read more.

Open Source AI: What About Data Transparency?

OSI at The New Stack

AI uses both code and data, and this combination continues to be a challenge for open source, said experts at the United Nations OSPOs for Good Conference. Read more.

A new White House report embraces open-source AI

OSI at ZDNet

The National Telecommunications and Information Administration (NTIA) issued a report supporting open-source and open models to promote innovation in AI, while emphasizing the need for vigilant risk monitoring. Read more.

With Open Source Artificial Intelligence, Don’t Forget the Lessons of Open Source Software

OSI at CISA

While there is not yet a consensus on the definition of what constitutes “open source AI”, the Open Source Initiative, which maintains the “Open Source Definition” and a list of approved OSS licenses, has been “driving a multi-stakeholder process to define an ‘Open Source AI’”. Read more.

Meta inches toward open source AI with new LLaMA 3.1

OSI at ZDNet

Is Meta’s 405 billion parameter model really open source? Depends on who you ask. Here’s how to try out the new engine for yourself​. Read more.

Other news

News from OSI affiliates

News from OpenSource.net

Voices of the Open Source AI Definition

The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process.

7th annual OSPO and Open Source Management Survey

The TODO Group and Linux Foundation Research, in partnership with Cisco, NGINX, Open Source Initiative, InnerSource Commons, and CHAOSS, are excited to be launching the 7th annual OSPO and Open Source Management survey! Take the survey here.

2024 Open Source Software Funding Survey

This survey aims to better understand how organizations fund, contribute to, and support open source software projects. It is a collaboration between GitHub, Inc., the Linux Foundation, and researchers from Harvard University. Take the survey here.

Events

Upcoming events

Thanks to our sponsors

New members and renewals
  • Cisco
  • Microsoft
  • Bloomberg
  • SAS
  • Intel

Interested in sponsoring, or partnering with, the OSI? Please see our Sponsorship Prospectus and our Annual Report. We also have a dedicated prospectus for the Deep Dive: Defining Open Source AI. Please contact the OSI to find out more about how your company can promote open source development, communities and software.

Support OSI by becoming a member!

Let’s build a world where knowledge is freely shared, ideas are nurtured, and innovation knows no bounds! 

Join the Open Source Initiative!

Categories: FLOSS Research

Community input drives the new draft of the Open Source AI Definition

Thu, 2024-08-22 08:30

A new version of the Open Source AI Definition has been released with one new feature and a cleaner text, based on comments received from public discussions and recommendations. We’re continuing our march towards having a stable release by the end of October 2024, at All Things Open. Get involved by joining the discussion on the forum, finding OSI staff around the world and online at the weekly town halls. 

New feature: clarified Open Source model and Open Source weights
  • Under “What is Open Source AI,” there is a new paragraph that (1) identifies both models and weights/parameters as encompassed by the word “system” and (2) makes it clear that all components of a larger system have to meet the standard. There is a new sentence in the paragraph after the “share” bullet making this point.
  • Under the heading “Open Source models and Open Source weights,” there is a description of the components for both of those for machine learning systems. We also edited the paragraph below those additions to eliminate some redundancy.
Training data in the preferred form to make modifications

The role of training data is one of the most hotly debated parts of the definition. After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.

Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

Data can be hard to share. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information, such as decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

  • Open training data (data that can be reshared) provides the best way to enable users to study the system, along with the preferred form of making modifications.
  • Public training data (data that others can inspect as long as it remains available) also enables users to study the work, along with the preferred form.
  • Unshareable non-public training data (data that cannot be shared for explainable reasons) gives the ability to study some of the system’s biases and demands a detailed description of the data – what it is, how it was collected, its characteristics, and so on – so that users can understand the biases and categorization underlying the system.

OSI believes these extra requirements for data beyond the preferred form of making modifications to the AI system both advance openness in all the components of the preferred form of modifying the AI system and drive more Open Source AI in private-first areas such as healthcare.
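As an illustration only (the draft Definition imposes no schema, and all names below are hypothetical), the three data categories and their differing disclosure expectations could be sketched like this:

```python
from dataclasses import dataclass
from enum import Enum, auto


class TrainingDataClass(Enum):
    """Hypothetical labels for the three categories described above."""
    OPEN = auto()          # data that can be reshared
    PUBLIC = auto()        # inspectable as long as it remains available
    UNSHAREABLE = auto()   # cannot be shared for explainable reasons


@dataclass
class DataDisclosure:
    """A hypothetical record of what a system discloses about its training data."""
    category: TrainingDataClass
    description: str  # what the data is, how it was collected, its characteristics

    def requires_detailed_description(self) -> bool:
        # Per the draft, unshareable non-public data demands a detailed
        # description so users can understand the system's biases; for
        # open and public data, the data itself can be studied directly.
        return self.category is TrainingDataClass.UNSHAREABLE
```

This is only a reading aid for the taxonomy above, not part of the Definition: the point is that the obligation attached to a dataset varies with how shareable it is.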

Other changes
  • The Checklist is separated into its own document. This is to separate the discussion about how to identify Open Source AI from the establishment of general principles in the Definition. The content of the Checklist has also been fully aligned with the Model Openness Framework (MOF), allowing for an easy overlay.
  • Under “Preferred form to make modifications,” the word “Model” changed to “Weights.” The word “Model”  was referring only to parameters, and was inconsistent with how the word “model” is used in the rest of the document.
  • There is an explicit reference to the intended recipients of the four freedoms: developers, deployers and end users of AI systems.
  • Incorporated credit to the Free Software Definition.
  • Added references to conditions of availability of components, referencing the Open Source Definition.
Next steps
  • Continue iterating through drafts after meeting diverse stakeholders at the worldwide roadshow, collect feedback and carefully look for new arguments in dissenting opinions. 
  • Decide how to best address the reviews of new licenses for datasets, documentation and the agreements governing model parameters.
  • Keep improving the FAQ.
  • Prepare for post-stable-release: Establish a process to review future versions of the Open Source AI Definition.
Collecting input and endorsements

We will be taking draft v.0.0.9 on the road collecting input and endorsements, thanks to a grant by the Sloan Foundation. The lively conversation about the role of data in building and modifying AI systems will continue at multiple conferences around the world, at the weekly town halls, and online throughout the Open Source community.

The first two stops are in Asia: Hong Kong for AI_dev August 21-23, then Beijing for Open Source Congress August 25-27. Other events are planned to take place in Africa, South America, Europe and North America. These are all steps toward the conclusion of the co-design process that will result in the release of the stable version of the Definition in October at All Things Open.

Creating an Open Source AI Definition has been an arduous task over the past two years, but we know the importance of creating this standard so the freedoms to use, study, share and modify AI systems can be guaranteed. Those are the core tenets of Open Source, and it warrants the dedicated work it has required. You can read about the people who have played key roles in bringing the Definition to life in our Voices of Open Source AI Definition on the blog.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the forum: share your comment on the drafts.
  • Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
  • Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
One of the many OSAID workshops organized by the OSI around the world
Categories: FLOSS Research

Mark Collier: Voices of the Open Source AI Definition

Fri, 2024-08-16 10:06

The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Mark Collier

What’s your background related to Open Source and AI?

I’ve worked in Open Source most of my career, over 20 years, and have found it to be one of the greatest, if not the greatest, drivers of economic opportunity. It creates new markets and gives people all over the world access to not only use but to influence the direction of technologies. I started the OpenStack project and then the OpenStack Foundation, and later the Open Infrastructure Foundation. With members of our foundation from over 180 countries, I’ve seen firsthand how Open Source is the most efficient way to drive innovation. You get to crowdsource ideas from people all over the world, who are not just in one company or just in one country. We’ve certainly seen that with infrastructure in the cloud computing/edge computing era. AI is the next generation wave, with people investing literally trillions of dollars in building infrastructure, both physical and the software being written around it. This is another opportunity to embrace Open Source as a path to innovation.

Open Source drives the fastest adoption of new technologies and gives the most people an opportunity to both influence it and benefit economically from it, all over the world. I want to see that pattern repeat in this next era of AI.

What motivated you to join this co-design process to define Open Source AI?

I’m concerned about the potential of there not being credible Open Source alternatives to the big proprietary players in this massive next wave of technology. It will be a bad thing for humanity if we can only get state-of-the-art AI from two or three massive players in one or two countries. In the same way we don’t want to see just one cloud provider or one software vendor, we don’t want any sort of monopoly or oligopoly in AI; That really slows innovation. I wanted to be part of this co-design process because it’s actually not trivial to apply the concept of Open Source to AI. We can carry over the principles and freedoms that underlie Open Source software, like the freedom to use it without restriction and the ability to modify it for different use cases, but an AI system is not just software. A whole debate has been stirred up about whether data needs to be released and published under an Open Source friendly license to be considered Open Source AI. That’s just one consideration of many that I wanted to contribute to. 

We have a very impressive group of people with a lot of diverse backgrounds and opinions working on this. I wanted to be part of the process not because I have the answers, but because I have some perspective and I can learn from all the others. We need to reach a consensus on this because if we don’t, the meaning of Open Source in the AI era will get watered down or potentially just lost altogether, which affects all of Open Source and all of technology.

Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?

The process started as a mailing list of sorts and evolved to more of an online discussion forum. Although it hasn’t always been easy for me to follow along, the folks at OSI that have been shepherding the process have done an excellent job summarizing the threads and bringing key topics to the top. Discussions are happening rapidly in the forum, but also in the press. There are new models released nearly every day it seems, and the bar for what are called Open Source models is causing a lot of noise. It’s a challenge for anybody to keep up but overall I think it’s been a good process.

Why do you think AI should be Open Source?

The more important a technology is to the future of the economy and the more a technology impacts our daily lives, the more critical it is that it be Open Source. For economic and participation reasons, but also for security. We have seen time and time again that transparency and openness breed better security. With more mysterious and complex technologies such as AI, Open Source offers the transparency to help us understand the decisions the technology is making. There have been a number of large players who have been lobbying for more regulation, making it more difficult to have Open Source AI, and I think that shows a very clear conflict of interest.

There is legislation out there that, if it gets passed, poses a real danger to not just Open Source AI, but Open Source in general. We have a real opportunity but also a real risk of there being dangerous concentrations of power if state-of-the-art AI doesn’t fit a standard definition of Open Source. Open Source AI continues to be neck and neck with the proprietary models, which makes me optimistic.

Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?

My personal definition of Open Source AI is not set in stone even having been through this process for over a year. Things are moving so quickly, I think we need to be careful that perfect doesn’t become the enemy of good. Time is of the essence as the mainstream media and the tech press report on models that are trained on billions of dollars worth of hardware, claiming to be Open Source when they clearly are not. I’ve become more willing to compromise on an imperfect definition so we can come to a consensus sooner.

What do you think the primary benefit will be once there is a clear definition of Open Source AI?

All the reasons people love Open Source are inherently the same reasons why people are very tempted to put an Open Source label on their AI: trust, transparency, the ability to modify it and build a business on it, and the assurance that the license won’t be changed. Once we finalize and ratify the definition, we can start broadly using it in practice. This will bring some clarity to the market again. We need to be able to point to something very clear and documented if we’re going to challenge a technology that has been labeled Open Source AI. Legal departments of big companies working on massive AI tools and workloads want to know that their license isn’t going to be pulled out from under them. If the definition upholds the key freedoms people expect from Open Source, it will lead to faster adoption by all.

What do you think are the next steps for the community involved in Open Source AI?

I think Stefano from the OSI has done a wonderful job of trying to hit the conference circuit to share and collect feedback, and virtual participation in the process is still key to keeping it inclusive. I think the next step is building awareness in the press about the definition and market testing it. It’s an iterative process from there. 

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the working groups: be part of a team to evaluate various models against the OSAID.
  • Join the forum: support and comment on the drafts, record your approval or concerns to new and existing threads.
  • Comment on the latest draft: provide feedback on the latest draft document directly.
  • Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Categories: FLOSS Research

Update from the board of directors

Wed, 2024-08-07 12:13

The Chair of the Board of the OSI has acknowledged the resignation offered by Secretary of the Board, Aeva Black. The Chair and the entire Board would like to thank Black for their invaluable contribution to the success of OSI, as well as to the entire Open Source community, and for their service as a board member and officer of the Initiative.

Categories: FLOSS Research
