FLOSS Research
Community input drives the new draft of the Open Source AI Definition
A new version of the Open Source AI Definition has been released with one new feature and a cleaner text, based on comments received from public discussions and recommendations. We’re continuing our march towards having a stable release by the end of October 2024, at All Things Open. Get involved by joining the discussion on the forum, finding OSI staff around the world and online at the weekly town halls.
New feature: clarified Open Source model and Open Source weights
- Under “What is Open Source AI,” there is a new paragraph that (1) identifies both models and weights/parameters as encompassed by the word “system” and (2) makes it clear that all components of a larger system have to meet the standard. There is a new sentence in the paragraph after the “share” bullet making this point.
- Under the heading “Open Source models and Open Source weights,” there is a description of the components for both of those for machine learning systems. We also edited the paragraph below those additions to eliminate some redundancy.
The role of training data is one of the most hotly debated parts of the definition. After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.
Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.
Data can be hard to share. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information, such as decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.
- Open training data (data that can be reshared) provides the best way to enable users to study the system, along with the preferred form of making modifications.
- Public training data (data that others can inspect as long as it remains available) also enables users to study the work, along with the preferred form.
- Unshareable non-public training data (data that cannot be shared for explainable reasons) gives the ability to study some of the system’s biases and demands a detailed description of the data – what it is, how it was collected, its characteristics, and so on – so that users can understand the biases and categorization underlying the system.
OSI believes these extra requirements for data beyond the preferred form of making modifications to the AI system both advance openness in all the components of the preferred form of modifying the AI system and drive more Open Source AI in private-first areas such as healthcare.
Other changes
- The Checklist is separated into its own document. This is to separate the discussion about how to identify Open Source AI from the establishment of general principles in the Definition. The content of the Checklist has also been fully aligned with the Model Openness Framework (MOF), allowing for an easy overlay.
- Under “Preferred form to make modifications,” the word “Model” changed to “Weights.” The word “Model” was referring only to parameters, and was inconsistent with how the word “model” is used in the rest of the document.
- There is an explicit reference to the intended recipients of the four freedoms: developers, deployers and end users of AI systems.
- Incorporated credit to the Free Software Definition.
- Added references to conditions of availability of components, referencing the Open Source Definition.
Next steps
- Continue iterating through drafts after meeting diverse stakeholders at the worldwide roadshow, collecting feedback and carefully looking for new arguments in dissenting opinions.
- Decide how to best address the reviews of new licenses for datasets, documentation and the agreements governing model parameters.
- Keep improving the FAQ.
- Prepare for post-stable-release: Establish a process to review future versions of the Open Source AI Definition.
We will be taking draft v.0.0.9 on the road collecting input and endorsements, thanks to a grant by the Sloan Foundation. The lively conversation about the role of data in building and modifying AI systems will continue at conferences around the world, at the weekly town halls and online throughout the Open Source community.
The first two stops are in Asia: Hong Kong for AI_dev August 21-23, then Beijing for Open Source Congress August 25-27. Other events are planned to take place in Africa, South America, Europe and North America. These are all steps toward the conclusion of the co-design process that will result in the release of the stable version of the Definition in October at All Things Open.
Creating an Open Source AI Definition has been an arduous task over the past two years, but we know the importance of creating this standard so that the freedoms to use, study, share and modify AI systems can be guaranteed. Those are the core tenets of Open Source, and the goal warrants the dedicated work it has required. You can read about the people who have played key roles in bringing the Definition to life in our Voices of the Open Source AI Definition series on the blog.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the forum: share your comment on the drafts.
- Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
- Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Mark Collier: Voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Mark Collier
What’s your background related to Open Source and AI?
I’ve worked in Open Source most of my career, over 20 years, and have found it to be one of the greatest, if not the greatest, drivers of economic opportunity. It creates new markets and gives people all over the world access to not only use but to influence the direction of technologies. I started the OpenStack project and then the OpenStack Foundation, and later the Open Infrastructure Foundation. With members of our foundation from over 180 countries, I’ve seen firsthand how Open Source is the most efficient way to drive innovation. You get to crowdsource ideas from people all over the world, who are not just in one company or just in one country. We’ve certainly seen that with infrastructure in the cloud computing/edge computing era. AI is the next generation wave, with people investing literally trillions of dollars in building infrastructure, both physical and the software being written around it. This is another opportunity to embrace Open Source as a path to innovation.
Open Source drives the fastest adoption of new technologies and gives the most people an opportunity to both influence it and benefit economically from it, all over the world. I want to see that pattern repeat in this next era of AI.
What motivated you to join this co-design process to define Open Source AI?
I’m concerned about the potential of there not being credible Open Source alternatives to the big proprietary players in this massive next wave of technology. It will be a bad thing for humanity if we can only get state-of-the-art AI from two or three massive players in one or two countries. In the same way we don’t want to see just one cloud provider or one software vendor, we don’t want any sort of monopoly or oligopoly in AI; that really slows innovation. I wanted to be part of this co-design process because it’s actually not trivial to apply the concept of Open Source to AI. We can carry over the principles and freedoms that underlie Open Source software, like the freedom to use it without restriction and the ability to modify it for different use cases, but an AI system is not just software. A whole debate has been stirred up about whether data needs to be released and published under an Open Source-friendly license to be considered Open Source AI. That’s just one consideration of many that I wanted to contribute to.
We have a very impressive group of people with a lot of diverse backgrounds and opinions working on this. I wanted to be part of the process not because I have the answers, but because I have some perspective and I can learn from all the others. We need to reach a consensus on this because if we don’t, the meaning of Open Source in the AI era will get watered down or potentially just lost all together, which affects all of Open Source and all of technology.
Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?
The process started as a mailing list of sorts and evolved to more of an online discussion forum. Although it hasn’t always been easy for me to follow along, the folks at OSI that have been shepherding the process have done an excellent job summarizing the threads and bringing key topics to the top. Discussions are happening rapidly in the forum, but also in the press. There are new models released nearly every day it seems, and the bar for what are called Open Source models is causing a lot of noise. It’s a challenge for anybody to keep up but overall I think it’s been a good process.
Why do you think AI should be Open Source?
The more important a technology is to the future of the economy and the more a technology impacts our daily lives, the more critical it is that it be Open Source. For economic and participation reasons, but also for security. We have seen time and time again that transparency and openness breeds better security. With more mysterious and complex technologies such as AI, Open Source offers the transparency to help us understand the decisions the technology is making. There have been a number of large players who have been lobbying for more regulation, making it more difficult to have Open Source AI, and I think that shows a very clear conflict of interest.
There is legislation out there that, if it gets passed, poses a real danger to not just Open Source AI, but Open Source in general. We have a real opportunity but also a real risk of there being conscious concentrations of power if state-of-the-art AI doesn’t fit a standard definition of Open Source. Open Source AI continues to be neck and neck with the proprietary models, which makes me optimistic.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
My personal definition of Open Source AI is not set in stone even having been through this process for over a year. Things are moving so quickly, I think we need to be careful that perfect doesn’t become the enemy of good. Time is of the essence as the mainstream media and the tech press report on models that are trained on billions of dollars worth of hardware, claiming to be Open Source when they clearly are not. I’ve become more willing to compromise on an imperfect definition so we can come to a consensus sooner.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
All the reasons people love Open Source are inherently the same reasons why people are very tempted to put an Open Source label on their AI: trust, transparency, the ability to modify it and build a business on it, and the assurance that the license won’t be changed. Once we finalize and ratify the definition, we can start broadly using it in practice. This will bring some clarity to the market again. We need to be able to point to something very clear and documented if we’re going to challenge a technology that has been labeled Open Source AI. Legal departments of big companies working on massive AI tools and workloads want to know that their license isn’t going to be pulled out from under them. If the definition upholds the key freedoms people expect from Open Source, it will lead to faster adoption by all.
What do you think are the next steps for the community involved in Open Source AI?
I think Stefano from the OSI has done a wonderful job of trying to hit the conference circuit to share and collect feedback, and virtual participation in the process is still key to keeping it inclusive. I think the next step is building awareness in the press about the definition and market testing it. It’s an iterative process from there.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the working groups: be part of a team to evaluate various models against the OSAID.
- Join the forum: support and comment on the drafts, record your approval or concerns to new and existing threads.
- Comment on the latest draft: provide feedback on the latest draft document directly.
- Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
- Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Update from the board of directors
The Chair of the Board of the OSI has acknowledged the resignation offered by Secretary of the Board, Aeva Black. The Chair and the entire Board would like to thank Black for their invaluable contribution to the success of OSI and of the entire Open Source community, and for their service as a board member and officer of the Initiative.
Jean-Pierre Lorre: Voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Jean-Pierre Lorre
What’s your background related to Open Source and AI?
I’ve been using Open Source technologies since the very beginning of my career and have been directly involved in Open Source projects for around 20 years.
I graduated in artificial intelligence engineering in 1985. Since then I have worked in a number of applied AI research structures in fields such as medical image processing, industrial plant supervision, speech recognition and natural language processing. My knowledge covers both symbolic AI methods and techniques and deep learning.
I currently lead a team of around fifteen AI researchers at LINAGORA, an Open Source company.
What motivated you to join this co-design process to define Open Source AI?
The team I lead is heavily involved in the development of LLM generative models, which we want to distribute under an open license. I realized that the term Open Source AI was not defined and that the definition we had at LINAGORA was not the same as the one adopted by our competitors.
As the OSI is the leading organization for defining Open Source and there was a project underway to define the term Open Source AI, I decided to join it.
Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?
I participated in two ways: firstly, to provide input for the definition currently being drafted; and secondly, to evaluate LLM models with regard to the definition (I contributed to Bloom, Falcon and Mistral).
For the first item, my main difficulty was keeping up with the meandering discussions, which were very active. I didn’t manage to do so completely, but I was able to appreciate the summaries provided from time to time, which enabled me to follow the overall thread.
The second difficulty concerns the evaluation of the models: the aim of the exercise was to evaluate the consistency of OSAID version 0.8 on models that currently claim to be “Open Source.” Implementing the definition involves looking for information that is sometimes non-existent and sometimes difficult to find.
Why do you think AI should be Open Source?
Artificial intelligence models are expected to play a very important role in our professional lives, but also in our everyday lives. In this respect, the need for transparency is essential to enable people to check the properties of the models. They must also be accessible to as many people as possible, to avoid widening the inequalities between those who have the means to develop them and those who will remain on the sidelines of this innovation. Similarly, it should be possible to adapt them for different uses without the need for authorization.
The Open Source approach makes it possible to create a community such as the one created by LINAGORA, OpenLLM-Europe. This is a way for small players to come together to build the critical mass needed not only to develop models but also to disseminate them. Such an approach, which may be compared to that associated with the digital commons, is a guarantee of sovereignty because it allows knowledge and governance to be shared.
In short, they are the fruit of work based on data collected from as many people as possible, so they must remain accessible to as wide an audience as possible.
What do you think is the role of data in Open Source AI?
Data provides the basis for training models. It is therefore the pool of information from which the knowledge displayed by the model, and the applications deduced from it, will be drawn. In the case of an open model, disseminating as many elements as possible to qualify this data is a means of transparency that facilitates the study of the model’s properties; indeed, this data is likely to include biases around culture, gender, ethnic origin, skin color, etc. It also makes it easier to modify the model and its outputs.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
Yes, we initially thought that the provision of training data was a sine qua non condition for the design of truly Open Source models. Our basic assumption was that the model may be seen as a work derived from the data and that therefore the license assigned to the data, in particular the non-commercial nature, had an impact on the license of the model. As the discussions progressed, we realized that this condition was very restrictive and severely limited the possibility of developing models.
Our current analysis is that the condition defined in version 0.8 of the OSAID is sufficient to provide the necessary guarantees of transparency for the four freedoms, in particular the freedom to study the model, which underpins the question of access to data. With regard to the data, it stipulates that “sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data” must be provided. Even if we can agree that this condition seems difficult to satisfy without providing the datasets, other avenues may be envisaged, in particular the provision of synthetic data. This information should make it possible to carry out almost all studies of the model.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
Having such a definition with clear, implementable rules will provide model suppliers with a concrete framework for producing models that comply with the ethics of the Open Source movement.
A collateral effect will be to help sort out the “wheat from the chaff.” In particular, to detect attempts at “Open Source washing.” This definition is therefore a structuring element for a company such as LINAGORA, which wants to build a sustainable business model around the provision of value-added AI services.
It should also be noted that such a definition is necessary for regulations such as the European AI Act, which defines exceptions for Open Source generative models. Such legislation cannot rest on a fuzzy definition.
What do you think are the next steps for the community involved in Open Source AI?
The next steps that need to be addressed by the community concern firstly the definition of a certification process that will formalize the conformity of a model; this process may be accompanied by tools to automate it.
In a second phase, it may also be useful to provide templates of AI models that comply with the definition, as well as best practice guides, which would help model designers.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the working groups: be part of a team to evaluate various models against the OSAID.
- Join the forum: support and comment on the drafts, record your approval or concerns to new and existing threads.
- Comment on the latest draft: provide feedback on the latest draft document directly.
- Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
- Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
GUAC adopts license metadata from ClearlyDefined
The software supply chain just gained some transparency thanks to an integration of the Open Source Initiative (OSI) project, ClearlyDefined, into GUAC (Graph for Understanding Artifact Composition), an OpenSSF project from the Linux Foundation. GUAC provides a comprehensive mapping of software packages, dependencies, vulnerabilities, attestations, and more, allowing organizations to achieve better compliance and security of their software supply chain.
GUAC offers the full view of the supply chain
Software supply chain attacks are on the rise. Many tools are available to help generate software bills of materials (SBOMs), signed attestations and vulnerability reports, but they stop there, leaving users to figure out how they all fit together. GUAC provides an aggregated, queryable view across the whole software supply chain, not just one SBOM at a time.
GUAC is for developers, operations and security practitioners who need to identify and address problems in their software supply chain, including proactively managing dependencies and responding to vulnerabilities. GUAC provides supply chain observability with a graph view of the software supply chain and tools for performing queries to gain actionable insights.
GUAC enhanced with ClearlyDefined integration
The latest version of GUAC (v0.8.0) now provides support for ClearlyDefined. GUAC will query the ClearlyDefined license metadata store to discover license information for packages, even when the SBOM does not include that information.
A ClearlyDefined certifier will listen on the collector-subscriber for any pkg/src strings, convert them to ClearlyDefined coordinates, and query the API service for the definition. The user agent will match existing outgoing GUAC requests: GUAC/<version> (e.g. GUAC/v0.1.0).
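As a rough sketch of the coordinate-conversion step, the snippet below maps a package to the ClearlyDefined coordinate scheme (type/provider/namespace/name/revision) and builds the definitions API URL. The ecosystem-to-provider table and function names are illustrative assumptions for this post, not GUAC’s actual implementation (GUAC itself is written in Go):

```python
# Illustrative sketch (not GUAC's Go implementation): map a package to
# ClearlyDefined coordinates and build the definitions endpoint URL.

def to_coordinates(ecosystem, namespace, name, version):
    """Build a ClearlyDefined coordinate string of the form
    type/provider/namespace/name/revision ('-' for an empty namespace)."""
    # Hypothetical mapping of a few ecosystems to (type, provider) pairs.
    type_provider = {
        "npm": ("npm", "npmjs"),
        "pypi": ("pypi", "pypi"),
        "maven": ("maven", "mavencentral"),
    }
    ctype, provider = type_provider[ecosystem]
    return f"{ctype}/{provider}/{namespace or '-'}/{name}/{version}"

def definition_url(coordinates):
    """URL of the ClearlyDefined definitions endpoint for one component."""
    return f"https://api.clearlydefined.io/definitions/{coordinates}"
```

For example, `to_coordinates("npm", None, "lodash", "4.17.21")` yields `npm/npmjs/-/lodash/4.17.21`; a real certifier would also set the GUAC/<version> user agent on the outgoing request.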
A CertifyLegal node will be created from the “licensed” → “declared” field of the definition. The expression will be copied, and any license identifiers found will result in linked License noun nodes, created if needed. The type will be “declared”, the justification “Retrieved from ClearlyDefined”, and the time will be when the information was retrieved from the API.
Similarly, a node will be created from the “licensed” → “facets” → “core” → “discovered” → “expressions” field. Multiple expressions will be AND-ed together. The type will be “discovered”, with the other fields the same (time, justification, License links, etc.).
The “licensed” → “facets” → “core” → “attribution” → “parties” array will be concatenated and stored in the Attribution field on CertifyLegal.
Optionally, “described” “sourceLocation” can be used to create a HasSourceAt GUAC node.
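The field paths above can be sketched in a few lines; the snippet below is a hedged Python illustration of how the declared and discovered license evidence might be pulled out of a ClearlyDefined definition. The record shape and function name are hypothetical, not GUAC’s actual CertifyLegal schema:

```python
# Illustrative only: extract declared/discovered license evidence from a
# ClearlyDefined definition dict, following the field paths in the text.

def extract_license_evidence(definition, retrieved_at):
    licensed = definition.get("licensed", {})
    core = licensed.get("facets", {}).get("core", {})
    records = []

    # "licensed" -> "declared": the declared license expression.
    declared = licensed.get("declared")
    if declared:
        records.append({
            "type": "declared",
            "expression": declared,
            "justification": "Retrieved from ClearlyDefined",
            "time": retrieved_at,
        })

    # "licensed" -> "facets" -> "core" -> "discovered" -> "expressions":
    # multiple discovered expressions are AND-ed together.
    expressions = core.get("discovered", {}).get("expressions", [])
    if expressions:
        records.append({
            "type": "discovered",
            "expression": " AND ".join(expressions),
            "justification": "Retrieved from ClearlyDefined",
            "time": retrieved_at,
        })

    # "attribution" -> "parties": concatenated for the Attribution field.
    parties = core.get("attribution", {}).get("parties", [])
    return records, "; ".join(parties)
```

Running this over a definition with a declared MIT license and discovered MIT and Apache-2.0 expressions would produce one “declared” record and one “discovered” record with the expression “MIT AND Apache-2.0”.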
Thanks to the community
Although licenses don’t directly impact security, they are an important part of understanding the software supply chain. We would like to thank Parth Patel (Kusari), Jeff Mendoza (Kusari), Ben Cotton (Kusari), and Qing Tomlinson (SAP) for their support in getting this feature implemented in GUAC. The ClearlyDefined community looks forward to working together with the GUAC community to help organizations worldwide better achieve compliance and security of their software supply chains.
Deshni Govender: Voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Deshni Govender
What’s your background related to Open Source and AI?
I am the South Africa country focal point for the German Development Cooperation initiative “FAIR Forward – Artificial Intelligence for All,” a project that strives for a more open, inclusive and sustainable approach to AI on an international level. More significantly, we seek to democratize the field of AI, to enable more robust, inclusive and self-determined AI ecosystems. Having worked in the private sector and now being in international development, my attention has been drawn to the power imbalance between proprietary and open approaches, and how this results in economic barriers for the global majority while also creating further harms and challenges for vulnerable populations and marginalized sectors, especially women. This fuelled my journey of working towards bridging the digital divide and digital gender gap through democratizing technology.
Some projects I am working on in this space include developing data governance models for African NLP (with Masakhane Foundation) and piloting new community-centered, equitable license types for voice data collection for language communities (with Mozilla).
What motivated you to join this co-design process to define Open Source AI?
I have experienced firsthand the power imbalances that exist in geopolitics, but also in the context of economics, where global minority countries shape the ‘global trajectory’ of AI without global voices. The definition of open means different things to different people / ecosystems / communities, and all voices should be heard and considered. Defining open means the values and responsibilities attached to it should be considered in a diverse manner; otherwise the concept of ‘open’ is in and of itself a hypocrisy.
Why do you think AI should be Open Source?
An enabling ecosystem is one that benefits all the stakeholders and ecosystem components. Inclusive efforts must be made to explore and find tangible actions or potential avenues to reconcile the tension between openness, democracy and representation in AI training data whilst preserving community agency, diverse values and stakeholder rights. However, the misuse, colonization and misinterpretation of data continues unabated. Much of African culture and knowledge is passed down through generations by storytelling, art, dance and poetry, verbally or through different ways of documentation, and in local manners and nuances of language. It is rarely digitized and certainly not in English. Language is culture and culture is context, yet somehow we find LLMs being used as an agent for language and context. Solutions and information are provided about and for communities but not with those communities, and the lack of transparency and post-colonial manipulation of data and culture is both irresponsible and should be considered a human rights violation.
Additionally, Open Source and open systems enable nations to develop inclusive AI policy processes, so that policymakers from Global South countries can draw from peer experience on tackling their AI policies and AI-related challenges to find their own approaches to AI policy. This will also challenge dependence on, and domination by, Western-centric / Global North countries whose AI policies push a narrative or agenda of ‘what’ and ‘how’; i.e. Africa / Asia / LATAM must learn from us how to do X (since we hold the power, we can determine the extent and cost – exploitative). We aim for government self-determination and to empower countries, so that they may collectively have a voice on the global stage.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
My personal definition has not changed but it has been refreshing to witness the diverse views on how open is defined. The idea that behavior (e.g. of tech oligopolies) could reshape the way we define an idea or concept was thought-provoking. It means therefore that as emerging technology evolves, the idea of ‘open’ could change still in the future, depending on the trajectory of emerging technology and the values that society holds and attributes.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
A clear and more inclusive definition of Open Source AI would commence a wave towards making data injustice, data invisibility, data extractivism and data colonialism more visible, with real repercussions. It would spur open, inclusive and responsible repositories of data and data use, and more importantly accuracy of use and interpretation. I am hoping this would also spur innovative ways to track and monitor / evaluate the use of Open Source data, so that local and small businesses are encouraged to develop in the open while still being able to track and monitor players who extract and commercialize without giving back.
Ideally it would begin the process (albeit transitional) of bridging the digital divide between source and resource countries (i.e. global majority where data is collected from versus those who receive and process data for commercial benefit).
What do you think are the next steps for the community involved in Open Source AI?
If we make everything Open Source, it encourages sharing and use in development and deployment and offers transparency and shared learning, but it enables freeriding. The corollary is that closed approaches such as copyright prioritize proprietary information and commercialisation but can limit shared innovation and do not uphold the concepts of communal effort, community agency and development. How do we quell this tension? I would like to see the Open Source community working to find practical and actionable ways to make this work (open, responsible and innovative, but enabling community benefit / remuneration).
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the working groups: be part of a team to evaluate various models against the OSAID.
- Join the forum: support and comment on the drafts, record your approval or concerns to new and existing threads.
- Comment on the latest draft: provide feedback on the latest draft document directly.
- Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
- Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Hailey Schoelkopf: Voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Hailey Schoelkopf
What's your background related to Open Source and AI?
One of the main reasons I was able to get more deeply involved in AI research was through open research communities such as the BigScience Workshop and EleutherAI, where discussions and collaboration were open to outsiders. These opportunities to share knowledge and learn from others more experienced than me were crucial to learning about the field and growing as a practitioner and researcher.
I co-led the training of the Pythia language models (https://arxiv.org/abs/2304.01373), some of the first fully documented and reproducible large-scale language models, released Open Source with as many related artifacts as possible. We were happy and lucky to see these models fill a clear need, especially in the research community, where Pythia has since contributed to a large number of studies attempting to build our understanding of LLMs, including interpreting their internals, understanding the process by which these models improve over training, and disentangling some of the effects of the dataset contents on these models' downstream behavior.
What motivated you to join this co-design process to define Open Source AI?
There has been a significant amount of confusion induced by the fact that not all 'open-weights' AI models are released under OSI-compliant licenses, and many impose restrictions on their usage or adaptation, so I was excited that OSI was working on reducing this confusion by producing a clear definition that could be used by the Open Source community. I joined the process more directly by helping discuss how the Open Source AI Definition could be mapped onto the Pythia language models and the accompanying artifacts we released.
Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?
Deciding what counts as sufficient transparency and modifiability to be Open Source was an interesting problem. Public model weights are very beneficial to the Open Source community, but releasing weights without enough detail about the model and its development process can prevent others from making modifications or understanding the reasons behind its design and resulting characteristics, and can keep the full benefits of a completely Open Source model from being realized.
Why do you think AI should be Open Source?
There are clear advantages to having models that are Open Source. Access to such fully documented models can help a much broader group of people, trained researchers and many others alike, who can use, study, and examine these models for their own purposes. While not every model should be made Open Source under all conditions, wider scrutiny and study of these models can help increase our understanding of AI systems' behavior, raise societal preparedness and awareness of AI capabilities, and improve these models' safety by allowing more people to understand them and explore their flaws.
With the Pythia language models, we’ve seen many researchers explore questions around the safety and biases of these models, including a breadth of questions we’d not have been able to study ourselves, or many that we could not even anticipate. These different perspectives are a crucial component in making AI systems safer and more broadly beneficial.
What do you think is the role of data in Open Source AI?
Data is a crucial component of AI systems. Transparency around (and, potentially, open release of) training datasets can enable a wide range of extended benefits to researchers, practitioners, and society at large. I think that for a model to be truly Open Source, and to derive the greatest benefits from its openness, information on training data must be shared transparently. This information also allows members of the Open Source community to avoid independently replicating each other's work. Transparent sharing of motivations and findings with respect to dataset creation choices can improve the community's collective understanding of system and dataset design for the future and minimize overlapping, wasted effort.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
An interesting perspective that I've grown to appreciate is that the Open Source AI Definition includes public and Open Source licensed training and inference code. Actually making one's Open Source AI model effectively usable by the community and practitioners is a crucial step in promoting transparency, though it is not discussed often enough.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
Having a clear definition of Open Source AI can make it clearer where existing "open" systems fall, and potentially encourage future open-weights models to be released with more transparency. Many current open-weights models are shared under bespoke licenses with terms not compliant with Open Source principles; this creates legal uncertainty and makes it less likely that a new open-weights model release will benefit practitioners at large or contribute to a better understanding of how to design better systems. I hope that a clearer Open Source AI Definition will make it easier to draw these lines and encourage those currently releasing open-weights models to do so in a way that more closely fits the Open Source AI standard.
What do you think are the next steps for the community involved in Open Source AI?
An exciting future direction for the Open Source AI research community is to explore methods for greater control over AI model behavior: approaches to collective modification and collaborative development of AI systems that can adapt and be "patched" over time. A stronger understanding of how to properly evaluate these systems for capabilities, robustness, and safety will also be crucial. I hope to see the community direct greater attention to evaluation in the future as well.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the working groups: be part of a team to evaluate various models against the OSAID.
- Join the forum: support and comment on the drafts, record your approval or concerns to new and existing threads.
- Comment on the latest draft: provide feedback on the latest draft document directly.
- Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
- Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
OSI at the United Nations OSPOs for Good
Earlier this month the Open Source Initiative participated in the “OSPOs for Good” event promoted by the United Nations in NYC. Stefano Maffulli, the Executive Director of the OSI, participated in a panel moderated by Mehdi Snene about Open Source AI alongside distinguished speakers Ashley Kramer, Craig Ramlal, Sasha Luccioni, and Sergio Gago. Please find below a transcript of Stefano’s presentation.
Mehdi Snene
What is Open Source in AI? What does it mean? What are the foundational pieces? How far along is the data? There is mention of weights, and data skills. How can we truly understand what Open Source in AI is? Today, joining us, we’ll have someone who can help us understand what Open Source in AI means and where we are heading. Stefano, can you offer your insights?
Stefano Maffulli
Thanks. We have some thoughts on this. We’ve been pondering these questions since they first emerged when GPT started to appear. We asked ourselves: How do we transfer the principles of permissionless innovation and the immense value created by the Open Source ecosystem into the AI space?
After a little over two years of research and global conversations with multiple stakeholders, we identified three key elements. Firstly, permissionless innovation needs to be ported to AI, but this is complex and must be broken down into smaller components.
We realized that, as developers, users, and deployers of AI systems, we need to understand how these systems are built. This involves studying all components carefully, being able to run them for any purpose without asking for permission (a basic tenet of Open Source), and modifying them to change outputs based on the same inputs. These basic principles include being able to share these modifications with others.
To achieve this, you need data, the code used for training and cleaning the data (e.g., removing duplicates), the parameters, the weights, and a way to run inference on those weights. It’s fairly straightforward. However, the challenge lies in the legal framework.
Now, the complicated piece is how Open Source software has had a very wonderful run, based on the fact that the legal framework that governs Open Source is fairly simple and globally accepted. It’s built on copyright, a system that has worked wonderfully in both ways. It gives exclusive rights to the content creators, but also the same mechanism can be used to grant rights to anyone who receives the creation.
With data, we don’t have that mechanism. That is a very simple and dramatic realization. When we talk about data, we should pay attention to what kind of data we’re discussing. There is data as content created, and there is data as facts; like fires, speed limits, or traces of a road. Those are facts, and they have different ways of being treated. There is also private data, personal information, and various other kinds of data, each with different rules and regulations around the world.
Governments’ major role in the future will be to facilitate permissionless innovation in data by harmonizing these rules. This will level the playing field, where currently larger corporations have significantly more power than Open Source developers or those wishing to create large language models. Governments should help create datasets, remove barriers, and facilitate access for academia, smaller developers, and the global south.
Mehdi Snene
We already have open data and Open Source. Now, we need to create open AI and open models. Are we bringing these two domains together and keeping them separate, or are we creating something new from scratch when we talk about open AI?
Stefano Maffulli
This is a very interesting and powerful question. I believe that open data as a movement has been around for quite a while. However, it’s only recently that data scientists have truly realized the value they hold in their hands. Data is fungible and can be used to build new things that are completely different from their original domains.
We need to talk more about this and establish platforms for better interaction. One striking example is a popular dataset of images used for training many image generation AI tools, which contained child sexual abuse images for many years. A research paper highlighted this huge problem, but no one filed a bug report, and there was no easy way for the maintainers of this dataset to notice and remove those images.
There are things that the software world understands very well, and things that data scientists understand very well. We are starting to see the need for more space for interactions and learning from each other.
The conversation is extremely complicated. Alex and I have had long discussions about this. I don’t want to focus entirely on this, but I do want to say that Open Source has never been about pleasing companies or specific stakeholders. We need to think of it as an ecosystem where the balances of power are maintained.
While Open Source software and Open Source AI are still evolving, the necessary ingredients—data, code, and other components—are there. However, the data piece still needs to be debated and finalized. Pushing for radical openness with data has clear drawbacks and issues. It’s going to be a balance of intentions, aiming for the best outcome for the general public and the whole ecosystem.
Mehdi Snene
Thank you so much. My next question is about the future. What are your thoughts on the next big technology?
Stefano Maffulli
From the perspective of open innovation, it’s about what’s going to give society control over technology. The focus of Open Source has always been to enable developers and end-users to have sovereignty over the technology they use. Whether it’s quantum computers, AI, or future technologies, maintaining that control is crucial.
Governments need to play a role in enabling innovation and ensuring that no single power becomes too dominant. The balance between the private sector, public sector, nonprofit sector, and the often-overlooked fourth sector—which includes developers and creators who work for the public good rather than for profit—must be maintained. This balance is essential for fostering an ecosystem where all stakeholders have equal interests and influence.
If you would like to listen to the panel discussion in its entirety, you can do so here (the Open Source AI panel starts at 1:00:00 approximately).
Better identifying conda packages with ClearlyDefined
ClearlyDefined, an Open Source project that helps organizations with supply chain compliance, now provides a new harvester implementation for conda, a popular package manager with a large collection of pre-built packages for various domains, including data science, machine learning, scientific computing and more.
Conda provides package, dependency and environment management for any language and is very popular with Python and R. It allows users to manage and control the dependencies and versions of packages specific to each project, ensuring reproducibility and avoiding conflicts between different software requirements.
ClearlyDefined crawls both the main conda package and its source code for licensing metadata. The main conda package is hosted on the conda channels themselves and contains all the licensing information, compilers, environment configuration scripts and dependencies needed to make the package work. The source code from which the conda package is built is often hosted on an external website such as GitHub.
The conda crawler uses the following coordinates:
- type (required): conda or condasource
- provider (required): channel on which the package will be crawled, such as conda-forge, anaconda-main or anaconda-r
- namespace (optional): architecture and OS of the package to be crawled, e.g. win64 or linux-aarch64, or any if no architecture is specified.
- package name (required): name of the package
- revision (optional): package version and optional build version
For example, the popular numpy package is represented as shown below.
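The original post's coordinate example did not survive extraction. As an illustrative sketch only (the channel, architecture and version below are hypothetical, not taken from the post), the coordinates listed above could be assembled like this:

```python
# Build a ClearlyDefined coordinate string of the form
# type/provider/namespace/name/revision for a conda package.
# All concrete values here are illustrative assumptions.

def conda_coordinates(pkg_type, provider, name, namespace="any", revision=None):
    parts = [pkg_type, provider, namespace, name]
    if revision:
        parts.append(revision)
    return "/".join(parts)

# A hypothetical numpy build from the conda-forge channel:
print(conda_coordinates("conda", "conda-forge", "numpy",
                        namespace="linux-aarch64", revision="1.23.5"))
# conda/conda-forge/linux-aarch64/numpy/1.23.5
```

Omitting the optional namespace and revision yields the shorter form `conda/conda-forge/any/numpy`, matching the rule that `any` stands in when no architecture is specified.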
With the increased importance of data science, machine learning and scientific computing, this support for conda packages in ClearlyDefined is extremely important. It will allow organizations to better manage the licenses of their conda packages for compliance. This work was led by Basit Ayantunde from Codethink under the stewardship of Qing Tomlinson from SAP. We would like to thank them and all those involved in the development and testing of this implementation.
Cailean Osborne: voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process. Today, we are featuring Cailean Osborne, one of the volunteers who have helped shape and are shaping the OSAID.
Question: What's your background related to Open Source and AI?
My interest in Open Source AI began around 2020 when I was working in AI policy at the UK Government. I was surprised that Open Source never came up in policy discussions, given its crucial role in AI R&D. Having been a regular user of libraries like scikit-learn and PyTorch in my previous studies, I followed Open Source AI trends in my own time and eventually decided to do a PhD on the topic. When I started my PhD back in 2021, Open Source AI still felt like a niche topic, so it's been exciting to watch it become a major talking point over the years.
Beyond my PhD, I've been involved in the Open Source AI community as a contributor to scikit-learn and as a co-developer of the Model Openness Framework (MOF) with peers from the Generative AI Commons community. Our goal with the MOF is to provide guidance for AI researchers and developers to evaluate the completeness and openness of "Open Source" models based on open science principles. We were chuffed that the OSI team chose to use the 16 components from the MOF as the rubric for reviewing models in the co-design process.
Question: What motivated you to join this co-design process to define Open Source AI?
The short answer is: to contribute to establishing an accurate definition for "Open Source AI" and to learn from all the other experts involved in the co-design process. The longer answer is: There's been a lot of confusion about what is or is not "Open Source AI," which hasn't been helped by open-washing. "Open source" has a specific definition (i.e. the right to use, study, modify, and redistribute source code) and what is being promoted as "Open Source AI" deviates significantly from this definition. Rather than being pedantic, getting the definition right matters for several reasons; for example, for the "Open Source" exemptions in the EU AI Act to work (or not work), we need to know precisely what "Open Source" models actually are. Andreas Liesenfeld and Mark Dingemanse have written a great piece about the issues of open-washing and how they relate to the AI Act, which I recommend reading if you haven't yet. So, I got involved to help develop a definition and to learn from all the other experts involved. It hasn't been easy (it's a pretty divisive topic!), but I think we've made good progress.
Question: Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?
First off, I have to give credit to Stef and Mer for maintaining momentum throughout the process. Coordinating a co-design effort with volunteers scattered around the globe, each with varying levels of availability and (strong) opinions on the matter, is no small feat. So, well done! I also enjoyed seeing how others agreed or disagreed when reviewing models. The moments of disagreement were the most interesting; for example, about whether training data should be available versus documented and if so, in how much detail… Personally, the main challenge was searching for information about the various components of models that were apparently "Open Source" and observing how little information was actually provided beyond weights, a model card, and if you're lucky an arXiv preprint or technical report.
Question: Why do you think AI should be Open Source?
When talking about the benefits of Open Source AI, I like to point folks to a 2007 paper, in which 16 researchers highlighted "The Need for Open Source Software in Machine Learning" due to the basically complete lack of OSS for ML/AI at the time. Fast forward to today: AI R&D is practically unthinkable without OSS, from data tooling to the deep learning frameworks used to build LLMs. Open source and openness in general have many benefits for AI, from enabling access to SOTA AI technologies, to transparency, which is key for reproducibility, scrutiny, and accountability, to widening participation in their design, development, and governance.
Question: What do you think is the role of data in Open Source AI?
If the question is strictly about the role of data in developing open AI models, the answer is pretty simple: Data plays a crucial role because it is needed for training, testing, aligning, and auditing models. But if the question is asking "should the release of data be a condition for an open model to qualify as Open Source AI," then the answer is obviously much more complicated.
Companies are in no rush to share training data due to a handful of reasons: be it competitive advantage, data protection, or frankly being sued for copyright infringement. The copyright concern isn’t limited to companies: EleutherAI has also been sued and had to take down the Books3 dataset from The Pile. There are also many social and cultural concerns that restrict data sharing; for example, the Kōrero Kaitiakitanga license has been developed to protect the interests of indigenous communities in New Zealand. So, the data question isn’t easy and perhaps we shouldn’t be too dogmatic about it.
Personally, I think the compromise in v. 0.0.8, which states that model developers should provide sufficiently detailed information about data if they can’t release the training dataset itself, is a reasonable halfway house. I also hope to see more open pre-training datasets like the one developed by the community-driven BigScience Project, which involved open deliberation about the design of the dataset and provides extensive documentation about data provenance and processing decisions (e.g. check out their Data Catalogue). The FineWeb dataset by Hugging Face is another good example of an open pre-training dataset, which they released with pre-processing code, evaluation results, and super detailed documentation.
Question: Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
To be honest, my personal definition hasn't changed much. I am not a big fan of the use of "Open Source AI" when folks specifically mean "open models" or "open-weight models". What we need to do is raise awareness about appropriate terminology and point out "open-washing", as people have done, and I must say that subjectively I've seen improvements: less "Open Source models" and more "open models". But I will say that I do find "Open Source AI" a useful umbrella term for the various communities of practice that intertwine in the development of open models, including OSS, open data, and AI researchers and developers, who all bring different perspectives and ways of working to the overarching "Open Source AI" community.
Question: What do you think the primary benefit will be once there is a clear definition of Open Source AI?
We'll be able to reduce confusion about what is or isn't "Open Source AI" and more easily combat open-washing efforts. As I mentioned before, this clarity will be beneficial for compliance with regulations like the AI Act which includes exemptions for "Open Source" AI.
Question: What do you think are the next steps for the community involved in Open Source AI?
We still have many steps to take but I'll share three for now.
First, we urgently need to improve the auditability and therefore the safety of open models. With OSS, we know that (1) the availability of source code and (2) open development enable the distributed scrutiny of source code. Think Linus's Law: "Given enough eyeballs, all bugs are shallow." Yet open models are more complex than just source code, and the lack of openness of many key components like training data is holding back adoption because would-be adopters can't adequately run due diligence tests on the models. If we want to realise the benefits of "Open Source AI," we need to figure out how to increase the transparency and openness of models; we hope the Model Openness Framework can help with this.
Second, I'm really excited about grassroots initiatives that are leading community-driven approaches to developing open models and open datasets, like the BigScience project. They're setting an example of how to do "Open Source AI" in a way that promotes open collaboration, transparency, reproducibility, and safety from the ground up. I can still count such initiatives on my fingers, but I am hopeful that we will see more community-driven efforts in the future.
Third, I hope to see the public sector and non-profit foundations get more involved in supporting public interest and grassroots initiatives. France has been a role model on this front: providing a public grant to train the BigScience project’s BLOOM model on the Jean Zay supercomputer, as well as funding the scikit-learn team to build out a data science commons.
The Open Source Initiative joins CMU in launching Open Forum for AI: A human-centered approach to AI development
The Open Source Initiative (OSI) is pleased to share that we are joining the founding team of Open Forum for AI (OFAI), an initiative designed by Carnegie Mellon University (CMU) to foster a human-centered approach to artificial intelligence. OFAI aims to enhance our understanding of AI and its potential to augment human capabilities while promoting responsible development practices.
The missions of OSI and OFAI are well-aligned; at the heart of OFAI is a commitment to ensuring that AI development serves the public interest. With the support of renowned partners like Omidyar Network, NobleReach Foundation, and internal CMU funding, OFAI is positioned to serve as a pivotal platform for shaping AI strategies and policies that prioritize safety, privacy, and equity.
The OSI is proud to be part of this project. Stefano Maffulli and Deb Bryant from the OSI will participate in OFAI, integrating their efforts toward a standard Open Source AI Definition through a collaborative process involving stakeholders from the Open Source community, industry, and academia, as well as their contributions to public policy.
A collective effortThe success of OFAI hinges on the diverse expertise it convenes. Leading this initiative is Sayeed Choudhury, Associate Dean for Digital Infrastructure at CMU and a member of the OSI Board. Alongside him, a team of CMU faculty members and external advisors will contribute knowledge in ethics, computational technologies, and inclusive AI research.
Notable participants like Michele Jawando from Omidyar Network and Arun Gupta from NobleReach Foundation have emphasized the importance of Open Source AI in driving innovation and inclusivity as well as the need for a human-centered, trust-based approach to AI development.
OFAI's ambitious goals
OFAI aims to influence AI policy by coordinating research and policy objectives and advocating for transparent and inclusive AI development. The initiative will focus on five key areas:
- Research
- Technical prototypes
- Policy recommendations
- Community engagement
- Talent for service
Deb Bryant will lead Community Engagement, building in part upon the broad community of interest gathered through the public process of OSI’s Defining Open Source AI.
One of OFAI’s foundational projects is the creation of an “Openness in AI” framework, which seeks to make AI development more transparent and inclusive. This framework will serve as a vital resource for policymakers, researchers, and the broader community.
Looking ahead
With the OSI set to deliver a stable version of the Open Source AI Definition at All Things Open in October, the launch of OFAI magnifies the importance of this work to bring together diverse stakeholders to ensure AI technologies align with societal values and public interests.
Open Source AI Definition – Weekly update July 15
It has been quiet on the forums over the 4th of July weekend, and OSI has been speaking at different events:
- @stefano spoke in a panel at the UN event OSPOs for Good. Access the recording here.
- @mer is speaking at Open Source Community Africa
- OSI was present at the Linux Foundation hosted AI_dev: Open Source GenAI & ML Summit Europe 2024. Read about the takeaways here.
- @jberkus expresses concern about the extensive resources required to certify AI systems, estimating that it would take weeks of work per system. This scale makes it impractical for a volunteer committee like License Review.
- @shujisado reflects on past controversies over license conformity, noting that Open Source AI has the potential for a greater economic impact than early Open Source. He acknowledges the need for a more robust certification process given this increased significance. He suggests that cooperation from the machine learning community or consortia might be necessary to address technical issues and monitor the certification process neutrally. He offers to help spread the word about OSAID within the Japanese ML/LLM development community.
- @jberkus clarifies that the OSI would need full-time paid staff to handle the certifications, as the work cannot be managed by volunteers alone.
Mer Joyce: voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process. We’ll be featuring the voices of the volunteers who have helped shape and are shaping the Definition.
The OSI started researching the topic in 2022, and in 2023 began the co-design process of a new definition of Open Source that applies to AI. The OSI hired Mer Joyce, founder and principal of Do Big Good, as an independent consultant to lead the co-design process. She has worked for over a decade at the intersection of research, policy, innovation and social change.
Mer Joyce, process facilitator for the Open Source AI Definition
About co-design
Co-design, also called participatory or human-centered design, is a set of creative methods used to solve communal problems by sharing knowledge and power. The co-design methodology addresses the challenges of reaching an agreed definition within a diverse community (Costanza-Chock, 2020; Escobar, 2018; Creative Reaction Lab, 2018; Friedman et al., 2019).
As noted in MIT Technology Review’s article about the OSAID, “[t]he open-source community is a big tent… encompassing everything from hacktivists to Fortune 500 companies…. With so many competing interests to consider, finding a solution that satisfies everyone while ensuring that the biggest companies play along is no easy task.” (Gent, 2024).
The co-design method allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support from such a significant and broad group of people also creates a tension to be managed between moving swiftly enough to deliver outputs that can be used operationally and taking the time to consult widely to understand the big issues and garner community buy-in. Having Mer as facilitator of the OSAID co-design, with her in-depth experience, has been important in ensuring the integrity of the process.
The OSAID co-design process

The first step of the OSAID co-design process was to identify the freedoms needed for Open Source AI. After various online and in-person activities and discussions, including five workshops across the world, the community adopted the four freedoms for software, now adapted for AI systems:
- Freedom to Use the system for any purpose and without having to ask for permission.
- Freedom to Study how the system works and inspect its components.
- Freedom to Modify the system for any purpose, including to change its output.
- Freedom to Share the system for others to use with or without modifications, for any purpose.
The next step was the formation of four working groups to initially analyze four different AI systems and their components. To achieve better representation, special attention was given to diversity, equity and inclusion. Over 50% of the working group participants are people of color, 30% are Black, 75% were born outside the US, and 25% are women, trans or nonbinary.
These working groups discussed and voted on which AI system components should be required to satisfy the four freedoms for AI. The components adopted are described in the Model Openness Framework developed by the Linux Foundation.
The vote compilation was performed based on the mean total votes per component (μ). Components that received over 2μ votes were marked as “required,” and between 1.5μ and 2μ were marked “likely required.” Components that received between 0.5μ and μ were marked as “likely not required,” and less than 0.5μ were marked “not required.”
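As a sketch, the thresholding described above can be expressed as a small function. Note that the recap does not name the band between μ and 1.5μ, so the sketch lets it fall through to an "undetermined" label; the function name and structure are illustrative, not taken from the OSI's actual tooling:

```python
def classify_component(votes: float, mu: float) -> str:
    """Classify a component by its vote count relative to the mean (mu).

    Thresholds follow the weekly recap; the band between mu and 1.5*mu
    is not named there, so it falls through to "undetermined" here
    (an assumption of this sketch).
    """
    if votes > 2 * mu:
        return "required"
    if votes >= 1.5 * mu:
        return "likely required"
    if 0.5 * mu <= votes <= mu:
        return "likely not required"
    if votes < 0.5 * mu:
        return "not required"
    return "undetermined"
```

For example, with a mean of μ = 1.0, a component receiving 2.5 votes would be "required" and one receiving 0.3 votes would be "not required".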
After the working groups evaluated legal frameworks and legal documents for each component, each working group published a recommendation report. The end result is the OSAID with a comprehensive definition checklist encompassing a total of 17 components. More working groups are being formed to evaluate how well other AI systems align with the Definition.
OSAID multi-stakeholder co-design process: from component list to a definition checklist

Meet Mer Joyce

Video recorded by Ezequiel Lanza, Open Source AI Evangelist at Intel

I am the process facilitator for the Open Source AI Definition, the Open Source Initiative project creating a definition of Open Source AI that will be a part of the stable public infrastructure of Open Source technology that everyone can benefit from, similar to the Open Source Definition that OSI currently stewards. The co-design of the Open Source AI Definition involves consulting with global stakeholders to ensure their vast range of needs are represented while integrating and weaving together the variety of different perspectives on what Open Source AI should mean.
If you would like to participate in the process, we’re currently on version 0.0.7. We will have a release candidate in June and a stable version in October. There is a public forum at discuss.opensource.org where anyone can create an account and make comments. As different versions are created, updates about our process are released here as well. I am available, as is the executive director of the OSI, to answer questions at bi-weekly town halls that are open for anyone to attend.
How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the working groups: be part of a team to evaluate various models against the OSAID.
- Join the forum: support and comment on the drafts, recording your approval or concerns in new and existing threads.
- Comment on the latest draft: provide feedback on the latest draft document directly.
- Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
- Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Beyond SPDX: expanding licenses identified by ClearlyDefined
ClearlyDefined is an Open Source project that helps organizations with supply chain compliance. Until recently, ClearlyDefined’s tooling only supported licenses that were part of the standardized SPDX license list. Any component identified with a license that was not part of this list resulted in NOASSERTION, which introduced uncertainty about the permissible use of such a component, potentially hindering collaboration and creating legal complexities and security concerns for developers.
Fortunately, Scancode, an integral part of how ClearlyDefined detects and normalizes the origin, dependencies and licensing metadata of components, already supports non-SPDX licenses thanks to its use of LicenseDB. LicenseDB is the largest free and open database of software licenses, covering in particular all the Open Source software licenses, with over 2,000 community-curated license texts and their metadata.
Philippe Ombredanne, the lead author of Scancode and LicenseDB, advocated for ClearlyDefined leveraging this capability already provided by Scancode:
As one of many examples, common public domain dedications are not tracked nor supported by SPDX and are not approved as OSI licenses. Not a single lawyer I know is treating these as proprietary licenses. They are carefully cataloged and properly detected by ScanCode (at least 850+ variants of these at last count plus an infinity of variations detected approximately)…
Collecting data is not endorsing nor promoting anything in particular be it proprietary, open source, free software, source available or else. But rather, just accepting that the world of actual licenses is what it is in all its glorious messy diversity and capturing what these licenses are, without discarding valuable information detected by ScanCode. Discarding and losing data has been the problem until now and has been making ClearlyDefined data mostly harmless and useless at scale as you get better and more information out of a straight ScanCode scan.
You are welcome to use anything you like, but I think it would be better to adopt the de-facto industry standard of ScanCode license data, rather than to reinvent the wheel, especially since ClearlyDefined is kinda using ScanCode rather heavily.
We use a suffix as LicenseRef-scancode in https://scancode-licensedb.aboutcode.org/ and guarantee stability of these with the track record to prove this.
After a healthy discussion on the topic, the ClearlyDefined community agreed that supporting non-SPDX licenses was important. Scancode already provides this functionality, and it offers a mapping from these non-SPDX licenses to SPDX LicenseRef identifiers. Organizations using ClearlyDefined now have the option to decide how to handle non-SPDX licenses based on their own needs. This work to have ClearlyDefined use the latest version of Scancode and support non-SPDX licenses was led by Lukas Spieß from GitHub with stewardship from Qing Tomlinson (from SAP) and E. Lynette Rayle (also from GitHub). We would like to thank them and all those involved in the development and testing of this implementation.
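The naming convention Ombredanne describes can be sketched as follows: license keys with no SPDX identifier are rendered as `LicenseRef-scancode-<key>`, so they remain usable in SPDX license expressions instead of collapsing to NOASSERTION. The lookup table and function below are an illustrative simplification, not ClearlyDefined’s actual code:

```python
# Illustrative sketch of mapping detected license keys to SPDX-usable
# identifiers, following the LicenseRef-scancode naming convention.
# The sample table is a hypothetical stand-in for Scancode's real mapping.

SPDX_IDS = {
    "mit": "MIT",
    "apache-2.0": "Apache-2.0",
    "gpl-2.0": "GPL-2.0-only",
}

def to_spdx_expression(scancode_key: str) -> str:
    """Return the SPDX identifier when one exists; otherwise fall back to
    a stable LicenseRef-scancode-* reference instead of NOASSERTION."""
    if scancode_key in SPDX_IDS:
        return SPDX_IDS[scancode_key]
    return f"LicenseRef-scancode-{scancode_key}"
```

A key like `mit` maps to `MIT`, while a public domain dedication not on the SPDX list would yield something like `LicenseRef-scancode-public-domain-disclaimer`, preserving the information a plain Scancode scan already captures.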
Highlights from AI_dev Paris
On June 19-20, the Linux Foundation hosted AI_dev: Open Source GenAI & ML Summit Europe 2024. This event brought together developers exploring the complex world of Open Source generative AI and Machine Learning. Central to the event was the conviction that Open Source drives innovation in AI. Below are some highlights from AI_dev Paris and how they align with OSI’s work on the Open Source AI Definition.
Keynote: Welcome & Opening Remarks

Ibrahim Haddad, Executive Director of the LF AI & Data Foundation, provided an overview of the major challenges in Open Source AI, which include:
- Lack of a common understanding of openness in AI
- Open Source software licenses used on non-software assets
- Diverse restrictions including the use of Acceptable Use Policies
- Lack of understanding of licenses and implications in the context of AI models
- Incomplete release of model components
To address some of these challenges, Haddad introduced the Model Openness Framework (MOF) and announced the official launch of the Model Openness Tool (MOT) at the conference.
Introducing the Model Openness Framework: Achieving Completeness and Openness in a Confusing Generative AI Landscape

Anni Lai, Matt White, and Cailean Osborne delved into the Model Openness Framework, a comprehensive system for evaluating and classifying the completeness and openness of Machine Learning models. This framework assesses which components of the model development lifecycle are publicly released and under what licenses, ensuring an objective evaluation. Matt White, Executive Director of the PyTorch Foundation and author of the MOF white paper, went on to demonstrate the Model Openness Tool, which evaluates each model across three classes: Open Science (Class I), Open Tooling (Class II), and Open Model (Class III).
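The classification idea can be sketched in a few lines: a model’s class depends on which sets of artifacts are openly released, with each class subsuming the one below it. The component sets here are abbreviated stand-ins for illustration, not the MOF’s exact checklists:

```python
# Simplified illustration of MOF-style classification. A model earns a
# class when all of that class's components are openly released; each
# class includes the components of the class below it. The component
# sets are abbreviated examples, not the MOF's full checklists.

CLASS_III = {"model architecture", "model parameters", "model card"}
CLASS_II = CLASS_III | {"training code", "inference code", "evaluation code"}
CLASS_I = CLASS_II | {"training datasets", "research paper"}

def mof_class(open_components: set) -> str:
    """Return the highest class whose required components are all open."""
    if CLASS_I <= open_components:
        return "Class I: Open Science"
    if CLASS_II <= open_components:
        return "Class II: Open Tooling"
    if CLASS_III <= open_components:
        return "Class III: Open Model"
    return "unclassified"
```

For instance, a model releasing only weights, architecture and a model card would land in Class III, while one that also opens its training data and research paper would reach Class I.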
Model Openness Tool: launched at the Linux Foundation’s AI_dev Paris conference

The Open Source AI dilemma: Crafting a clear definition for Open Source AI

Ofer Hermoni, founder of the LF AI & Data Foundation, continued examining the Model Openness Framework and explained how this framework and its list of components serve as the basis for OSI’s Open Source AI Definition (OSAID). The OSAID evaluates each component on the four fundamental freedoms of Open Source:
- To use the system for any purpose and without having to ask for permission
- To study how the system works and inspect its components
- To modify the system for any purpose, including to change its output
- To share the system for others to use with or without modifications, for any purpose
Lea Gimpel and Daniel Brumund from the Digital Public Goods Alliance (DPGA) emphasized the importance of democratizing AI through digital public goods, including Open Source software, open AI models, open data, open standards, and open content. Lea highlighted that, while open data is desirable, it is not a requirement. She supported the OSI’s Open Source AI Definition, as it helps the DPGA navigate legal uncertainties around data sharing and broadens the pool of potential solutions that can be recognized, marketed, and made available as digital public goods, thereby offering more opportunities to positively impact people’s lives.
Conclusion

It was clear throughout this conference that the work to create a standard Open Source AI Definition that upholds the fundamental freedoms of Open Source is vital for addressing some of the key challenges in AI and ML development and democratization. The OSI appreciates the Linux Foundation’s collaboration toward this goal and its commitment to hosting another successful event to facilitate these important discussions.
Open Source AI Definition – Weekly update July 1
- Last week @quaid suggested conducting a controlled experiment to determine if data information alone is sufficient to recreate an AI model with fidelity to the original. He shared insights from the OpenVLA project, noting its possible compliance with the requirements of draft v0.0.8 and suggesting a test suite to compare models created with full datasets versus data information.
- To this, @Stefano noted that there also are some master students at CMU who are conducting similar experiments to “kick the tires” of the draft definition.
- @quaid proposed more precise criteria for evaluating model similarity, such as “functionally similar” or “practically similar” and further suggested detailing the values sought from open data datasets to improve the experiment’s framework.
- @hook shared a research paper they found interesting and relevant, titled Rethinking open source generative AI: open-washing and the EU AI Act.
- This paper had been shared before by its author @mark and discussed in the context of whether the OSAID should recognize partially open licenses, arguing that doing so would limit open washing: “I think providers and users of LLMs should not be free to create oil spills in our information landscape and I think RAIL provides useful guardrails for that.” This would highlight the “degrees of openness.”
- The paper also presents its findings in a visualization of the degrees of openness of different systems.
- This is a point we have discussed before; note that the OSAID will not recognize degrees of openness but will be binary. See the week 22 summary for the context of this discussion.
- We held our 12th town hall meeting last week. You can access the recording and slides here if you missed it. The town hall presented some ideas for the next draft of the Definition, making it clear that there is no agreement yet on the data information concept and that part is still subject to change.
- A new town hall meeting is scheduled for Friday, July 12.
Open Source AI Definition – Weekly update June 24
Following @stefano’s publication regarding why the OSI considers training data to be “optional” under the checklist in Open Source AI Definition, the debate has continued. Here are the main points:
- Preferred Form of Modification
- @hartmans states that finding agreement on the meaning of “preferred form of modification” depends on the user’s objectives. The disagreement may stem from different priorities in ranking the freedoms associated with Open Source AI, though he emphasizes prioritizing model weights for practical modifications. He suggested that data information could be more beneficial than raw data for understanding models and urged flexibility in AI definitions.
- @shujisado highlighted that training data for machine learning models is a preferred form of modification but questioned if it is the most preferred. He further emphasized the need for a flexible definition for preferred forms of modification in AI.
- @quaid supported the idea of conducting controlled experiments to determine if data information alone is sufficient to recreate AI models accurately. He suggested practical steps for testing the effectiveness of data information and encouraged community participation in such experiments.
- @jberkus raised concerns about the practical assessment of data information and its ability to facilitate the recreation of AI systems. He questioned how to evaluate data information without recreating the AI system.
- Practical Applications and Community Insights
- @hartmans proposed practical scenarios where data information could suffice for modifying AI models and suggested that the community’s flexibility in defining the preferred form of modification has been valuable for Debian.
- @quaid shared insights from his research on the OpenVLA project, noting its compliance with OSAID requirements. He further proposed conducting controlled experiments to verify if data information is enough to recreate models with fidelity.
- General observations
- @shujisado emphasized the need for flexible definitions in AI, drawing from open source community experiences. He agreed on the complexity of training data issues and supported OSI’s flexible approach to defining the preferred form of modification.
- @quaid suggested practical approaches for evaluating data information and its adequacy for recreating AI models and proposed further experiments and community involvement to refine the understanding and application of data information in open-source AI.
- @jberkus asked whether OSAID will apply to licenses or systems, noting that current drafts focus on systems. He questioned if a certification program for reviewing systems as open source or proprietary is the intended direction.
- @shujisado confirmed that discussions are moving towards certifying AI systems and pointed at an existing thread. He emphasized the need for evaluating individual components of AI systems and expressed concern about OSI’s capacity to establish a certification mechanism, highlighting that it would significantly expand OSI’s role.