Open Source Initiative

The steward of the Open Source Definition, setting the foundation for the Open Source Software ecosystem.

Hailey Schoelkopf: Voices of the Open Source AI Definition

Thu, 2024-07-25 07:45

The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Hailey Schoelkopf

What’s your background related to Open Source and AI?

One of the main reasons I was able to get more deeply involved in AI research was through open research communities such as the BigScience Workshop and EleutherAI, where discussions and collaboration were available to outsiders. These opportunities to share knowledge and learn from others more experienced than me were crucial to learning about the field and growing as a practitioner and researcher.

I co-led the training of the Pythia language models (https://arxiv.org/abs/2304.01373), some of the first fully documented and reproducible large-scale language models, released with as many related artifacts as possible as Open Source. We were happy and lucky to see these models fill a clear need, especially in the research community, where Pythia has since contributed to a large number of studies attempting to build our understanding of LLMs, including interpreting their internals, understanding the process by which these models improve over training, and disentangling some of the effects of the dataset contents on these models’ downstream behavior.
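
For example, loading one of Pythia’s intermediate training checkpoints takes only a few lines (a minimal sketch assuming the Hugging Face transformers library; pythia-70m with revision step3000 is one of the published checkpoints):

    # Load an intermediate Pythia checkpoint: step revisions expose the model
    # at different points during training, enabling studies of training dynamics.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/pythia-70m"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, revision="step3000")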

What motivated you to join this co-design process to define Open Source AI?

There has been a significant amount of confusion induced by the fact that not all ‘open-weights’ AI models are released under OSI-compliant licenses, and some impose restrictions on their usage or adaptation, so I was excited that OSI was working on reducing this confusion by producing a clear definition that could be used by the Open Source community. I more directly joined the process by helping discuss how the Open Source AI Definition could be mapped onto the Pythia language models and the accompanying artifacts we released.

Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?

Deciding what counts as sufficient transparency and modifiability to be Open Source was an interesting problem. Although public model weights are very beneficial to the Open Source community, releasing weights without enough detail about the model and its development process, detail needed to make modifications or to understand the reasons behind its design and resulting characteristics, can hinder study and prevent the full benefits of a completely Open Source model from being realized.

Why do you think AI should be Open Source?

There are clear advantages to having models that are Open Source. Access to such fully documented models can help a much, much broader group of people, trained researchers and many others alike, who can use, study, and examine these models for their own purposes. While not every model should be made Open Source under all conditions, wider scrutiny and study of these models can help increase our understanding of AI systems’ behavior, raise societal preparedness and awareness of AI capabilities, and improve these models’ safety by allowing more people to understand them and explore their flaws.

With the Pythia language models, we’ve seen many researchers explore questions around the safety and biases of these models, including a breadth of questions we’d not have been able to study ourselves, or many that we could not even anticipate. These different perspectives are a crucial component in making AI systems safer and more broadly beneficial.

What do you think is the role of data in Open Source AI?

Data is a crucial component of AI systems. Transparency around (and, potentially, open release of) training datasets can enable a wide range of extended benefits to researchers, practitioners, and society at large. I think that for a model to be truly Open Source, and to derive the greatest benefits from its openness, information on training data must be shared transparently. Importantly, this information also allows members of the Open Source community to avoid independently replicating each other’s work. Transparent sharing of the motivations and findings behind dataset creation choices can improve the community’s collective understanding of system and dataset design and minimize overlapping, wasted effort.

Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?

An interesting perspective that I’ve grown to appreciate is that the Open Source AI definition includes public and Open Source licensed training and inference code. Actually making one’s Open Source AI model effectively usable by the community and practitioners is a crucial step of promoting transparency, though not often enough discussed.

What do you think the primary benefit will be once there is a clear definition of Open Source AI?

Having a clear definition of Open Source AI can make it clearer where existing “open” systems currently fall, and potentially encourage future open-weights models to be released with more transparency. Many current open-weights models are shared under bespoke licenses with terms not compliant with Open Source principles; this creates legal uncertainty and also makes it less likely that a new open-weights model release will benefit practitioners at large or contribute to a better understanding of how to design better systems. I would hope that a clearer Open Source AI definition will make it easier to draw these lines and encourage those currently releasing open-weights models to do so in a way more closely fitting the Open Source AI standard.

What do you think are the next steps for the community involved in Open Source AI?

An exciting future direction for the Open Source AI research community is to explore methods for greater control over AI model behavior; attempting to explore approaches to collective modification and collaborative development of AI systems that can adapt and be “patched” over time. A stronger understanding of how to properly evaluate these systems for capabilities, robustness, and safety will also be crucial. I hope to see the community direct greater attention to evaluation in the future as well.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the working groups: be part of a team to evaluate various models against the OSAID.
  • Join the forum: support and comment on the drafts, record your approval or concerns to new and existing threads.
  • Comment on the latest draft: provide feedback on the latest draft document directly.
  • Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Categories: FLOSS Research

OSI at the United Nations OSPOs for Good

Wed, 2024-07-24 18:57

Earlier this month the Open Source Initiative participated in the “OSPOs for Good” event promoted by the United Nations in NYC. Stefano Maffulli, the Executive Director of the OSI, participated in a panel moderated by Mehdi Snene about Open Source AI alongside distinguished speakers Ashley Kramer, Craig Ramlal, Sasha Luccioni, and Sergio Gago. Please find below a transcript of Stefano’s presentation.

Mehdi Snene  

What is Open Source in AI? What does it mean? What are the foundational pieces? How far along is the data? There is mention of weights and of data. How can we truly understand what Open Source in AI is? Today, joining us, we’ll have someone who can help us understand what Open Source in AI means and where we are heading. Stefano, can you offer your insights?

Stefano Maffulli  

Thanks. We have some thoughts on this. We’ve been pondering these questions since they first emerged when GPT started to appear. We asked ourselves: How do we transfer the principles of permissionless innovation and the immense value created by the Open Source ecosystem into the AI space?

After a little over two years of research and global conversations with multiple stakeholders, we identified three key elements. Firstly, permissionless innovation needs to be ported to AI, but this is complex and must be broken down into smaller components.

We realized that, as developers, users, and deployers of AI systems, we need to understand how these systems are built. This involves studying all components carefully, being able to run them for any purpose without asking for permission (a basic tenet of Open Source), and modifying them to change outputs based on the same inputs. These basic principles include being able to share these modifications with others.

To achieve this, you need data, the code used for training and cleaning the data (e.g., removing duplicates), the parameters, the weights, and a way to run inference on those weights. It’s fairly straightforward. However, the challenge lies in the legal framework.

Now, the complicated piece is that Open Source software has had a wonderful run because the legal framework that governs it is fairly simple and globally accepted. It’s built on copyright, a system that has worked wonderfully in both directions: it gives exclusive rights to content creators, but the same mechanism can be used to grant rights to anyone who receives the creation.

With data, we don’t have that mechanism. That is a very simple and dramatic realization. When we talk about data, we should pay attention to what kind of data we’re discussing. There is data as created content, and there is data as facts, like fires, speed limits, or the traces of a road. Those are facts, and they have different ways of being treated. There is also private data, personal information, and various other kinds of data, each with different rules and regulations around the world.

Governments’ major role in the future will be to facilitate permissionless innovation in data by harmonizing these rules. This will level the playing field, where currently larger corporations have significantly more power than Open Source developers or those wishing to create large language models. Governments should help create datasets, remove barriers, and facilitate access for academia, smaller developers, and the global south.

Mehdi Snene  

We already have open data and Open Source. Now, we need to create open AI and open models. Are we bringing these two domains together, keeping them separate, or creating something new from scratch when we talk about open AI?

Stefano Maffulli 

This is a very interesting and powerful question. I believe that open data as a movement has been around for quite a while. However, it’s only recently that data scientists have truly realized the value they hold in their hands. Data is fungible and can be used to build new things that are completely different from its original domain.

We need to talk more about this and establish platforms for better interaction. One striking example is a popular dataset of images used for training many image generation AI tools, which contained child sexual abuse images for many years. A research paper highlighted this huge problem, but no one filed a bug report, and there was no easy way for the maintainers of this dataset to notice and remove those images.

There are things that the software world understands very well, and things that data scientists understand very well. We are starting to see the need for more space for interactions and learning from each other.

The conversation is extremely complicated. Alex and I have had long discussions about this. I don’t want to focus entirely on this, but I do want to say that Open Source has never been about pleasing companies or specific stakeholders. We need to think of it as an ecosystem where the balances of power are maintained.

While Open Source software and Open Source AI are still evolving, the necessary ingredients—data, code, and other components—are there. However, the data piece still needs to be debated and finalized. Pushing for radical openness with data has clear drawbacks and issues. It’s going to be a balance of intentions, aiming for the best outcome for the general public and the whole ecosystem.

Mehdi Snene  

Thank you so much. My next question is about the future. What are your thoughts on the next big technology?

Stefano Maffulli 

From the perspective of open innovation, it’s about what’s going to give society control over technology. The focus of Open Source has always been to enable developers and end-users to have sovereignty over the technology they use. Whether it’s quantum computers, AI, or future technologies, maintaining that control is crucial.

Governments need to play a role in enabling innovation and ensuring that no single power becomes too dominant. The balance between the private sector, public sector, nonprofit sector, and the often-overlooked fourth sector—which includes developers and creators who work for the public good rather than for profit—must be maintained. This balance is essential for fostering an ecosystem where all stakeholders have equal interests and influence.


If you would like to listen to the panel discussion in its entirety, you can do so here (the Open Source AI panel starts at 1:00:00 approximately).

Categories: FLOSS Research

Better identifying conda packages with ClearlyDefined

Tue, 2024-07-23 19:17

ClearlyDefined, an Open Source project that helps organizations with supply chain compliance, now provides a new harvester implementation for conda, a popular package manager with a large collection of pre-built packages for various domains, including data science, machine learning, scientific computing and more.

Conda provides package, dependency and environment management for any language and is very popular with Python and R. It allows users to manage and control the dependencies and versions of packages specific to each project, ensuring reproducibility and avoiding conflicts between different software requirements.

ClearlyDefined crawls both the main conda package and its source code for licensing metadata. The main conda package is hosted on the conda channels themselves and contains all the licensing information, compilers, environment configuration scripts and dependencies needed to make the package work. The source code from which the conda package is created is often hosted on an external website such as GitHub.

The conda crawler uses the following coordinates:

  • type (required): conda or condasource
  • provider (required): channel on which the package will be crawled, such as conda-forge, anaconda-main or anaconda-r
  • namespace (optional): architecture and OS of the package to be crawled, e.g. win64 or linux-aarch64, or any if no architecture is specified
  • package name (required): name of the package
  • revision (optional): package version and optional build version

For example, the popular numpy package on the conda-forge channel might be represented with coordinates like the following (the version and build strings are illustrative):
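
    conda/conda-forge/linux-aarch64/numpy/1.16.6-py36hdc1b780_0
    condasource/conda-forge/any/numpy/1.16.6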

With the growing importance of data science, machine learning and scientific computing, this support for conda packages in ClearlyDefined is a significant addition. It will allow organizations to better manage the licenses of their conda packages for compliance. This work was led by Basit Ayantunde from CodeThink with stewardship from Qing Tomlinson from SAP. We would like to thank them and all those involved in the development and testing of this implementation.

Categories: FLOSS Research

Cailean Osborne: voices of the Open Source AI Definition

Thu, 2024-07-18 13:09

The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process. Today, we are featuring Cailean Osborne, one of the volunteers who have helped shape and are still shaping the OSAID.

Question: What’s your background related to Open Source and AI?

My interest in Open Source AI began around 2020 when I was working in AI policy at the UK Government. I was surprised that Open Source never came up in policy discussions, given its crucial role in AI R&D. Having been a regular user of libraries like scikit-learn and PyTorch in my previous studies, I followed Open Source AI trends in my own time, and eventually I decided to do a PhD on the topic. When I started my PhD back in 2021, Open Source AI still felt like a niche topic, so it’s been exciting to watch it become a major talking point over the years.

Beyond my PhD, I’ve been involved in the Open Source AI community as a contributor to scikit-learn and as a co-developer of the Model Openness Framework (MOF) with peers from the Generative AI Commons community. Our goal with the MOF is to provide guidance for AI researchers and developers to evaluate the completeness and openness of “Open Source” models based on open science principles. We were chuffed that the OSI team chose to use the 16 components from the MOF as the rubric for reviewing models in the co-design process.

Question: What motivated you to join this co-design process to define Open Source AI?

The short answer is: to contribute to establishing an accurate definition for “Open Source AI” and to learn from all the other experts involved in the co-design process. The longer answer is: There’s been a lot of confusion about what is or is not “Open Source AI,” which hasn’t been helped by open-washing. “Open source” has a specific definition (i.e. the right to use, study, modify, and redistribute source code) and what is being promoted as “Open Source AI” deviates significantly from this definition. Rather than being pedantic, getting the definition right matters for several reasons; for example, for the “Open Source” exemptions in the EU AI Act to work (or not work), we need to know precisely what “Open Source” models actually are. Andreas Liesenfeld and Mark Dingemanse have written a great piece about the issues of open-washing and how they relate to the AI Act, which I recommend reading if you haven’t yet. So, I got involved to help develop a definition and to learn from all the other experts involved. It hasn’t been easy (it’s a pretty divisive topic!), but I think we’ve made good progress.

Question: Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?

First off, I have to give credit to Stef and Mer for maintaining momentum throughout the process. Coordinating a co-design effort with volunteers scattered around the globe, each with varying levels of availability and (strong) opinions on the matter, is no small feat. So, well done! I also enjoyed seeing how others agreed or disagreed when reviewing models. The moments of disagreement were the most interesting; for example, about whether training data should be available versus documented and if so, in how much detail… Personally, the main challenge was searching for information about the various components of models that were apparently “Open Source” and observing how little information was actually provided beyond weights, a model card, and if you’re lucky an arXiv preprint or technical report.

Question: Why do you think AI should be Open Source?

When talking about the benefits of Open Source AI, I like to point folks to a 2007 paper in which 16 researchers highlighted “The Need for Open Source Software in Machine Learning,” given the near-complete lack of OSS for ML/AI at the time. Fast forward to today: AI R&D is practically unthinkable without OSS, from data tooling to the deep learning frameworks used to build LLMs. Open source and openness in general have many benefits for AI, from enabling access to SOTA AI technologies, to transparency (key for reproducibility, scrutiny, and accountability), to widening participation in their design, development, and governance.

Question: What do you think is the role of data in Open Source AI?

If the question is strictly about the role of data in developing open AI models, the answer is pretty simple: Data plays a crucial role because it is needed for training, testing, aligning, and auditing models. But if the question is asking “should the release of data be a condition for an open model to qualify as Open Source AI,” then the answer is obviously much more complicated. 

Companies are in no rush to share training data for a handful of reasons: competitive advantage, data protection, or, frankly, the risk of being sued for copyright infringement. The copyright concern isn’t limited to companies: EleutherAI has also been sued and had to take down the Books3 dataset from The Pile. There are also many social and cultural concerns that restrict data sharing; for example, the Kōrero Kaitiakitanga license has been developed to protect the interests of indigenous communities in New Zealand. So, the data question isn’t easy, and perhaps we shouldn’t be too dogmatic about it.

Personally, I think the compromise in v. 0.0.8, which states that model developers should provide sufficiently detailed information about data if they can’t release the training dataset itself, is a reasonable halfway house. I also hope to see more open pre-training datasets like the one developed by the community-driven BigScience Project, which involved open deliberation about the design of the dataset and provides extensive documentation about data provenance and processing decisions (e.g. check out their Data Catalogue). The FineWeb dataset by Hugging Face is another good example of an open pre-training dataset, which they released with pre-processing code, evaluation results, and super detailed documentation.

Question: Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?

To be honest, my personal definition hasn’t changed much. I am not a big fan of the use of “Open Source AI” when folks specifically mean “open models” or “open-weight models”. What we need to do is raise awareness about appropriate terminology and point out “open-washing”, as people have done, and I must say that subjectively I’ve seen improvements: less “Open Source models” and more “open models”. But I will say that I do find “Open Source AI” a useful umbrella term for the various communities of practice that intertwine in the development of open models, including OSS, open data, and AI researchers and developers, who all bring different perspectives and ways of working to the overarching “Open Source AI” community.

Question: What do you think the primary benefit will be once there is a clear definition of Open Source AI?

We’ll be able to reduce confusion about what is or isn’t “Open Source AI” and more easily combat open-washing efforts. As I mentioned before, this clarity will be beneficial for compliance with regulations like the AI Act which includes exemptions for “Open Source” AI.  

Question: What do you think are the next steps for the community involved in Open Source AI?

We still have many steps to take but I’ll share three for now.

First, we urgently need to improve the auditability and therefore the safety of open models. With OSS, we know that (1) the availability of source code and (2) open development enable the distributed scrutiny of source code. Think Linus’ Law: “Given enough eyeballs, all bugs are shallow.” Yet open models are more complex than just source code, and the lack of openness of many key components like training data is holding back adoption, because would-be adopters can’t adequately run due diligence tests on the models. If we want to realise the benefits of “Open Source AI,” we need to figure out how to increase the transparency and openness of models; we hope the Model Openness Framework can help with this.

Second, I’m really excited about grassroots initiatives that are leading community-driven approaches to developing open models and open datasets, like the BigScience project. They’re setting an example of how to do “Open Source AI” in a way that promotes open collaboration, transparency, reproducibility, and safety from the ground up. I can still count such initiatives on my fingers, but I am hopeful that we will see more community-driven efforts in the future.

Third, I hope to see the public sector and non-profit foundations get more involved in supporting public-interest and grassroots initiatives. France has been a role model on this front: providing a public grant to train the BigScience project’s BLOOM model on the Jean Zay supercomputer, as well as funding the scikit-learn team to build out a data science commons.

Categories: FLOSS Research

The Open Source Initiative joins CMU in launching Open Forum for AI: A human-centered approach to AI development

Mon, 2024-07-15 16:32

The Open Source Initiative (OSI) is pleased to share that we are joining the founding team of Open Forum for AI (OFAI), an initiative designed by Carnegie Mellon University (CMU) to foster a human-centered approach to artificial intelligence. OFAI aims to enhance our understanding of AI and its potential to augment human capabilities while promoting responsible development practices.

The missions of OSI and OFAI are well-aligned; at the heart of OFAI is a commitment to ensuring that AI development serves the public interest. With the support of renowned partners like Omidyar Network, NobleReach Foundation, and internal CMU funding, OFAI is positioned to serve as a pivotal platform for shaping AI strategies and policies that prioritize safety, privacy, and equity.

The OSI is proud to be part of this project. Stefano Maffulli and Deb Bryant from the OSI will participate in OFAI, integrating their efforts toward a standard Open Source AI Definition, through a collaborative process involving stakeholders from the Open Source community, industry and academia, as well as their contributions to public policy.

A collective effort

The success of OFAI hinges on the diverse expertise it convenes. Leading this initiative is Sayeed Choudhury, Associate Dean for Digital Infrastructure at CMU and a member of the OSI Board. Alongside him, a team of CMU faculty members and external advisors will contribute knowledge in ethics, computational technologies, and inclusive AI research. 

Notable participants like Michele Jawando from Omidyar Network and Arun Gupta from NobleReach Foundation have emphasized the importance of Open Source AI in driving innovation and inclusivity as well as the need for a human-centered, trust-based approach to AI development.

OFAI’s ambitious goals

OFAI aims to influence AI policy by coordinating research and policy objectives and advocating for transparent and inclusive AI development. The initiative will focus on five key areas: 

  • Research
  • Technical prototypes
  • Policy recommendations
  • Community engagement
  • Talent for service

Deb Bryant will lead Community Engagement, building in part upon the broad community of interest gathered through the public process of OSI’s Defining Open Source AI.

One of OFAI’s foundational projects is the creation of an “Openness in AI” framework, which seeks to make AI development more transparent and inclusive. This framework will serve as a vital resource for policymakers, researchers, and the broader community.

Looking ahead

With the OSI set to deliver a stable version of the Open Source AI Definition at All Things Open in October, the launch of OFAI magnifies the importance of this work to bring together diverse stakeholders to ensure AI technologies align with societal values and public interests.

Categories: FLOSS Research

Open Source AI Definition – Weekly update July 15

Mon, 2024-07-15 15:26

It has been quiet on the forums over the 4th of July weekend, and OSI has been speaking at different events:

Why and how to certify Open Source AI
  • @jberkus expresses concern about the extensive resources required to certify AI systems, estimating that it would take weeks of work per system. This scale makes it impractical for a volunteer committee like License Review.
  • @shujisado reflects on past controversies over license conformity, noting that Open Source AI has the potential for a greater economic impact than early Open Source. He acknowledges the need for a more robust certification process given this increased significance. He suggests that cooperation from the machine learning community or consortia might be necessary to address technical issues and monitor the certification process neutrally. He offers to help spread the word about OSAID within the Japanese ML/LLM development community.

@jberkus clarifies that the OSI would need full-time paid staff to handle the certifications, as the work cannot be managed by volunteers alone.

Categories: FLOSS Research

Mer Joyce: voices of the Open Source AI Definition

Wed, 2024-07-10 09:36

The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process. We’ll be featuring the voices of the volunteers who have helped shape and are shaping the Definition.

The OSI started researching the topic in 2022, and in 2023 began the co-design process of a new definition of Open Source that applies to AI. The OSI hired Mer Joyce, founder and principal of Do Big Good, as an independent consultant to lead the co-design process. She has worked for over a decade at the intersection of research, policy, innovation and social change.

Mer Joyce, process facilitator for the Open Source AI Definition

About co-design

Co-design, also called participatory or human-centered design, is a set of creative methods used to solve communal problems by sharing knowledge and power. The co-design methodology addresses the challenges of reaching an agreed definition within a diverse community (Costanza-Chock, 2020; Escobar, 2018; Creative Reaction Lab, 2018; Friedman et al., 2019).

As noted in MIT Technology Review’s article about the OSAID, “[t]he open-source community is a big tent… encompassing everything from hacktivists to Fortune 500 companies…. With so many competing interests to consider, finding a solution that satisfies everyone while ensuring that the biggest companies play along is no easy task.” (Gent, 2024). 

The co-design method allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support from such a significant and broad group of people also creates a tension to be managed between moving swiftly enough to deliver outputs that can be used operationally and taking the time to consult widely to understand the big issues and garner community buy-in. Having Mer as facilitator of the OSAID co-design, with her in-depth experience, has been important in ensuring the integrity of the process. 

The OSAID co-design process

The first step of the OSAID co-design process was to identify the freedoms needed for Open Source AI. After various online and in-person activities and discussions, including five workshops across the world, the community adopted the four freedoms for software, now adapted for AI systems:

  • Freedom to Use the system for any purpose and without having to ask for permission.
  • Freedom to Study how the system works and inspect its components.
  • Freedom to Modify the system for any purpose, including to change its output.
  • Freedom to Share the system for others to use with or without modifications, for any purpose.

The next step was the formation of four working groups to initially analyze four different AI systems and their components. To achieve better representation, special attention was given to diversity, equity and inclusion. Over 50% of the working group participants are people of color, 30% are Black, 75% were born outside the US, and 25% are women, trans or nonbinary.

These working groups discussed and voted on which AI system components should be required to satisfy the four freedoms for AI. The components adopted are described in the Model Openness Framework developed by the Linux Foundation.

The vote compilation was performed based on the mean total votes per component (μ). Components that received over 2μ votes were marked as “required,” and between 1.5μ and 2μ were marked “likely required.” Components that received between 0.5μ and μ were marked as “likely not required,” and less than 0.5μ were marked “not required.”
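
Expressed as code, the compilation rule reads roughly as follows (a sketch: the function and variable names are ours, and the band between μ and 1.5μ is returned as unassigned because the text does not assign it):

    def classify(votes: float, mu: float) -> str:
        # Thresholds taken from the vote-compilation rule described above.
        if votes > 2 * mu:
            return "required"
        if 1.5 * mu <= votes <= 2 * mu:
            return "likely required"
        if 0.5 * mu <= votes <= mu:
            return "likely not required"
        if votes < 0.5 * mu:
            return "not required"
        return "unassigned"  # the text does not cover the band between mu and 1.5*mu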

After the working groups evaluated legal frameworks and legal documents for each component, each working group published a recommendation report. The end result is the OSAID with a comprehensive definition checklist encompassing a total of 17 components. More working groups are being formed to evaluate how well other AI systems align with the Definition.

OSAID multi-stakeholder co-design process: from component list to a definition checklist

Meet Mer Joyce

Video recorded by Ezequiel Lanza, Open Source AI Evangelist at Intel

I am the process facilitator for the Open Source AI Definition, the Open Source Initiative project creating a definition of Open Source AI that will be a part of the stable public infrastructure of Open Source technology that everyone can benefit from, similar to the Open Source Definition that OSI currently stewards. The co-design of the Open Source AI Definition involves consulting with global stakeholders to ensure their vast range of needs are represented while integrating and weaving together the variety of different perspectives on what Open Source AI should mean.

If you would like to participate in the process, we’re currently on version 0.0.7. We will have a release candidate in June and a stable version in October. There is a public forum at discuss.opensource.org where anyone can create an account and make comments. As different versions are created, updates about our process are released here as well. I am available, as is the executive director of the OSI, to answer questions at bi-weekly town halls that are open for anyone to attend.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the working groups: be part of a team to evaluate various models against the OSAID.
  • Join the forum: support and comment on the drafts, record your approval or concerns to new and existing threads.
  • Comment on the latest draft: provide feedback on the latest draft document directly.
  • Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
One of the many OSAID workshops organized by Mer Joyce around the world
Categories: FLOSS Research

Beyond SPDX: expanding licenses identified by ClearlyDefined

Tue, 2024-07-09 14:26

ClearlyDefined is an Open Source project that helps organizations with supply chain compliance. Until recently, ClearlyDefined’s tooling only supported licenses that were part of the standardized SPDX license list. Any component whose license was not part of this list was recorded as NOASSERTION, introducing uncertainty about the permissible use of such a component, potentially hindering collaboration and creating legal complexities and security concerns for developers.

Fortunately, ScanCode, an integral part of how ClearlyDefined detects and normalizes the origin, dependencies and licensing metadata of components, already supports non-SPDX licenses thanks to its use of LicenseDB. LicenseDB is the largest free and open database of software licenses, in particular all the Open Source software licenses, with over 2,000 community-curated license texts and their metadata.

Philippe Ombredanne, the lead author of ScanCode and LicenseDB, argued that ClearlyDefined should leverage this capability already provided by ScanCode:

As one of many examples, common public domain dedications are not tracked nor supported by SPDX and are not approved as OSI licenses. Not a single lawyer I know is treating these as proprietary licenses. They are carefully cataloged and properly detected by ScanCode (at least 850+ variants of these at last count plus an infinity of variations detected approximately)…

Collecting data is not endorsing nor promoting anything in particular be it proprietary, open source, free software, source available or else. But rather, just accepting that the world of actual licenses is what it is in all its glorious messy diversity and capturing what these licenses are, without discarding valuable information detected by ScanCode. Discarding and losing data has been the problem until now and has been making ClearlyDefined data mostly harmless and useless at scale as you get better and more information out of a straight ScanCode scan.

You are welcome to use anything you like, but I think it would be better to adopt the de-facto industry standard of ScanCode license data, rather than to reinvent the wheel, especially since ClearlyDefined is kinda using ScanCode rather heavily.

We use a suffix, as in LicenseRef-scancode, in https://scancode-licensedb.aboutcode.org/ and guarantee stability of these, with the track record to prove this.

After a healthy discussion on the topic, the ClearlyDefined community agreed that supporting non-SPDX licenses was important. ScanCode already provides this functionality, and it offers a mapping from these non-SPDX licenses to SPDX LicenseRef identifiers. Organizations using ClearlyDefined now have the option to decide how to handle non-SPDX licenses based on their own needs. This work to have ClearlyDefined use the latest version of ScanCode and support non-SPDX licenses was led by Lukas Spieß from GitHub with stewardship from Qing Tomlinson (from SAP) and E. Lynette Rayle (also from GitHub). We would like to thank them and all those involved in the development and testing of this implementation.
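
As an illustration of that mapping, the fallback behavior can be sketched as follows (the SPDX subset and function are illustrative, not ClearlyDefined’s actual implementation; the LicenseRef-scancode naming follows the convention quoted above):

    # Licenses on the SPDX list keep their identifier; anything else is
    # mapped to a stable ScanCode LicenseRef instead of NOASSERTION.
    SPDX_IDS = {"MIT", "Apache-2.0", "GPL-3.0-only"}  # tiny illustrative subset

    def to_spdx_expression(scancode_key: str) -> str:
        if scancode_key in SPDX_IDS:
            return scancode_key
        return f"LicenseRef-scancode-{scancode_key}"

    print(to_spdx_expression("MIT"))                      # MIT
    print(to_spdx_expression("public-domain-disclaimer")) # LicenseRef-scancode-public-domain-disclaimer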

Categories: FLOSS Research

Highlights from AI_dev Paris

Wed, 2024-07-03 13:34

On June 19-20, the Linux Foundation hosted AI_dev: Open Source GenAI & ML Summit Europe 2024. This event brought together developers exploring the complex world of Open Source generative AI and Machine Learning. Central to this event is the conviction that Open Source drives innovation in AI. Please find below some highlights from AI_dev Paris and how they are aligned with OSI’s work on the Open Source AI Definition.

Keynote: Welcome & Opening Remarks

Ibrahim Haddad, Executive Director of the LF AI & Data Foundation, provided an overview of the major challenges in Open Source AI, which include:

  • Lack of a common understanding of openness in AI
  • Open Source software licenses used on non-software assets
  • Diverse restrictions including the use of Acceptable Use Policies
  • Lack of understanding of licenses and implications in the context of AI models
  • Incomplete release of model components

To address some of these challenges, Haddad introduced the Model Openness Framework (MOF) and announced the official launch of the Model Openness Tool (MOT) at the conference.

Introducing the Model Openness Framework: Achieving Completeness and Openness in a Confusing Generative AI Landscape

Anni Lai, Matt White, and Cailean Osborne delved into the Model Openness Framework, a comprehensive system for evaluating and classifying the completeness and openness of Machine Learning models. This framework assesses which components of the model development lifecycle are publicly released and under what licenses, ensuring an objective evaluation. Matt White, Executive Director of the Pytorch Foundation and author of the MOF white paper, went on to demonstrate the Model Openness Tool, which evaluates each model across 3 classes: Open Science (Class I), Open Tooling (Class II), and Open Model (Class III).
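
To make the three classes concrete, class-style evaluation can be sketched as follows (a toy sketch only: the component groupings and function are illustrative assumptions, not the MOT’s actual rubric or code):

    # A model earns a class only if it openly releases every component in that
    # class's group, which includes the groups of all lower classes.
    OPEN_MODEL = {"architecture", "final weights", "inference code"}          # Class III
    OPEN_TOOLING = OPEN_MODEL | {"training code", "data preprocessing code"}  # Class II
    OPEN_SCIENCE = OPEN_TOOLING | {"datasets", "research paper"}              # Class I

    def mof_class(released: set) -> str:
        if OPEN_SCIENCE <= released:
            return "Class I: Open Science"
        if OPEN_TOOLING <= released:
            return "Class II: Open Tooling"
        if OPEN_MODEL <= released:
            return "Class III: Open Model"
        return "unclassified"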

Model Openness Tool: launched at the Linux Foundation’s AI_dev Paris conference

The Open Source AI dilemma: Crafting a clear definition for Open Source AI

Ofer Hermoni, founder of the LF AI & Data Foundation, continued examining the Model Openness Framework and explained how this framework and its list of components serve as the basis for OSI’s Open Source AI Definition (OSAID). The OSAID evaluates each component on the four fundamental freedoms of Open Source:

  • To use the system for any purpose and without having to ask for permission
  • To study how the system works and inspect its components
  • To modify the system for any purpose, including to change its output
  • To share the system for others to use with or without modifications, for any purpose

Toward AI Democratization with Digital Public Goods

Lea Gimpel and Daniel Brumund from the Digital Public Goods Alliance (DPGA) emphasized the importance of democratizing AI through digital public goods, including Open Source software, open AI models, open data, open standards, and open content. Lea highlighted that, while open data is desirable, it is not a precondition. She supported the OSI’s Open Source AI Definition, as it helps the DPGA navigate legal uncertainties around data sharing and broadens the pool of potential solutions that can be recognized, marketed, and made available as digital public goods, thereby offering more opportunities to positively impact people’s lives.

Conclusion

It was clear throughout this conference that the work to create a standard Open Source AI Definition that upholds the fundamental freedoms of Open Source is vital for addressing some of the key challenges in AI and ML development and democratization. The OSI appreciates the Linux Foundation’s collaboration toward this goal and its commitment to hosting another successful event to facilitate these important discussions.

Categories: FLOSS Research

Open Source AI Definition – Weekly update July 1

Mon, 2024-07-01 11:48

An open call to test OpenVLA
  • Last week @quaid suggested conducting a controlled experiment to determine if data information alone is sufficient to recreate an AI model with fidelity to the original. He shared insights from the OpenVLA project, noting its possible compliance with the requirements of draft v0.0.8 and suggesting a test suite to compare models created with full datasets versus data information.
    • To this, @Stefano noted that there are also some master’s students at CMU who are conducting similar experiments to “kick the tires” of the draft definition.
    • @quaid proposed more precise criteria for evaluating model similarity, such as “functionally similar” or “practically similar” and further suggested detailing the values sought from open data datasets to improve the experiment’s framework.
Interesting research paper: “Rethinking open source generative AI: open-washing and the EU AI Act”

Open Source AI Definition Town Hall – June 28, 2024
  • We held our 12th town hall meeting last week. You can access the recording and slides here if you missed it. The town hall presented some ideas for the next draft of the Definition, making it clear that there is no agreement yet on the data information concept and that part is still subject to change.
  • A new town hall meeting is scheduled for Friday, July 12.
Categories: FLOSS Research

Open Source AI Definition – Weekly update June 24

Mon, 2024-06-24 15:36

Explaining the concept of Data information

Following @stefano’s publication regarding why the OSI considers training data to be “optional” under the checklist in the Open Source AI Definition, the debate has continued. Here are the main points:

  • Preferred form of modification
    • @hartmans states that reaching agreement on the meaning of “preferred form of modification” depends on the user’s objectives. The disagreement may stem from different priorities in ranking the freedoms associated with Open Source AI, though he emphasizes prioritizing model weights for practical modifications. He suggested that data information could be more beneficial than raw data for understanding models and urged flexibility in AI definitions.
    • @shujisado highlighted that training data for machine learning models is a preferred form of modification but questioned whether it is the most preferred. He further emphasized the need for a flexible definition of the preferred form of modification in AI.
    • @quaid supported the idea of conducting controlled experiments to determine if data information alone is sufficient to recreate AI models accurately. He suggested practical steps for testing the effectiveness of data information and encouraged community participation in such experiments.
      • @stefano added that some students at CMU will run this kind of experiment (whether the full training dataset is needed, or data information is enough to recreate a model that can be tested for fidelity to the original) to test the definition.
    • @jberkus raised concerns about the practical assessment of data information and its ability to facilitate the recreation of AI systems. He questioned how to evaluate data information without recreating the AI system.
  • Practical applications and community insights
    • @hartmans proposed practical scenarios where data information could suffice for modifying AI models and suggested that the community’s flexibility in defining the preferred form of modification has been valuable for Debian.
    • @quaid shared insights from his research on the OpenVLA project, noting its compliance with OSAID requirements. He further proposed conducting controlled experiments to verify if data information is enough to recreate models with fidelity.
  • General observations
    • @shujisado emphasized the need for flexible definitions in AI, drawing from Open Source community experiences. He agreed on the complexity of training data issues and supported OSI’s flexible approach in defining the preferred form of modification.
    • @quaid suggested practical approaches for evaluating data information and its adequacy for recreating AI models, and proposed further experiments and community involvement to refine the understanding and application of data information in Open Source AI.

Are we evaluating Licenses or Systems?
  • @jberkus asked whether OSAID will apply to licenses or systems, noting that current drafts focus on systems. He questioned if a certification program for reviewing systems as open source or proprietary is the intended direction.
  • @shujisado confirmed that discussions are moving towards certifying AI systems and pointed at an existing thread. He emphasized the need for evaluating individual components of AI systems and expressed concern about OSI’s capacity to establish a certification mechanism, highlighting that it would significantly expand OSI’s role.
Categories: FLOSS Research

Open Source AI Definition – Weekly update June 17

Mon, 2024-06-17 12:52

Explaining the concept of Data information
  • After much debate regarding training data, @stefano published a summary of the positions expressed and some clarifications about the terminology included in draft v0.0.8. You can read the rationale and share your thoughts on the forum.
  • Initial thoughts:
    • @Senficon (Felix Reda) adds that while the discussion has highlighted the case for data information, it’s crucial to understand the implications of copyright law on AI, particularly concerning access to training data. Open Source software relies on a legal element (copyright licenses) and an access element (availability of source code). However, this framework does not seamlessly apply to AI, as different copyright regimes allow text and data mining (TDM) for AI training but not the redistribution of datasets. This discrepancy means that requiring the publication of training datasets would make Open Source AI models illegal, despite TDM exceptions that facilitate AI development. Also, public domain status is not consistent internationally, complicating the creation of legally publishable datasets. Consequently, a definition of Open Source AI that imposes releasing datasets would impede collaborative improvements and limit its practical significance. Emphasizing data information can help maintain Open Source principles without legal pitfalls.

Concerns and feedback on anchoring on the Model Openness Framework
  • @amcasari expresses concern about the usability and neutrality of the “Model Openness Framework” (MOF) for identifying AI systems, suggesting it doesn’t align well with current industry practices and isn’t ready for practical application without further feedback and iteration.
  • @shujisado points out that the MOF’s classification of components doesn’t depend on the specific IP laws applied, but rather on a general legal framework, and highlights that Japan’s IP law system differs from the US and EU, yet finds discussions based on the OSD consistent.
  • @stefano emphasizes the importance of having well-thought-out, timeless principles in the Open Source AI Definition document, while viewing the Checklist as a more frequently updated working document. He also supports the call to see practical examples of the framework in use and proposes separating the Checklist from the main document to reduce confusion.

Initial Report on Definition Validation
  • Reviews of eleven different AI systems have been published. We do these reviews to check existing systems’ compatibility with our current definition. These are the systems in question: Arctic, BLOOM, Falcon, Grok, Llama 2, Mistral, OLMo, OpenCV, Phi-2, Pythia, and T5.
    • @mer has set up a review sheet for the Viking model upon request from @merlijn-sebrechts.
    • @anatta8538 asks if MLOps is considered within the topic of the Model Openness Framework and whether CLIP, an LMM, would be consistent with the OSAID.
    • @nick clarifies that the evaluation focuses on components as described in the Model Openness Framework, which includes development and deployment aspects but does not cover MLOps as a whole.

Why and how to certify Open Source AI
  • @Alek_Tarkowski agrees that certification of Open Source AI will be crucial under the AI Act and highlights the importance of defining what constitutes an Open Source license. He points out the confusion surrounding terms like “free and open source license” and suggests that the issue of responsible AI licensing as a form of Open Source licensing needs resolution. He notes that some restrictive licenses are gaining traction and may need consideration for exemption from regulation, thus urging for a consensus.

Open Source AI Definition Town Hall – June 14, 2024

Slides and the recording of our previous town hall meeting can be found here.

Categories: FLOSS Research

Explaining the concept of Data information

Fri, 2024-06-14 09:53

There seems to be some confusion caused by the concept of Data information included in the draft v0.0.8 of the Open Source AI Definition. Some readers may have seen the original dataset included in the list of optional components and quickly jumped to the wrong conclusions. This post clarifies how the draft arrived at its current state, the design principles behind the Data information concept and the constraints (legal and technical) it operates under.

The objective of the Open Source AI Definition

The objective of the Open Source AI Definition is to replicate in the context of artificial intelligence (AI) the principles of autonomy, transparency, frictionless reuse, and collaborative improvement for end users and developers of AI systems. These are described in the preamble.

Following the preamble is the definition of Open Source AI, an adaptation of the definition of Free Software (also known as “the four freedoms”) to AI nomenclature. The preamble and the four freedoms have been co-designed over several meetings and public discussions, online and in-person, and have not recently received significant comments. 

The Free Software definition specifies that a precondition to the freedom to study and modify a program is to have access to the source code. Source code is defined as “the preferred form of the program for making changes in.” Draft v0.0.8 contains a description of what’s necessary to enjoy the freedoms to study and modify an AI system. This new section, titled “Preferred form to make modifications to machine-learning systems,” has generated a heated debate.

What is the preferred form to make modifications

The concept of “preferred form to make modifications” focuses on machine learning systems because these systems require data and training to produce a working system. Other AI systems are more easily classifiable as software and don’t require a special definition. 

The system analysis phase of the co-design process revealed that studying and modifying machine learning systems requires data, code for training and inference, and model parameters. For the parameters, there’s no ambiguity: an Open Source AI must make them available under terms that respect the Open Source principles (no field-of-use restrictions, no discrimination against people, etc.). For the data and code requirements, the text in the “preferred form to make modifications” section is longer and harder to parse, generating some confusion.

The intent of the code and data requirements is to ensure that end users, deployers and developers of an Open Source AI system have all the tools and instructions to recreate that AI system from scratch, satisfying the freedoms to study and modify the system. At a high level, it seems sensible to require that training datasets be released under permissive licenses for a system to qualify as Open Source AI.

On closer examination, however, it became clear that sharing the original datasets is full of traps. It actually puts Open Source at a disadvantage compared to opaque and proprietary AI systems.

The issue with data

Data is not software: The legal landscape for data is much wider than copyright. Aggregating large datasets and distributing them internationally is an endless nightmare that includes privacy laws, copyright, sui-generis rights, patents, secrets and more. Without diving deeper into legal issues, let’s focus on practical examples to clarify why the distribution of the training dataset is not spelled out as a requirement in the concept of Data information.

  • The Pile, the open dataset used to train the very open Pythia models, was taken down after an alleged copyright infringement, currently being litigated in the United States. However, the Pile appears to be legal to share in Japan. It’s also unclear whether it can be legally shared in the European Union. 
  • DOLMA, the open dataset used to train the very open OLMo models, was initially released with a restrictive license and later switched to a permissive one. On further inspection, DOLMA appears to suffer from the same legal uncertainties as the Pile; however, the Allen Institute has not been sued yet.
  • Training techniques that preserve privacy like federated learning don’t create datasets. 

All these cases show that requiring the original datasets creates vagueness and uncertainty in applying the Open Source AI Definition:

  • If a dataset is only legal in Japan, is that AI Open Source only in Japan?
  • If a dataset is initially legally available but later retracted, does the AI go from being Open Source to not?
    • If so, what happens to the applications that use such AI?
  • If no dataset is created, then will any AI trained with such techniques ever be Open Source?

Additionally, there are reasons to believe that OpenAI, Anthropic and other proprietary systems have been trained on the same questionable data found in The Pile and DOLMA; proving that’s the case is much harder and more expensive, though. This is clearly a disincentive to being open and transparent about data sources, adding a burden to the organizations that try to do the right thing.

As a solution to these questions, draft v0.0.8 introduces the concept of Data information, coupled with code requirements, to obtain the expected result: enabling end users, developers and deployers of AI systems to reproduce an Open Source AI.

Understanding the concept of Data information

Data information, in the draft Open Source AI Definition, is defined as: 

Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.

Read that from the end: the intention of Data information is to allow developers to recreate a substantially equivalent system using the same or similar data. In other words, an Open Source AI must disclose all the ingredients, where to source them, and all the instructions to prepare the dish.

This solution came out of the co-design process, where reviewers didn’t rank the training datasets as highly as they ranked the training code and data transparency requirements.

Data information and the code requirements also address all of the questions around the legality of distributing data and datasets, or their absence.

If a dataset is only legal in Japan or becomes illegal later, one should still be able to recreate a dataset suitable for training an equivalent system, replacing the illegal or unavailable pieces with similar ones.

AI systems trained with federated learning (where a dataset isn’t created) can still be Open Source AI if all instructions and code are released, so that a new training run with different data can generate an equivalent system. The sketch below illustrates why no shareable dataset ever exists in this setting.
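To make this concrete, here is a minimal sketch of federated averaging in Python (a hypothetical toy example, not any specific project’s code): each client trains on its own private data, and only weight vectors ever reach the server, so no central dataset exists that could be released.

```python
# Minimal sketch of federated averaging (FedAvg) on a toy linear-regression
# task. Hypothetical example for illustration; names and numbers are ours.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, features, labels, lr=0.1, epochs=5):
    """One client's local training (plain gradient descent on squared error).
    The client's data never leaves this function."""
    w = weights.copy()
    for _ in range(epochs):
        grad = features.T @ (features @ w - labels) / len(labels)
        w -= lr * grad
    return w

# Each client holds its own private data; no pooled dataset is ever built.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

global_w = np.zeros(2)
for _ in range(10):
    # The server distributes the global weights; clients return updated ones.
    updates = [local_update(global_w, X, y) for X, y in clients]
    # The server sees only weight vectors, never a single training example.
    global_w = np.mean(updates, axis=0)

print(global_w)  # approaches [2.0, -1.0] without any shared dataset
```

Everything needed to recreate an equivalent system here is the code and the training procedure; the one thing that cannot be released is a dataset, because none exists.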

The Data information concept also solves an example (raised on the forum) of an AI system trained on data licensed directly from Reddit. In this case, if the original developers released enough information to allow another AI developer to recreate a substantially equivalent system with Reddit data taken from an existing dataset, like CommonCrawl, it would be considered Open Source AI.

The proposed alternatives

While generally well received, draft v0.0.8 has been criticized by a few people on the forum for putting the training dataset in the “optional requirements”. Some suggestions and pushback we’ve received:

  • Require the use of synthetic data when the training dataset cannot be legally shared: This technique may work in some corner cases, if the technology evolves to be reliable enough. It’s expensive and untested at scale.
  • Classify as Open Source AI systems where all their components are “open source”: This approach ignores the longstanding practice of the GNU project, which accepts system library exceptions and other compromises in exchange for more Open Source tools.
  • Datasets built by crawling the internet are the equivalent of theft, so they shouldn’t be allowed at all, let alone allowed in Open Source AI: This pushback ignores the reality that large data aggregators have already legally acquired the rights to accumulate that same data (through scraping and terms of use) and are trading it, exclusively capturing the economic value of what should be in the commons. Read Towards a Books Data Commons for AI Training for more details. There is no general agreement that text and data mining is equivalent to theft.

These demands and suggestions are hard to accept. We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones. We need a Definition that contains positive examples from the start so we can practically demonstrate positive qualities to policymakers. 

The discussion about data, and how to create incentives to build datasets that can be distributed internationally, safely and with privacy preserved, is extremely complex. It can be addressed separately from the Open Source AI Definition. In collaboration with the Open Future Foundation and others, OSI is designing a series of conferences to tackle the data governance issue. We’ll make an announcement soon.

Have your say now

The concepts of Data information and code requirements are hard to grasp at first. But the preliminary results of the validation phase confirm that draft v0.0.8 works as expected: Pythia and OLMo would both be Open Source AI, while Falcon, Grok, Llama and Mistral would not (even if they used OSD-compatible licenses) because they don’t share Data information. BLOOM and StarCoder would fail because of field-of-use restrictions in their model licenses.

Data information can be improved, but it’s better than the other solutions proposed so far. As we get closer to the release of the stable version of the Open Source AI Definition, we need to hear from you: if you support this concept, please comment on the forum today. If you don’t support it, please try to propose an alternative that at least covers the practical examples above: the Pile, DOLMA and federated learning. Help the community move the conversation forward.

Continue the conversation in the forum

Categories: FLOSS Research

Open Source AI Definition – Weekly update June 10

Tue, 2024-06-11 17:40
Open Source AI needs to require data to be viable
  • With many different discussions happening at once, here are the main points:
    • On the issue of training data
      • @mark is concerned that openness of AI is not meaningful without a focus on the training data: “Model weights are the most inscrutable component of current generative AI, and providers that release only [the weights] should not get a free ‘openness’ pass.”
      • @stefano agrees with all of that but questions the criteria used to assign green marks in Mark’s paper, pointing out inconsistencies. He uses the example of Pythia-Chat-Base-7B, which relies on a dataset from OpenDataHub with potential issues like non-versioned data and stale links, failing to meet the stringent requirements proposed by @juliaferraioli. Similar concerns are raised for other models like OLMo 7B Instruct, which lack specific data versioning details. Maffulli also highlights the case of Pythia-7B, which may once have been compliant but is now problematic due to the unavailability of its foundational dataset, the Pile, illustrating the complexity of maintaining “open source” status over time if the stringent proposal from @juliaferraioli and the AWS team is adopted.
      • @shujisado adds that while he sympathizes with @juliaferraioli‘s request for datasets, @stefano‘s arguments in support of the concept of “Data information” are aligned with the OSI principles and are reasonable.
      • @spotaws stresses that “data information” alone is insufficient if the data itself is too vague.
      • @juliaferraioli adds that while replicating AI systems like OLMo or Pythia may seem impractical due to cost and their statistical nature, the capability is crucial for broader adoption and consistency. She finds the current definition unclear and subjective.
      • @zack recommends reviewing StarCoder2, recognizing that it would be in the same category as BLOOM: a system with lots of transparency and a dataset made available, but released with a restrictive license.
      • @Ezequiel_Lanza joined the conversation in support of the concept of Data information, arguing on technical grounds that “sharing the dataset is not necessarily required and may not justify the potential risks associated with making it mandatory.”
    • Partially open / restrictive licenses
      • Continuing @mark’s points regarding restrictive licenses (like the ethical licenses), @stefano added a link to an article highlighting some reasons why OSI is staying away from these licenses.
      • @pchestek further adds that a partially open license would create even more opportunities for open washing, as “open source AI” could have many meanings.
      • @mark clarified that rather than proposing a variety of meanings, they are seeking to highlight the dimensions of openness in their paper, exploring the broader landscape.
      • @stefano adds that in the 26 years of OSI, it has contended with numerous organizations claiming varying degrees of openness as “open source.” This issue is now mirrored in AI, as companies seek the market value of being labeled Open Source. Open Source is binary: either users have full rights or they don’t, and any system that falls short is not Open Source AI, regardless of how “almost” open it is.
    • Field of use/restrictions
      • @juliaferraioli believes that the OSAID should include prohibitions against field-of-use restrictions.
      • @shujisado adds that the OSAID specifies four freedoms as requirements for being considered Open Source, and that this should be understood as equivalent, since “freedom” implies “non-restricted”. The 10 clauses of the OSD have been replaced by the checklist in draft v0.0.8.
      • @juliaferraioli adds that individual components may be covered by their individual licenses, but the overall system may be subject to additional terms, which is why this needs to be explicit.
Initial Report on Definition Validation
  • @Mer has added how far we are regarding our system analysis compared to our current draft definition. Some points that remain incomplete have been highlighted.
  • Mistral (Mixtral 8x7B) is considered not in alignment with the OSAID because its data pre-processing code is not released under an OSI-approved license.
Can a derivative of non-open-source AI be considered Open Source AI?
  • @tarek_ziade shares his experience fine-tuning a “small” model (200M parameters) for a Firefox feature to describe images, using a base model for image encoding and text decoding. Despite not having 100% traceability of upstream data, Tarek argues that intentional fine-tuning and transparency make the new fine-tuned model open source. Any issues arising from downstream data can be addressed by the project maintainers, maintaining the model’s open source status.
Town hall recording out
  • We held our 10th town hall meeting a week and a half ago. You can access the recording here if you missed it.
  • A new town hall meeting is scheduled for this Friday, June 14.
Categories: FLOSS Research

Contributions of Open Source to AI: a panel discussion at CPDP-ai conference

Tue, 2024-06-04 05:00

I participated as a panelist at the CPDP-ai 2024 conference in Brussels last week where we discussed the significant contributions of Open Source to AI and highlighted the specific properties that differentiate Open Source AI from proprietary solutions. Representing the Open Source Initiative (OSI), the globally recognized non-profit that defines the term Open Source, I emphasized the longstanding principle of granting users full agency and control over technology, which has been proven to deliver extensive social benefits.

Below is a glimpse at the questions and answers posed to me and my fellow panelists:

Question: Stefano, please explain what the contribution to AI from Open Source is, and if there are specific properties of Open Source AI that make a difference for the users and for the people who are confronted with its results.

Response: The Definition of Open Source Software has existed for over 25 years; it doesn’t apply to AI. The Open Source Definition for software provides a stable north star for all participants in the digital ecosystem, from small and large companies to citizens and governments.

The basic principle of the Open Source Definition is to grant to the users of any technology full agency and control over the technology itself. This means that users of Open Source technologies have self-sovereignty of the technical solutions.

The Open Source Definition has demonstrated that massive social benefits accrue when you remove the barriers to learning, using, sharing and improving software systems. There is ample evidence that giving users agency, control and self-sovereignty of their technical choices produces a viable ecosystem based on permissionless innovation. Multiple studies by the EU Commission and Harvard researchers have assigned significant economic value to Open Source Software, all based on that single, clear, understood and approved Definition from 26 years ago.

For AI, and especially the most recent machine learning solutions, it’s less clear how society can maintain self-sovereignty of the technology and how to achieve permissionless innovation. Despite the fact that many people talk about Open Source AI, including the AI Act, there is no shared understanding of what that means, yet!

The Open Source Initiative is concluding a global, multi-stakeholder co-design process to find an unequivocal definition of Open Source AI, and we’re heading towards the conclusion of this process with a vastly increased knowledge of the AI machine learning space. The current draft of the Open Source AI Definition recognizes that in order to study, use, share and modify AI, one needs to refer to an AI system, not a single individual component. The global process has identified the components required for society to maintain control of the technology and these are: 

  • Detailed information about the dataset used to train the system and the code so that a skilled person can train a system with similar capabilities
  • All the libraries and tools used to run training and inference
  • The model architecture and the parameters, like weights and biases

Having unrestricted access to all these elements is what makes an AI an Open Source AI.

We’re in the final stretch of the process, starting to gather support for the current draft of the definition.

The most controversial part of the discussion is the role of data in the training. To answer your question about the power of big foreign tech companies, putting aside the hardware requirements, the data is where the fight is. There seem to be two views of the world on data when it comes to AI: one holds that text and data mining is basically strip-mining humanity, and that all accumulation of data without the consent of the rights holders must be made illegal. The other holds that text and data mining for the purpose of training Open Source AI is probably the only antidote to the superpowers of large corporations. These camps haven’t found a common position yet. Japan seems to have made up its mind already, legalizing unrestricted text and data mining. We’ll see where the lawsuits in the US go, whether they ever reach a decision in court or, as I suspect, are settled out of court. 

In any case, data, competence and to some extent hardware, are the levers to control the development of AI. 

Open Source has been leveling the playing field of technologies. We know from past experience with Open Source software that giving people unrestricted access to the means of digital production enables tremendous economic value. This worked in Europe as well as in China. We think that Open Source AI can have the same effect of generating value while leaving control of the technology in the hands of society.

Question: Big tech companies are important for the development of AI. Apart from the purely technological impacts, there is also economic importance. The European Commission has been very concerned about the Digital Single Market recently, and has initiated legislation such as DSA and DMA to improve competition and market access. Will these instruments be sufficient in view of AI roll-out, thinking also of the recently adopted AI Act? Or will additional attention need to be paid?

Response: Open is the best antidote to the concentration of power. That said, I see these pieces of legislation as the sticks, and very necessary ones. I’d love us to think also about carrots. We don’t want to repeat the mistakes of the early years of the internet: Open Source software was equally available in the US and Europe, but despite that, the few European champions of Open Source haven’t grown big enough to have a global impact. And some of the biggest EU companies aren’t exactly friendly with Open Source either. 

Chinese companies have taken a different approach. But in Europe we have talent, and we have an attractive quality of life, so we can attract even more talent. Finding money is never an issue. We need to remove the disincentives to growing our companies bigger, widen access to the internal EU market and support their international expansion, too.

For example, we need to review European Regulation 1025 on standardization to accommodate Open Source. Regulation 1025 was written at a time when Open Source was considered a “business model” and information and communication technology standards were about voltages in a wire. Today, Open Source is between 80% and 90% of all software, and “digital elements” comprise some part of every modern product. Even hardware solutions are dominated by “digital elements.” As such, the approach taken by Regulation 1025 is out of date and most likely needs a root-and-branch rethink to properly apply to the world today and the world we anticipate tomorrow.

We need to make sure that the standardization rules required by the Cyber Resilience Act are written together with Open Source champions, so the rules don’t exclusively favor the cartel of European patent holders who try to seek rent instead of innovating. Europe has all the means to be at the center of AI innovation; it embodies the right values of diversity and collaboration. 

Closing remarks: We think that Open Source is the best antidote to fight market concentration in AI. Data is where the concentration of power is happening now and it’s in the hands of massive corporations: not only Google, Meta, Amazon, Reddit but also Sony, Warner, Netflix, Getty Images, Adobe … All these companies have already gained access to massive amounts of data, legally. These companies basically own our data, legally: Our pictures, the graph of our circles of friends, all the books and movies… 

There is a risk that if we don’t write policies that allow text and data mining in exchange for a real Open Source AI (one that society can fully control), we will leave the most powerful AI systems in the hands of the oligopoly that can afford to trade money for access to data.

Categories: FLOSS Research

Open Source AI Definition – Weekly update June 3

Mon, 2024-06-03 14:27
Initial report on definition validation
  • A first draft of the report of the validation phase has been published. The validation phase is designed to review the compatibility of existing systems with the current draft definition. These are the systems in question: Arctic, BLOOM, Falcon, Grok, Llama 2, Mistral, OLMo, OpenCV, Phi-2, Pythia, and T5.
  • Problems and initial findings:
    • Elusive documents: Not having system creators involved meant reviewers had to independently search for legal documents, resulting in many blanks in the document list and subsequent analysis.
    • One component, many artifacts, and documents: Some components were linked to multiple artifacts and documents, complicating the review process as source code and documentation could be spread across several repositories and reports.
    • Compounded components: Components in the checklist often combined multiple artifacts, such as training and validation code, making it difficult to track down specific legal documents.
    • Compliant? Conformant? Six out of eleven required components need a legal framework that is “compliant” or “conformant” with the Open Source Definition, prompting a need for clearer guidance on reviewing non-software components.
    • Reverting to the license: Reviewers suggested simplifying the process by relying on whether a legal document is OSI-approved, conformant, or compliant to guarantee the right to use, study, modify, and share the component, eliminating the need for independent assessment.
  • Next steps:
    • As we are looking to fill in the gaps from above we call on both system creators and independent volunteers to complete various system reviews. 
    • If a system you’re familiar with is not on the list, contact Mer on the forum.
  • Initial questions and queries:
    • @jasonbrooks asks if the validation process should check if there’s “sufficiently detailed information about the data used to train the system so a skilled person can recreate a substantially equivalent system.” It’s unclear if this has been confirmed, and examples of skilled individuals achieving this would be helpful.
      • @stefano replies that the Preferred form lists enduring principles, while the Checklist details required components. Validation ensures components like training methodologies and data provenance are available, enabling system recreation. Mer’s report highlights the difficulty in finding these components, suggesting a need for a better method. One idea is a detailed survey for AI developers, though companies like Meta might misuse the “Open Source” label. Public pressure may eventually deter such abuses.
    • @amcasari adds insights into the process of reviewing licenses.
Open Source AI needs to require data to be viable 
  • This week, the conversation shifted heavily toward the possibilities of creating a gradient approach to open licensing.
  • @mark has shared that he is publishing a paper regarding open washing, the AI Act, and a case for a gradient notion of openness.
    • In line with previous points mostly raised by @danish_contractor, Mark highlights the RAIL licenses and argues that they should count towards openness too, stating: “I think providers and users of LLMs should not be free to create oil spills in our information landscape and I think RAIL provides useful guardrails for that.”
    • He also presents his visualization of the degrees of openness of different systems.
  • @stefano has reiterated that the Open Source AI Definition will remain binary, just like the Open Source Definition is binary. Responding to @mark and @danish_contractor, he linked to Kate Downing’s legal analysis of the RAIL licensing framework.
Can a derivative of non-open-source AI be considered Open Source AI? 
  • Answering @stefano’s earlier questions, @mark adds that it’s challenging to fine-tune a model without knowing the initial training data and techniques. Examples like Meta and Mistral fine-tunes show success despite the lack of transparency in the original training data. Intel’s Neural 7B and AllenAI’s Tulu 70B demonstrate effective fine-tuning with detailed disclosure of fine-tuning steps and data. However, these efforts can’t qualify as truly open AI systems due to the closed nature of the base models and potential legal liabilities.
  • @stefano closed the topic stating that, based on feedback, “Derivatives of non-Open Source AI cannot be Open Source AI”
Why and how to certify Open Source AI
  • @amscott added that AI developers will likely self-certify compliance with the OSAID, with objective certification needed for arbitration in nuanced cases. Like the OSD, the OSAID will mature through community practice. A simple self-certification tool could promote transparency and document good practices.
  • @mark added that the EU AI Act emphasizes “Open Source” systems, offering exemptions attractive to companies like Meta and Mistral. The AI Act requires disclosure templates overseen by an AI Office, potentially leading to intense lobbying efforts. If Open Source organizations influence regulation and certification, transparency may strengthen the Open Source ecosystem.
Question regarding the 0.0.8 definition 
  • Question from @Jennifer Ding regarding why “information” is a focus for the data category and not the code and model categories.
  • @Matt White adds that OSD-Conformant (in the checklist) should be defined somewhere.
    • He further adds (regarding Data information, under the checklist) that many “open” models withhold various forms of data, making it unreasonable to expect model producers to release all the information necessary for full replication of the data pipeline if data is not a required component of the definition.
  • @Michael Dolan adds that “the use of OSD-compliant and OSD-conformant without any definitions of either term is difficult to parse the meaning of” and suggests some solutions.
OSAID at PyCon US
  • Missing a recap of how we got to where we are now? OSI was present at PyCon in Pittsburgh, where we held a workshop on our current definition and spoke with many knowledgeable stakeholders. You can read about it here.
Categories: FLOSS Research

OSI at PyCon US: engaging with AI practitioners and developers as we reach OSAID’s first release candidate

Wed, 2024-05-29 08:00

As part of the Open Source AI Definition roadshow and as we approach the first release candidate of the draft, the Open Source Initiative (OSI) participated in PyCon US 2024, the annual gathering of the Python community. This opportunity was important because PyCon US brings together AI practitioners and developers alike, and having their input regarding what constitutes Open Source AI is of the utmost value. The OSI organized a workshop and had a community booth there.

OSAID Workshop: compiling a FAQ to make the definition clear and easy to use

The OSI has embarked on a co-design process with multiple stakeholders to arrive at the Open Source AI Definition (OSAID). This process has been led by Mer Joyce, the co-design expert and facilitator, and Stefano Maffulli, the executive director of the OSI.

At the workshop organized at PyCon US, Mer provided an overview of the co-design process so far, summarized below.

The first step of the co-design process was to identify the freedoms needed for Open Source AI. After various online and in-person activities and discussions, including five workshops across the world, the community identified four freedoms:

  1. To Use the system for any purpose and without having to ask for permission.
  2. To Study how the system works and inspect its components.
  3. To Modify the system for any purpose, including to change its output.
  4. To Share the system for others to use with or without modifications, for any purpose.

The next step was to form four working groups to initially analyze four AI systems. To achieve better representation, special attention was given to diversity, equity and inclusion. Over 50% of the working group participants are people of color, 30% are black, 75% were born outside the US and 25% are women, trans and nonbinary.

These working groups discussed and voted on which AI system components should be required to satisfy the four freedoms for AI. The components we adopted are described in the Model Openness Framework developed by the Linux Foundation.

The vote compilation was based on the mean total votes per component (μ). Components that received more than 2μ votes were marked as required, and those between 1.5μ and 2μ were marked likely required. Components that received between 0.5μ and μ were marked likely not required, and those below 0.5μ were marked not required.
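As an illustration, here is a minimal sketch of that compilation rule in Python, with hypothetical vote counts (the bands are as stated above; the gap between μ and 1.5μ is not assigned a label in this write-up, so the sketch leaves it undetermined):

```python
# Sketch of the vote-compilation rule, with hypothetical vote totals.
from statistics import mean

votes = {  # invented numbers, for illustration only
    "model parameters": 30,
    "training code": 22,
    "inference code": 9,
    "training dataset": 6,
    "model card": 3,
}

mu = mean(votes.values())  # mean total votes per component

def classify(v, mu):
    if v > 2 * mu:
        return "required"
    if 1.5 * mu <= v <= 2 * mu:
        return "likely required"
    if 0.5 * mu <= v <= mu:
        return "likely not required"
    if v < 0.5 * mu:
        return "not required"
    return "undetermined"  # mu < v < 1.5*mu isn't covered by the stated bands

for component, v in votes.items():
    print(f"{component}: {classify(v, mu)}")
```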

The working groups evaluated legal frameworks and legal documents for each component. Finally, each working group published a recommendation report. The end result is the OSAID with a comprehensive definition checklist encompassing a total of 17 components. More working groups are being formed to evaluate how well other AI systems align with the definition.

OSAID multi-stakeholder process: from component list to a definition checklist

After providing an overview of the co-design process, Mer went on to organize an exercise with the participants to compile a FAQ.

The questions raised at the workshop revolved around the following topics:

  • End user comprehension: how and why are AI systems different from Open Source software? As an end-user, why should they care if an AI system is open?
  • Datasets: Why is data itself not required? Should Open Source AI datasets be required to prove copyright compliance? How can one audit these systems for bias without the data? What does data provenance and data labeling entail?
  • Models: How can proper attribution of model parameters be enforced? What is the ownership/attribution of model parameters which were trained by one author and then “fine-tuned” by another?
  • Code: Can projects that include only source code (no data info or model weights) still use a regular Open Source license (MIT, Apache, etc.)?
  • Governance: For a specific AI, who determines whether the information provided about the training, dataset, process, etc. is “sufficient” and how?
  • Adoption of the OSAID: What are incentives for people/companies to adopt this standard?
  • Legal weight: Is the OSAID supposed to have legal weight?

These questions and answers raised at the workshop will be important for enhancing the existing FAQ, which will be made available along with the OSAID.

OSAID workshop: a collection of post-its with questions raised by participants.

Community Booth: gathering feedback on the “Unlock the OSAID” visualization

At the community booth, the OSI held two activities to draw in participants interested in Open Source AI. The first activity was a quiz developed by Ariel Jolo, program coordinator at the OSI, to assess participants’ knowledge of Python and AI/ML. Once we had an understanding of their skills, we moved on to the second and main activity: gathering feedback on the OSAID using a novel way to visualize how different AI systems match the current draft definition, as described below.

Making it easy for different stakeholders to visualize whether or not an AI system matches the OSAID is a challenge, especially because there are so many components involved. This is where the visualization concept we named “Unlock the OSAID” came in. 

The OSI keyhole is a well-recognized logo that represents the source code that unlocks the freedoms to use, study, modify, and share software. With the Unlock the OSAID, we played on that same idea, but now for AI systems. We displayed three keyholes representing the three domains these 17 components fall within: code, model and data information.

Here is the image representing the “code keyhole” with the required components to unlock the OSAID:

On the inner ring we have the required components to unlock the OSAID, while on the outer ring we have optional components. The required code components are: libraries and tools; inference; training, validation and testing; data pre-processing. The optional components are: inference for benchmark and evaluation code. 

To fully unlock the OSAID, an AI system must have all the required components for code, model and data information. To better understand how the “Unlock the OSAID” visualization works, let’s look at two hypothetical AI systems: example 1 and example 2.

Let’s start looking at example 1 (in red) and see if this system unlocks the OSAID for code:

Example 1 only provides inference code, so the key (in red) doesn’t “fit” the code keyhole (in green).

Now let’s look at example 2 (in blue):

Example 2 provides all required components (and more), so the key (in blue) fits the code keyhole (in green). Therefore, example 2 unlocks the OSAID for code. For example 2 to be considered Open Source AI, it would also have to unlock the OSAID for model and data information.

We received good feedback from participants about the “Unlock the OSAID” visualization. Once participants grasped the concept of the keyholes and which components were required or optional, it was easy to identify if an AI system unlocks the OSAID or not. They could visually see if the keys fit the keyholes or not. If all keys fit, then that AI system adheres to the OSAID.
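In code terms, the keyhole check amounts to a set-inclusion test. Here is a minimal sketch in Python, using the code-keyhole components listed above (our illustration, not an official OSI tool):

```python
# Sketch of the "Unlock the OSAID" check for the code keyhole.
REQUIRED_CODE = {
    "libraries and tools",
    "inference",
    "training, validation and testing",
    "data pre-processing",
}

def unlocks_code_keyhole(provided):
    """The key fits only if every required component is present;
    optional components never affect the result."""
    return REQUIRED_CODE <= provided

example_1 = {"inference"}                       # the red key
example_2 = REQUIRED_CODE | {"benchmark code"}  # the blue key, extras included

print(unlocks_code_keyhole(example_1))  # False: the key doesn't fit
print(unlocks_code_keyhole(example_2))  # True: unlocks the OSAID for code
```

An AI system would then be Open Source AI only if the analogous checks for the model and data information keyholes also pass.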

Final thoughts: engaging with the community and promoting Open Source principles

For me, the highlight of PyCon US was the opportunity to finally meet members of the OSI and the Python community in person, both new and old acquaintances. I had good conversations with Deb Nicholson (Python Software Foundation), Hannah Aubry (Fastly), Ana Hevesi (Uploop), Tom “spot” Callaway (AWS), Julia Ferraioli (AWS), Tony Kipkemboi (Streamlit), Michael Winser (Alpha-Omega), Jason C. MacDonald (OWASP), Cheuk Ting Ho (CMD Limes), Kamile Demir (Adobe), Mariatta Wijaya (PSF), Loren Clary (PSF) and Miaolai Zhou (AWS). I also interacted with many folks from the following communities: Python Brazil, Python en Español, PyLadies and Black Python Devs. It was also great to bump into legends like Seth Larson (PSF), Peter Wang (Anaconda) and Guido van Rossum.

I loved all the keynotes, in particular from Sumana Harihareswara about how she has improved Python Software Foundation’s infrastructure, and from Simon Willison about how we can all benefit from Open Source AI.

We also had a dinner hosted by Stefano to celebrate this special milestone of the OSAID, with Stefano, Mer and me overlooking Pittsburgh.

Overall, our participation at PyCon US was a success. We shared the work OSI has been doing toward the first release candidate of the Open Source AI Definition, and we did it in an entertaining and engaging way, with plenty of connection throughout.

Photo credits: Ana Hevesi, Mer Joyce, and Nick Vidal

Categories: FLOSS Research

Open Source AI Definition – Weekly update May 27

Tue, 2024-05-28 04:41
Open Source AI needs to require data to be viable
  • @juliaferraioli and the AWS team have reopened the debate regarding access to training data, in a new forum thread that mirrors concerns raised in a previous one. They argue that to achieve modifiability, an AI system must ship with the original training dataset used to train it, and that full transparency and reproducibility require the release of all datasets used to train, validate, test, and benchmark. For Ferraioli, data is the equivalent of source code for AI systems, so its inclusion should not be optional. In a message signed by the AWS Open Source team, she proposed that original training datasets, or synthetic data with a justification for non-release, be required to meet the Open Source AI standard.
  • @stefano added some reminders as we reopen this debate. These are the points to keep in mind:
    • Abandon the mental map that makes you look for the source of AI (or ML) as that map has been driving us in circles. Instead, we’re looking for the “preferred form to make modifications to the system”
    • Laws in most jurisdictions around the world make it illegal to distribute data, because of copyright, privacy and other rules. It’s also not clear how the law treats datasets, and this is constantly changing. 
    • The text of draft v0.0.8 is deliberately vague regarding “Data information”, so that it can resist the test of time and technology changes. 
    • When criticizing the draft, please provide specific examples in your question, and avoid arguing in the abstract. 
  • @danish_contractor argues that the current draft is likely to disincentivize openness, because models that include usage restrictions to prevent harm (like BLOOM or StarCoder) would be viewed less favorably by the community, despite being more transparent and reproducible, and thus more “open”, than models like Mistral.
  • @Pam Chestek clarified that Open Source has two angles: the rights to use, study, modify and share, coupled with those rights being unrestricted. Both are equally important.
  • This debate echoes earlier ones on recognizing open components of an AI system.
The FAQ page has been updated
  • The FAQ page is starting to take shape and we would appreciate more feedback. So far, we have preliminary answers to these questions:
    • Why is the original training dataset not required?
    • Why are the freedoms granted to the system’s users?
    • What are the model parameters?
    • Are model parameters copyrightable?
    • What does “Available under OSD-compliant license” mean?
    • What does “Available under OSD-conformant terms” mean?
    • Why does the Open Source AI Definition include a list of components while the Open Source Definition for software doesn’t say anything about documentation, roadmap and other useful things?
    • Why is there no mention of safety and risk limitations in the Open Source AI Definition?
Draft v0.0.8 Review from LLM360
  • @vamiller has submitted, on behalf of the LLM360 team, a review of their models. In his view, draft v0.0.8 reflects the principles of Open Source applied to AI. He asks about the ODC-By license, arguing that it is compatible with OSI’s principles but is a data-only license.
Join the next town hall meeting
  • The next town hall meeting will take place on May 31st, from 3:00 pm to 4:00 pm UTC. We encourage all who can participate to attend. This week, we will delve deeper into the issues regarding access (or not) to training data.
Categories: FLOSS Research

Exploring openness in AI: Insights from the Columbia Convening

Thu, 2024-05-23 08:00

Over the past year, a robust debate has emerged regarding the benefits and risks of open sourcing foundation models in AI. This discussion has often been characterized by high-level generalities or narrow focuses on specific technical attributes. One of the key challenges—one that the OSI community is addressing head on—is defining Open Source within the context of foundation models. 

A new framework is proposed to help inform practical and nuanced decisions about the openness of AI systems, including foundation models. The recent proceedings from the Columbia Convening on Openness in Artificial Intelligence, made available for the first time this week, are a welcome addition to the process.

The Columbia Convening brought together experts and stakeholders to discuss the complexities and nuances of openness in AI. The goal was not to define Open Source AI but to illuminate the multifaceted nature of the issue. The proceedings reflect the February conversations and are based on the backgrounder text developed collaboratively with the working group.

One of the significant contributions of these proceedings is the framework for understanding openness across the AI stack. The framework summarizes previous work on the topic, analyzes the various reasons for pursuing openness, and outlines how openness varies in different parts of the AI stack, both at the model and system levels. This approach provides a common descriptive framework to deepen a more nuanced and rigorous understanding of openness in AI. It also aims to enable further work around definitions of openness and safety in AI.

The proceedings emphasize the importance of recognizing safety safeguards, licenses, and documents as attributes rather than components of the AI stack. This evolution from a model stack to a system stack underscores the dynamic nature of the AI field and the need for adaptable frameworks.

These proceedings are set to be released in time for the upcoming AI Safety Summit in South Korea. This timely release will help maintain momentum ahead of further discussions on openness at the French summit in 2025.

We’re happy to see like-minded individuals collaborating to discuss and solve the varied problems associated with openness in AI.

Categories: FLOSS Research

Open Source AI Definition – Weekly update May 20

Mon, 2024-05-20 10:43

A week loaded with important questions.

Overarching concerns with Draft v0.0.8 and suggested modifications

A post signed by the AWS Open Source team raised important questions, illustrating a disagreement on the concept of “Data information.”

  • A detailed post signed by the AWS Open Source team raises concerns about the draft concept of Data information in v0.0.8 and other important topics. I suggest reading their post. The major points discussed this week are:
    • The discussion on training data is not settled. The AWS Open Source team argues that for an Open Source AI Definition to be effective, the data used to train the AI system must be included, similar to the requirement for source code in Open Source software. They say the current draft marks the inclusion of datasets as optional, undermining transparency and reproducibility.
    • Their suggestion: Use synthetic data where the inclusion of actual datasets poses legal or privacy risks.
      • Valentino Giudice takes issue with the phrase “for AI systems, data is the equivalent of source code,” stating that “equivalent” is used too liberally here. For trained models, the dataset isn’t necessary to understand the model’s operations, which are determined by architecture and frameworks.
        • Ferraioli disagrees, stating: “A trained model cannot be considered open source without the data, processing code, and training code. Comparing a trained model to a software binary, we don’t call binaries open source without the source code being available and licensed as open source.”
      • Zacchiroli adds that he supports the suggestion to use “high quality equivalent synthetic datasets” when the original data cannot be released. Although “equivalent” remains undefined and could create loopholes, this issue doesn’t make the OSAID any worse.
    • Other proposed modifications include:
      • Require Release of Dependent Datasets
        • Mandate the release of training, testing, validation, and benchmarking datasets under an open data license, or high-quality synthetic data if legal restrictions apply.
        • Update the “Data Information” section to make dataset release a requirement.
      • Prevent Restrictions on Outputs
        • Prohibit restrictions on the use, modification, or distribution of outputs generated by Open Source AI systems.
      • Eliminate Optional Components
        • Remove optional components from the OSAID to maintain a high standard of openness and transparency.
      • Address Combinatorial Ambiguity
        • Ensure any license applied to the distribution of multiple components in an Open Source AI system is OSD-approved.
Why and how to certify Open Source AI
  • The post from the AWS team contained a comment about a certification process for Open Source AI that deserves a separate thread. There are pending questions to be answered:
    • who exactly needs a certification that an AI system is Open Source AI?
    • who is going to use such a certification? Are any of the groups deploying open foundation models today thinking they could use one? For what purpose?
    • who is going to consume the information carried by the certification, why and how?
  • Zacchiroli adds that the need for certifying AI systems as OSAID compliant arises from inherent ambiguities in the definitions, such as terms like “sufficiently” and “high quality equivalent synthetic dataset.” Disagreements on compliance will require a judging authority, akin to OSI for the OSD. While managing judgments for OSAID might be more complex due to the potential volume, the community is likely to turn to OSI for such decisions.
Can a derivative of non-open-source AI be considered Open Source AI?
  • This question was asked on the draft document and moved to the forum for higher visibility. Is it technically possible to fine-tune a model without knowing the details of its initial training? Are there examples of successfully fine-tuned AI/ML systems where the initial training data and techniques were unknown but the fine-tuning data and methods were fully disclosed?
    • Shuji Sado added that fine-tuning typically involves updating the weights of newly added layers and some layers of the pre-trained model, but not all layers, to maintain the benefits of pre-training.
    • Valentino Giudice raised concerns over this point, as multiple strategies for fine-tuning exist, allowing flexibility to update weights in any number of existing layers without necessarily adding new ones. Even updating the entire network can be beneficial, as it leverages the pre-trained model’s information and can be more efficient than training a new model from scratch. Fine-tuning can slightly adjust the model’s performance or behavior, integrating new data effectively. A minimal sketch of both strategies appears below.
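As a rough illustration of the two strategies discussed above, here is a minimal PyTorch sketch with a hypothetical stand-in network (not any model from the thread): freezing pre-trained layers to train only a new head, versus updating the whole network at a smaller learning rate.

```python
# Two common fine-tuning strategies on a toy stand-in for a pre-trained model.
import torch.nn as nn
from torch.optim import AdamW

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),  # "pre-trained" layers (indices 0-1)
    nn.Linear(64, 32), nn.ReLU(),   # "pre-trained" layers (indices 2-3)
    nn.Linear(32, 10),              # newly added task head (index 4)
)

# Strategy 1: freeze the pre-trained layers and update only the new head,
# preserving the benefits of pre-training (Shuji Sado's description).
for idx in (0, 2):
    for param in model[idx].parameters():
        param.requires_grad = False
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable, lr=1e-3)

# Strategy 2: full fine-tuning, updating every layer. This still leverages
# the pre-trained weights and can be more efficient than training from
# scratch (Valentino Giudice's point); a smaller learning rate is typical.
for param in model.parameters():
    param.requires_grad = True
optimizer = AdamW(model.parameters(), lr=1e-5)
```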

Please, especially if you are knowledgeable in this field, we would love to hear more thoughts!

Categories: FLOSS Research
