FLOSS Research

Open Source AI Definition – Weekly update June 24

Open Source Initiative - Mon, 2024-06-24 15:36
Explaining the concept of Data information

Following @stefano’s publication regarding why the OSI considers training data to be “optional” under the checklist in the Open Source AI Definition, the debate has continued. Here are the main points:

  • Preferred Form of Modification
    • @hartmans states that agreeing on the meaning of “preferred form of modification” depends on the user’s objectives. The disagreement may stem from different priorities in ranking the freedoms associated with Open Source AI, though he emphasizes prioritizing model weights for practical modifications. He suggested that data information could be more beneficial than raw data for understanding models and urged flexibility in AI definitions.
    • @shujisado highlighted that training data is a preferred form of modification for machine learning models but questioned whether it is the most preferred. He further emphasized the need for a flexible definition of the preferred form of modification in AI.
    • @quaid supported the idea of conducting controlled experiments to determine whether data information alone is sufficient to recreate AI models accurately. He suggested practical steps for testing the effectiveness of data information and encouraged community participation in such experiments.
      • @stefano added that some students at CMU will run this kind of experiment (testing whether the full training dataset is needed or whether data information is enough to recreate a model that can be checked for fidelity to the original) to validate the definition.
    • @jberkus raised concerns about the practical assessment of data information and its ability to facilitate the recreation of AI systems. He questioned how to evaluate data information without recreating the AI system.
  • Practical Applications and Community Insights
    • @hartmans proposed practical scenarios where data information could suffice for modifying AI models and suggested that the community’s flexibility in defining the preferred form of modification has been valuable for Debian.
    • @quaid shared insights from his research on the OpenVLA project, noting its compliance with OSAID requirements. He further proposed conducting controlled experiments to verify whether data information is enough to recreate models with fidelity.
  • General observations
    • @shujisado emphasized the need for flexible definitions in AI, drawing from open source community experiences. He agreed on the complexity of training data issues and supported OSI’s flexible approach to defining the preferred form of modification.
    • @quaid suggested practical approaches for evaluating data information and its adequacy for recreating AI models, and proposed further experiments and community involvement to refine the understanding and application of data information in open source AI.
Are we evaluating Licenses or Systems?
  • @jberkus asked whether OSAID will apply to licenses or systems, noting that current drafts focus on systems. He questioned if a certification program for reviewing systems as open source or proprietary is the intended direction.
  • @shujisado confirmed that discussions are moving towards certifying AI systems and pointed at an existing thread. He emphasized the need for evaluating individual components of AI systems and expressed concern about OSI’s capacity to establish a certification mechanism, highlighting that it would significantly expand OSI’s role.
Categories: FLOSS Research

Open Source AI Definition – Weekly update June 17

Open Source Initiative - Mon, 2024-06-17 12:52
Explaining the concept of Data information
  • After much debate regarding training data, @stefano published a summary of the positions expressed and some clarifications about the terminology included in draft v0.0.8. You can read the rationale and share your thoughts on the forum.
  • Initial thoughts:
    • @Senficon (Felix Reda) adds that while the discussion has highlighted the case for data information, it’s crucial to understand the implications of copyright law on AI, particularly concerning access to training data. Open Source software relies on a legal element (copyright licenses) and an access element (availability of source code). However, this framework does not seamlessly apply to AI, as different copyright regimes allow text and data mining (TDM) for AI training but not the redistribution of datasets. This discrepancy means that requiring the publication of training datasets would make Open Source AI models illegal, despite TDM exceptions that facilitate AI development. Also, public domain status is not consistent internationally, complicating the creation of legally publishable datasets. Consequently, a definition of Open Source AI that requires releasing datasets would impede collaborative improvements and limit practical significance. Emphasizing data information can help maintain Open Source principles without legal pitfalls.
Concerns and feedback on anchoring on the Model Openness Framework
  • @amcasari expresses concern about the usability and neutrality of the “Model Openness Framework” (MOF) for identifying AI systems, suggesting it doesn’t align well with current industry practices and isn’t ready for practical application without further feedback and iteration.
  • @shujisado points out that the MOF’s classification of components doesn’t depend on the specific IP laws applied, but rather on a general legal framework, and highlights that Japan’s IP law system differs from the US and EU, yet finds discussions based on the OSD consistent.
  • @stefano emphasizes the importance of having well-thought-out, timeless principles in the Open Source AI Definition document, while viewing the Checklist as a more frequently updated working document. He also supports the call to see practical examples of the framework in use and proposes separating the Checklist from the main document to reduce confusion.
Initial Report on Definition Validation
  • Reviews of eleven different AI systems have been published. We do these reviews to check existing systems’ compatibility with our current draft definition. These are the systems in question: Arctic, BLOOM, Falcon, Grok, Llama 2, Mistral, OLMo, OpenCV, Phi-2, Pythia, and T5.
    • @mer has set up a review sheet for the Viking model upon request from @merlijn-sebrechts.
    • @anatta8538 asks if MLOps is considered within the topic of the Model Openness Framework and whether CLIP, an LMM, would be consistent with the OSAID.
    • @nick clarifies that the evaluation focuses on components as described in the Model Openness Framework, which includes development and deployment aspects but does not cover MLOps as a whole.
Why and how to certify Open Source AI
  • @Alek_Tarkowski agrees that certification of Open Source AI will be crucial under the AI Act and highlights the importance of defining what constitutes an Open Source license. He points out the confusion surrounding terms like “free and open source license” and suggests that the issue of responsible AI licensing as a form of Open Source licensing needs resolution. He notes that some restrictive licenses are gaining traction and may need consideration for exemption from regulation, urging consensus.
Open Source AI Definition Town Hall – June 14, 2024

Slides and the recording of our previous townhall meeting can be found here.

Categories: FLOSS Research

Explaining the concept of Data information

Open Source Initiative - Fri, 2024-06-14 09:53

There seems to be some confusion caused by the concept of Data information included in the draft v0.0.8 of the Open Source AI Definition. Some readers may have seen the original dataset included in the list of optional components and quickly jumped to the wrong conclusions. This post clarifies how the draft arrived at its current state, the design principles behind the Data information concept and the constraints (legal and technical) it operates under.

The objective of the Open Source AI Definition

The objective of the Open Source AI Definition is to replicate in the context of artificial intelligence (AI) the principles of autonomy, transparency, frictionless reuse, and collaborative improvement for end users and developers of AI systems. These are described in the preamble.

Following the preamble is the definition of Open Source AI, an adaptation of the definition of Free Software (also known as “the four freedoms”) to AI nomenclature. The preamble and the four freedoms have been co-designed over several meetings and public discussions, online and in-person, and have not recently received significant comments. 

The Free Software definition specifies that a precondition to the freedom to study and modify a program is to have access to the source code. Source code is defined as “the preferred form of the program for making changes in.” Draft v0.0.8 contains a description of what’s necessary to enjoy the freedoms to study and modify an AI system. This new section titled Preferred form to make modifications to machine-learning systems has generated a heated debate. 

What is the preferred form to make modifications

The concept of “preferred form to make modifications” focuses on machine learning systems because these systems require data and training to produce a working system. Other AI systems are more easily classifiable as software and don’t require a special definition. 

The system analysis phase of the co-design process revealed that studying and modifying machine learning systems requires data, code for training and inference and model parameters. For the parameters, there’s no ambiguity: an Open Source AI must make them available under terms that respect the Open Source principles (no field-of-use restrictions, no discrimination against people, etc). For the data and code requirements, the text in the “preferred form to make modifications” section is longer and harder to parse, generating some confusion. 

The intent of the code and data requirements is to ensure that end users, deployers and developers of an Open Source AI system have all the tools and instructions to recreate that AI system from scratch, satisfying the freedoms to study and modify the system. At a high level, it makes sense to suggest that training datasets must be released under permissive licenses for a system to be Open Source AI.

However, on close examination, it became clear that sharing the original datasets is full of traps. It actually puts Open Source at a disadvantage compared to opaque and proprietary AI systems.

The issue with data

Data is not software: The legal landscape for data is much wider than copyright. Aggregating large datasets and distributing them internationally is an endless nightmare that includes privacy laws, copyright, sui-generis rights, patents, secrets and more. Without diving deeper into legal issues, let’s focus on practical examples to clarify why the distribution of the training dataset is not spelled out as a requirement in the concept of Data information.

  • The Pile, the open dataset used to train the very open Pythia models, was taken down after an alleged copyright infringement, currently being litigated in the United States. However, the Pile appears to be legal to share in Japan. It’s also unclear whether it can be legally shared in the European Union. 
  • DOLMA, the open dataset used to train the very open OLMo models, was initially released with a restrictive license. It later switched to a permissive one. On further inspection, DOLMA appears to suffer from the same legal uncertainties of the Pile, however the Allen Institute has not been sued yet.
  • Training techniques that preserve privacy like federated learning don’t create datasets. 

All these cases show that requiring the original datasets creates vagueness and uncertainty in applying the Open Source AI Definition:

  • If a dataset is only legal in Japan, is that AI Open Source only in Japan?
  • If a dataset is initially legally available but later retracted, does the AI go from being Open Source to not?
    • If so, what happens to the applications that use such AI?
  • If no dataset is created, then will any AI trained with such techniques ever be Open Source?

Additionally, there are reasons to believe that OpenAI, Anthropic and other proprietary vendors have trained their systems on the same questionable data contained in the Pile and DOLMA; proving that, however, is much harder and more expensive. This is clearly a disincentive to being open and transparent about data sources, adding a burden to the organizations that try to do the right thing.

To resolve these questions, draft v0.0.8 introduces the concept of Data information, coupled with code requirements, to obtain the expected result: enabling end users, developers and deployers of AI systems to reproduce an Open Source AI.

Understanding the concept of Data Information

Data information, in the draft Open Source AI Definition, is defined as: 

Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.

Read that from the end: the intention of Data information is to allow developers to recreate a substantially equivalent system using the same or similar data. That means that an Open Source AI must disclose all the ingredients, where they were bought, and all the instructions to prepare the dish.

This is a solution that came out of the co-design process, where reviewers didn’t rank the training datasets as high as they ranked the training code and data transparency requirements. 

Data information and the code requirements also address all of the questions around the legality of distributing data and datasets, or their absence.

If a dataset is only legal in Japan or becomes illegal later, one should still be able to recreate a dataset suitable to train an equivalent system replacing the illegal or unavailable pieces with similar ones.

AI systems trained with federated learning (where a dataset isn’t created) can still be Open Source AI if all instructions and code are released so that a new training with different data can generate an equivalent system.
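To see why no central dataset ever exists in this setting, here is a minimal sketch of federated averaging on a toy linear model, assuming hypothetical clients and data invented purely for illustration: each client trains on data the server never sees, and only the updated weights are aggregated.

```python
import numpy as np

def local_update(w, X, y, lr=0.01):
    # One gradient-descent step on a linear model, computed on private
    # data (X, y) that never leaves the client.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def federated_round(w_global, clients):
    # Clients return updated weights, not data; the server only averages.
    updates = [local_update(w_global, X, y) for X, y in clients]
    return np.mean(updates, axis=0)

# Hypothetical setup: three clients, each holding private local data.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(3)
for _ in range(100):
    w = federated_round(w, clients)
```

Releasing this kind of training code and its configuration is exactly what would let someone retrain an equivalent system on different data.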

The Data information concept also solves an example (raised on the forum) of an AI system trained on data licensed directly from Reddit. In this case, if the original developers released enough information to allow another AI developer to recreate a substantially equivalent system with Reddit data taken from an existing dataset, like CommonCrawl, it would be considered Open Source AI.

The proposed alternatives

While generally well received, draft v0.0.8 has been criticized by a few people on the forum for putting the training dataset in the “optional requirements”. Some suggestions and pushback we’ve received:

  • Require the use of synthetic data when the training dataset cannot be legally shared: This technique may work in some corner cases, if the technology evolves to be reliable enough. It’s expensive and untested at scale.
  • Classify as Open Source AI only those systems whose components are all “open source”: This approach disregards the longstanding practice of the GNU project of accepting system library exceptions and other compromises in exchange for more Open Source tools.
  • Datasets built by crawling the internet are the equivalent of theft and shouldn’t be allowed at all, let alone in Open Source AI: This pushback ignores the reality that large data aggregators have already legally acquired the rights to accumulate that same data (through scraping and terms of use) and are trading it, exclusively capturing the economic value of what should be in the commons. Read Towards a Books Data Commons for AI Training for more details. There is no general agreement that text and data mining is equivalent to theft.

These demands and suggestions are hard to accept. We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones. We need a Definition that contains positive examples from the start so we can practically demonstrate positive qualities to policymakers. 

The discussion about data, and how to create incentives for building datasets that can be distributed internationally, safely and with privacy preserved, is extremely complex. It can be addressed separately from the Open Source AI Definition. In collaboration with the Open Future Foundation and others, OSI is designing a series of conferences to tackle the data governance issue. We’ll make an announcement soon.

Have your say now

The concept of Data information and code requirements is hard to grasp at first. But the preliminary results of the validation phase confirm that the draft v0.0.8 works as expected: Pythia and OLMo both would be Open Source AI, while Falcon, Grok, Llama, Mistral would not (even if they used OSD-compatible licenses) because they don’t share Data information. BLOOM and StarCoder would fail because of field-of-use restrictions in their models.

Data information can be improved but it’s better than other solutions proposed so far. As we get closer to the release of the stable version of the Open Source AI Definition, we need to hear from you: If you support this concept please comment on the forum today. If you don’t support it, please try to propose an alternative that at least covers the practical examples of Pile, DOLMA and federated learning above. Help the community move the conversation forward.

Continue the conversation in the forum

Categories: FLOSS Research

Open Source AI Definition – Weekly update June 10

Open Source Initiative - Tue, 2024-06-11 17:40
Open Source AI needs to require data to be viable
  • With many different discussions happening at once, here are the main points:
    • On the issue of training data
      • @mark is concerned that openness of AI is not meaningful without a focus on the training data: “Model weights are the most inscrutable component of current generative AI, and providers that release only [the weights] should not get a free ‘openness’ pass.”
      • @stefano agrees with all of that but questions the criteria used to assign green marks in Mark’s paper, pointing out inconsistencies. He uses the example of Pythia-Chat-Base-7, which relies on a dataset from OpenDataHub with potential issues like non-versioned data and stale links, failing to meet the stringent requirements proposed by @juliaferraioli. Similar concerns are raised for other models like OLMo 7B Instruct, which lacks specific data versioning details. Maffulli also highlights the case of Pythia-7B, which may once have been compliant but is now problematic due to the unavailability of its foundational dataset, the Pile, illustrating the complexity of maintaining an “open source” status over time if the stringent proposal from @juliaferraioli and the AWS team is adopted.
      • @shujisado adds that while he sympathizes with @juliaferraioli‘s request for datasets, @stefano‘s arguments in support of the concept of “Data information” are aligned with the OSI principles and are reasonable.
      • @spotaws stresses that “data information” alone is insufficient if the data itself is too vague.
      • @juliaferraioli adds that while replicating AI systems like OLMo or Pythia may seem impractical due to costs and statistical nature, the capability is crucial for broader adoption and consistency.  She finds the current definition to be unclear and subjective.
      • @zack recommends reviewing StarCoder2, recognizing that it would fall in the same category as BLOOM: a system with lots of transparency and a dataset made available, but released under a restrictive license.
      • @Ezequiel_Lanza joined the conversation in support of the concept of Data information, arguing on technical grounds that “sharing the dataset is not necessarily required and may not justify the potential risks associated with making it mandatory.”
    • Partially open / restrictive licenses
      • Continuing @mark’s points regarding restrictive licenses (like the ethical licenses), @stefano has added a link to an article highlighting some reasons why OSI is staying away from these licenses.
      • @pchestek further adds that a partially open license would create even more opportunities for open washing, as “open source AI” could have many meanings.
      • @mark clarified that rather than proposing a variety of meanings, they are seeking to highlight the dimensions of openness in their paper, exploring the broader landscape.
      • @stefano adds that in the 26 years of OSI, it has contended with numerous organizations claiming varying degrees of openness as “open source.” This issue is now mirrored in AI, as companies seek the market value of being labeled Open Source. Open Source is binary: either users have full rights or they don’t, and any system that falls short is not Open Source AI, regardless of how “almost” open it is.
    • Field of use/restriction
      • @juliaferraioli believes that the OSAID should include prohibitions against field-of-use restrictions.
      • @shujisado adds that the OSAID specifies four freedoms as requirements for being considered open source, and that these should be understood as equivalent, since “freedom” implies “non-restricted.” The 10 clauses of the OSD have been replaced by the checklist in draft v0.0.8.
      • @juliaferraioli adds that individual components may be covered by their individual licenses, but the overall system may be subject to additional terms, which is why this needs to be explicit.
Initial Report on Definition Validation
  • @Mer has posted an update on how far the system analysis has progressed against the current draft definition. Points that remain incomplete have been highlighted.
  • Mistral (Mixtral 8x7B) is considered not in alignment with the OSAID because its data pre-processing code is not released under an OSI-approved license.
Can a derivative of non-open-source AI be considered Open Source AI?
  • @tarek_ziade shares his experience fine-tuning a “small” model (200M parameters) for a Firefox feature to describe images, using a base model for image encoding and text decoding. Despite not having 100% traceability of upstream data, Tarek argues that intentional fine-tuning and transparency make the new fine-tuned model open source. Any issues arising from downstream data can be addressed by the project maintainers, maintaining the model’s open source status.
Town hall recording out
  • We held our 10th town hall meeting a week and a half ago. You can access the recording here if you missed it.
  • A new town hall meeting is scheduled for this Friday, June 14.
Categories: FLOSS Research

Contributions of Open Source to AI: a panel discussion at CPDP-ai conference

Open Source Initiative - Tue, 2024-06-04 05:00

I participated as a panelist at the CPDP-ai 2024 conference in Brussels last week where we discussed the significant contributions of Open Source to AI and highlighted the specific properties that differentiate Open Source AI from proprietary solutions. Representing the Open Source Initiative (OSI), the globally recognized non-profit that defines the term Open Source, I emphasized the longstanding principle of granting users full agency and control over technology, which has been proven to deliver extensive social benefits.

Below is a glimpse at the questions and answers posed to me and my fellow panelists:

Question: Stefano, please explain what the contribution to AI from Open Source is, and if there are specific properties of Open Source AI that make a difference for the users and for the people who are confronted with its results.

Response: The Definition of Open Source Software has existed for over 25 years; no equivalent yet exists for AI. The Open Source Definition for software provides a stable north star for all participants in the digital ecosystem, from small and large companies to citizens and governments.

The basic principle of the Open Source Definition is to grant to the users of any technology full agency and control over the technology itself. This means that users of Open Source technologies have self-sovereignty of the technical solutions.

The Open Source Definition has demonstrated that massive social benefits accrue when you remove the barriers to learning, using, sharing and improving software systems. There is ample evidence that giving users agency, control and self-sovereignty of their technical choices produces a viable ecosystem based on permissionless innovation. Multiple studies by the EU Commission and Harvard researchers have assigned significant economic value to Open Source Software, all based on that single, clear, understood and approved Definition from 26 years ago.

For AI, and especially the most recent machine learning solutions, it’s less clear how society can maintain self-sovereignty of the technology and how to achieve permissionless innovation. Despite the fact that many people talk about Open Source AI, including the AI Act, there is no shared understanding of what that means, yet!

The Open Source Initiative is concluding a global, multi-stakeholder co-design process to find an unequivocal definition of Open Source AI, and we’re heading towards the conclusion of this process with a vastly increased knowledge of the AI machine learning space. The current draft of the Open Source AI Definition recognizes that in order to study, use, share and modify AI, one needs to refer to an AI system, not a single individual component. The global process has identified the components required for society to maintain control of the technology and these are: 

  • Detailed information about the dataset used to train the system and the code so that a skilled person can train a system with similar capabilities
  • All the libraries and tools used to run training and inference
  • The model architecture and the parameters, like weights and biases

Having unrestricted access to all these elements is what makes an AI an Open Source AI.

We’re in the final stretch of the process, starting to gather support for the current draft of the definition.

The most controversial part of the discussion is the role of data in the training. To answer your question about the power of big foreign tech companies, putting aside the hardware requirements, the data is where the fight is. There seem to be two views of the world on data when it comes to AI: One thinks that text and data mining is basically strip mining humanity and all accumulation of data without consent of the rights holders must be made illegal. Another view of the world is that text and data mining for the purpose of training Open Source AI is probably the only antidote to the superpowers of large corporations. These camps haven’t found a common position yet. Japan seems to have made up its mind already, legalizing unrestricted text and data mining. We’ll see where the lawsuits in the US will go, if they ever get to a decision in court or, as I suspect, they will be settled out of court. 

In any case, data, competence and to some extent hardware, are the levers to control the development of AI. 

Open Source has been leveling the playing field of technologies. We know from past experience with Open Source software that giving people unrestricted access to the means of digital production enables tremendous economic value. This worked in Europe as well as in China. We think that Open Source AI can have the same effect of generating value while leaving control of the technology in the hands of society.

Question: Big tech companies are important for the development of AI. Apart from the purely technological impacts, there is also economic importance. The European Commission has been very concerned about the Digital Single Market recently, and has initiated legislation such as DSA and DMA to improve competition and market access. Will these instruments be sufficient in view of AI roll-out, thinking also of the recently adopted AI Act? Or will additional attention need to be paid?

Response: Open is the best antidote to the concentration of power. That said, I see these laws as the sticks, very necessary. I’d love us to think also about carrots. We don’t want to repeat the mistakes of the early years of the internet. Open Source software was equally available in the US and Europe, but despite that, the few European champions of Open Source haven’t grown big enough to have a global impact. And some of the biggest EU companies aren’t exactly friendly to Open Source either.

Chinese companies have taken a different approach. But in Europe we have talent, and we have an attractive quality of life, so we can attract even more talent. Finding money is never an issue. We need to remove the disincentives to growing our companies bigger, widen access to the internal EU market and support their international expansion, too.

For example, we need to review European Regulation 1025 on standardization to accommodate Open Source. Regulation 1025 was written at a time when Open Source was considered a “business model” and information and communication technology standards were about voltages in a wire. Today, Open Source represents between 80% and 90% of all software, and “digital elements” comprise some part of every modern product. Even hardware solutions are dominated by “digital elements.” As such, the approach taken by Regulation 1025 is out of date and most likely needs a root-and-branch rethink to properly apply to the world today and the world we anticipate tomorrow.

We need to make sure that the standardization rules required by the Cyber Resilience Act are written together with Open Source champions so the rules don’t favor exclusively the cartel of European patent holders who try to seek rent instead of innovating. Europe has all the means to be at the center of AI innovation; It embodies the right values of diversity and collaboration. 

Closing remarks: We think that Open Source is the best antidote to fight market concentration in AI. Data is where the concentration of power is happening now and it’s in the hands of massive corporations: not only Google, Meta, Amazon, Reddit but also Sony, Warner, Netflix, Getty Images, Adobe … All these companies have already gained access to massive amounts of data, legally. These companies basically own our data, legally: Our pictures, the graph of our circles of friends, all the books and movies… 

There is a risk that if we don’t write policies that allow text and data mining in exchange for real Open Source AI (AI that society can fully control), we will leave the most powerful AI systems in the hands of the oligopoly that can afford to trade money for access to data.

Categories: FLOSS Research

Open Source AI Definition – Weekly update June 3

Open Source Initiative - Mon, 2024-06-03 14:27
Initial report on definition validation
  • A first draft of the report on the validation phase has been published. The validation phase is designed to review the compatibility of existing systems with the current draft definition. These are the systems in question: Arctic, BLOOM, Falcon, Grok, Llama 2, Mistral, OLMo, OpenCV, Phi-2, Pythia, and T5.
  • Problems and initial findings:
    • Elusive documents: Not having system creators involved meant reviewers had to independently search for legal documents, resulting in many blanks in the document list and subsequent analysis.
    • One component, many artifacts, and documents: Some components were linked to multiple artifacts and documents, complicating the review process as source code and documentation could be spread across several repositories and reports.
    • Compounded components: Components in the checklist often combined multiple artifacts, such as training and validation code, making it difficult to track down specific legal documents.
    • Compliant? Conformant? Six out of eleven required components need a legal framework that is “compliant” or “conformant” with the Open Source Definition, prompting a need for clearer guidance on reviewing non-software components.
    • Reverting to the license: Reviewers suggested simplifying the process by relying on whether a legal document is OSI-approved, conformant, or compliant to guarantee the right to use, study, modify, and share the component, eliminating the need for independent assessment.
  • Next steps:
    • As we look to fill the gaps noted above, we call on both system creators and independent volunteers to complete various system reviews.
    • If a system you’re familiar with is not on the list, contact Mer on the forum.
  • Initial questions and queries:
    • @jasonbrooks asks if the validation process should check if there’s “sufficiently detailed information about the data used to train the system so a skilled person can recreate a substantially equivalent system.” It’s unclear if this has been confirmed, and examples of skilled individuals achieving this would be helpful.
      • @stefano replies that the Preferred form lists enduring principles, while the Checklist details required components. Validation ensures components like training methodologies and data provenance are available, enabling system recreation. Mer’s report highlights the difficulty in finding these components, suggesting a need for a better method. One idea is a detailed survey for AI developers, though companies like Meta might misuse the “Open Source” label. Public pressure may eventually deter such abuses.
    • @amcasari adds insights into the process of reviewing licenses.
Open Source AI needs to require data to be viable 
  • This week, the conversation shifted heavily toward the possibilities of creating a gradient approach to open licensing.
  • @mark has shared that he is publishing a paper regarding open washing, the AI Act, and a case for a gradient notion of openness.
    • In line with previous points mostly raised by @danish_contractor, Mark highlights the RAIL licenses and argues that they should count towards openness too, stating that “I think providers and users of LLMs should not be free to create oil spills in our information landscape and I think RAIL provides useful guardrails for that.”
    • They also present their visualization of the degrees of openness of different systems 
  • @stefano has reiterated that the Open Source AI Definition will remain binary, just like the Open Source Definition is binary. Responding to @mark and @danish_contractor, he linked to Kate Downing’s legal analysis of the RAIL licensing framework.
Can a derivative of non-open-source AI be considered Open Source AI? 
  • Answering @stefano’s earlier questions, @mark adds that it’s challenging to fine-tune a model without knowing the initial training data and techniques. Examples like Meta and Mistral fine-tunes show success despite the lack of transparency in the original training data. Intel’s Neural 7B and AllenAI’s Tulu 70B demonstrate effective fine-tuning with detailed disclosure of fine-tuning steps and data. However, these efforts can’t qualify as truly open AI systems due to the closed nature of the base models and potential legal liabilities.
  • @stefano closed the topic stating that, based on feedback, “Derivatives of non-Open Source AI cannot be Open Source AI”
Why and how to certify Open Source AI
  • @amscott added that AI developers will likely self-certify compliance with the OSAID, with objective certification needed for arbitration in nuanced cases. Like the OSD, the OSAID will mature through community practice. A simple self-certification tool could promote transparency and document good practices.
  • @mark added that the EU AI Act emphasizes “Open Source” systems, offering exemptions attractive to companies like Meta and Mistral. The AI Act requires disclosure templates overseen by an AI Office, potentially leading to intense lobbying efforts. If Open Source organizations influence regulation and certification, transparency may strengthen the Open Source ecosystem.
Question regarding the 0.0.8 definition 
  • Question from @Jennifer Ding regarding why “information” is a focus for the data category and not the code and model categories.
  • @Matt White adds that OSD-Conformant (in the checklist) should be defined somewhere.
    • He further adds (regarding Data information, under the checklist) that many “open” models withhold various forms of data, making it unreasonable to expect model producers to release all the information necessary for full replication of the data pipeline if data is not a required component of the definition.
  • @Michael Dolan adds that “the use of OSD-compliant and OSD-conformant without any definitions of either term is difficult to parse the meaning of” and suggests some solutions.
OSAID at PyCon US
  • Missing a recap of how we got to where we are now? OSI was present at PyCon US in Pittsburgh, where we held a workshop on our current definition and spoke with many knowledgeable stakeholders. You can read about it here.
Categories: FLOSS Research

OSI at PyCon US: engaging with AI practitioners and developers as we reach OSAID’s first release candidate

Open Source Initiative - Wed, 2024-05-29 08:00

As part of the Open Source AI Definition roadshow, and as we approach the first release candidate of the draft, the Open Source Initiative (OSI) participated in PyCon US 2024, the annual gathering of the Python community. This opportunity was important because PyCon US brings together AI practitioners and developers alike, and their input on what constitutes Open Source AI is highly valuable. The OSI organized a workshop and had a community booth there.

OSAID Workshop: compiling a FAQ to make the definition clear and easy to use

The OSI has embarked on a co-design process with multiple stakeholders to arrive at the Open Source AI Definition (OSAID). This process has been led by Mer Joyce, the co-design expert and facilitator, and Stefano Maffulli, the executive director of the OSI.

At the workshop organized at PyCon US, Mer provided an overview of the co-design process so far, summarized below.

The first step of the co-design process was to identify the freedoms needed for Open Source AI. After various online and in-person activities and discussions, including five workshops across the world, the community identified four freedoms:

  1. To Use the system for any purpose and without having to ask for permission.
  2. To Study how the system works and inspect its components.
  3. To Modify the system for any purpose, including to change its output.
  4. To Share the system for others to use with or without modifications, for any purpose.

The next step was to form four working groups to initially analyze four AI systems. To achieve better representation, special attention was given to diversity, equity and inclusion. Over 50% of the working group participants are people of color, 30% are black, 75% were born outside the US and 25% are women, trans and nonbinary.

These working groups discussed and voted on which AI system components should be required to satisfy the four freedoms for AI. The components we adopted are described in the Model Openness Framework developed by the Linux Foundation.

The vote compilation was based on the mean total votes per component (μ). Components that received over 2μ votes were marked as required, and those between 1.5μ and 2μ as likely required. Components that received between 0.5μ and μ were marked likely not required, and those with less than 0.5μ as not required.
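Expressed as code, that rule might look like the following sketch; the example vote counts are hypothetical, and the band between μ and 1.5μ, which the description above does not name, is labeled “unclassified” here.

```python
from statistics import mean

def classify_components(votes: dict[str, int]) -> dict[str, str]:
    """Apply the vote-compilation thresholds described in the co-design process."""
    mu = mean(votes.values())  # mean total votes per component (μ)
    labels = {}
    for component, v in votes.items():
        if v > 2 * mu:
            labels[component] = "required"
        elif v >= 1.5 * mu:
            labels[component] = "likely required"
        elif v > mu:
            labels[component] = "unclassified"  # band not named in the post
        elif v >= 0.5 * mu:
            labels[component] = "likely not required"
        else:
            labels[component] = "not required"
    return labels

# Purely illustrative vote counts, not the working groups' real data:
print(classify_components(
    {"training code": 42, "model parameters": 35, "datasets": 12, "benchmarks": 3}
))
```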

The working groups evaluated legal frameworks and legal documents for each component. Finally, each working group published a recommendation report. The end result is the OSAID with a comprehensive definition checklist encompassing a total of 17 components. More working groups are being formed to evaluate how well other AI systems align with the definition.

OSAID multi-stakeholder process: from component list to a definition checklist

After providing an overview of the co-design process, Mer went on to organize an exercise with the participants to compile a FAQ.

The questions raised at the workshop revolved around the following topics:

  • End user comprehension: how and why are AI systems different from Open Source software? As an end user, why should one care whether an AI system is open?
  • Datasets: Why is data itself not required? Should Open Source AI datasets be required to prove copyright compliance? How can one audit these systems for bias without the data? What do data provenance and data labeling entail?
  • Models: How can proper attribution of model parameters be enforced? What is the ownership/attribution of model parameters which were trained by one author and then “fine-tuned” by another?
  • Code: Can projects that include only source code (no data info or model weights) still use a regular Open Source license (MIT, Apache, etc.)?
  • Governance: For a specific AI, who determines whether the information provided about the training, dataset, process, etc. is “sufficient” and how?
  • Adoption of the OSAID: What are incentives for people/companies to adopt this standard?
  • Legal weight: Is the OSAID supposed to have legal weight?

These questions and answers raised at the workshop will be important for enhancing the existing FAQ, which will be made available along with the OSAID.

OSAID workshop: a collection of post-its with questions raised by participants.

Community Booth: gathering feedback on the “Unlock the OSAID” visualization

At the community booth, the OSI held two activities to draw in participants interested in Open Source AI. The first activity was a quiz developed by Ariel Jolo, program coordinator at the OSI, to assess participants’ knowledge of Python and AI/ML. Once we had an understanding of their skills, we went on to the second and main activity: gathering feedback on the OSAID using a novel way to visualize how different AI systems match the current draft definition, as described below.

Making it easy for different stakeholders to visualize whether or not an AI system matches the OSAID is a challenge, especially because there are so many components involved. This is where the visualization concept we named “Unlock the OSAID” came in. 

The OSI keyhole is a well-recognized logo that represents the source code that unlocks the freedoms to use, study, modify, and share software. With the “Unlock the OSAID” visualization, we played on that same idea, but now for AI systems. We displayed three keyholes representing the three domains these 17 components fall within: code, model and data information.

Here is the image representing the “code keyhole” with the required components to unlock the OSAID:

On the inner ring we have the required components to unlock the OSAID, while on the outer ring we have optional components. The required code components are: libraries and tools; inference; training, validation and testing; data pre-processing. The optional components are: inference for benchmarks, and evaluation code.

To fully unlock the OSAID, an AI system must have all the required components for code, model and data information. To better understand how the “Unlock the OSAID” visualization works, let’s look at two hypothetical AI systems: example 1 and example 2.

Let’s start looking at example 1 (in red) and see if this system unlocks the OSAID for code:

Example 1 only provides inference code, so the key (in red) doesn’t “fit” the code keyhole (in green).

Now let’s look at example 2 (in blue):

Example 2 provides all required components (and more), so the key (in blue) fits the code keyhole (in green). Therefore, example 2 unlocks the OSAID for code. For example 2 to be considered Open Source AI, it would also have to unlock the OSAID for model and data information: 

We received good feedback from participants about the “Unlock the OSAID” visualization. Once participants grasped the concept of the keyholes and which components were required or optional, it was easy to identify if an AI system unlocks the OSAID or not. They could visually see if the keys fit the keyholes or not. If all keys fit, then that AI system adheres to the OSAID.
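For readers who prefer code to keyholes, the check amounts to a set-containment test. Below is a minimal sketch for the code keyhole, using informal stand-ins for the checklist’s exact component names:

```python
# Required code components to unlock the OSAID, per the list above.
REQUIRED_CODE = {
    "libraries and tools",
    "inference",
    "training, validation and testing",
    "data pre-processing",
}

def unlocks_code_keyhole(provided: set[str]) -> bool:
    # The key fits only if every required component is present; optional
    # components (benchmark inference, evaluation code) don't affect it.
    return REQUIRED_CODE <= provided

example_1 = {"inference"}                        # the red key: doesn't fit
example_2 = REQUIRED_CODE | {"evaluation code"}  # the blue key: fits
print(unlocks_code_keyhole(example_1))  # False
print(unlocks_code_keyhole(example_2))  # True
```

The same containment test, with different required sets, applies to the model and data information keyholes.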

Final thoughts: engaging with the community and promoting Open Source principles

For me, the highlight of PyCon US was the opportunity to finally meet members of the OSI and the Python community in person, both new and old acquaintances. I had good conversations with Deb Nicholson (Python Software Foundation), Hannah Aubry (Fastly), Ana Hevesi (Uploop), Tom “spot” Callaway (AWS), Julia Ferraioli (AWS), Tony Kipkemboi (Streamlit), Michael Winser (Alpha-Omega), Jason C. MacDonald (OWASP), Cheuk Ting Ho (CMD Limes), Kamile Demir (Adobe), Mariatta Wijaya (PSF), Loren Clary (PSF) and Miaolai Zhou (AWS). I also interacted with many folks from the following communities: Python Brazil, Python en Español, PyLadies and Black Python Devs. It was also great to bump into legends like Seth Larson (PSF), Peter Wang (Anaconda) and Guido van Rossum.

I loved all the keynotes, in particular from Sumana Harihareswara about how she has improved Python Software Foundation’s infrastructure, and from Simon Willison about how we can all benefit from Open Source AI.

We also had a dinner hosted by Stefano to celebrate this special milestone of the OSAID, with Stefano, Mer and me overlooking Pittsburgh.

Overall, our participation at PyCon US was a success. We shared the work OSI has been doing toward the first release candidate of the Open Source AI Definition, and we did it in an entertaining and engaging way, with plenty of connection throughout.

Photo credits: Ana Hevesi, Mer Joyce, and Nick Vidal

Categories: FLOSS Research

Open Source AI Definition – Weekly update May 27

Open Source Initiative - Tue, 2024-05-28 04:41
Open Source AI needs to require data to be viable
  • @juliaferraioli and the AWS team have reopened the debate regarding access to training data. This comes in a new forum thread which mirrors concerns raised in a previous one. They argue that to achieve modifiability, an AI system must ship with the original training dataset used to train it. Full transparency and reproducibility require the release of all datasets used to train, validate, test, and benchmark. For Ferraioli, data is the equivalent of source code for AI systems, so its inclusion should not be optional. In a message signed by the AWS Open Source team, she proposed that original training datasets, or synthetic data with justification for non-release, be required to meet the Open Source AI standard.
  • @stefano added some reminders as we reopen this debate. These are the points to keep in mind:
    • Abandon the mental map that makes you look for the source of AI (or ML), as that map has been driving us in circles. Instead, we’re looking for the “preferred form to make modifications to the system.”
    • Laws in most jurisdictions around the world make it illegal to distribute data, because of copyright, privacy and other restrictions. It’s also unclear how the law treats datasets, and that treatment is constantly changing.
    • The text of draft v0.0.8 is intentionally vague regarding “Data information,” so that it can resist the test of time and technology changes.
    • When criticizing the draft, please provide specific examples in your question and avoid arguing in the abstract.
  • @danish_contractor argues that the current draft is likely to disincentivize openness, because the community views models like BLOOM or StarCoder, which include usage restrictions to prevent harm, less favorably despite being more transparent, reproducible, and thus more “open” than models like Mistral.
  • @Pam Chestek clarified that Open Source has two angles: the rights to use, study, modify and share, coupled with those rights being unrestricted. Both are equally important.
  • This debate echoes earlier ones on recognizing open components of an AI system.
The FAQ page has been updated
  • The FAQ page is starting to take shape and we would appreciate more feedback. So far, we have preliminary answers to these questions:
    • Why is the original training dataset not required?
    • Why is the grant of freedoms to its users?
    • What are the model parameters?
    • Are model parameters copyrightable?
    • What does “Available under OSD-compliant license” mean?
    • What does “Available under OSD-conformant terms” mean?
    • Why does the Open Source AI Definition include a list of components while the Open Source Definition for software doesn’t say anything about documentation, roadmap and other useful things?
    • Why is there no mention of safety and risk limitations in the Open Source AI Definition?
Draft v0.0.8 Review from LLM360
  • @vamiller has submitted, on behalf of the LLM360 team, a review of their models. In his view, draft v0.0.8 reflects the principles of Open Source applied to AI. He asks about the ODC-By license, arguing that it is compatible with OSI’s principles but is a data-only license.
Join the next town hall meeting
  • The next town hall meeting will take place on May 31st, 3:00 pm – 4:00 pm UTC. We encourage all who can to attend. This week, we will delve deeper into the issues regarding access (or not) to training data.
Categories: FLOSS Research

Exploring openness in AI: Insights from the Columbia Convening

Open Source Initiative - Thu, 2024-05-23 08:00

Over the past year, a robust debate has emerged regarding the benefits and risks of open sourcing foundation models in AI. This discussion has often been characterized by high-level generalities or narrow focuses on specific technical attributes. One of the key challenges—one that the OSI community is addressing head on—is defining Open Source within the context of foundation models. 

A new framework is proposed to help inform practical and nuanced decisions about the openness of AI systems, including foundation models. The recent proceedings from the Columbia Convening on Openness in Artificial Intelligence, made available for the first time this week, are a welcome addition to the process.

The Columbia Convening brought together experts and stakeholders to discuss the complexities and nuances of openness in AI. The goal was not to define Open Source AI but to illuminate the multifaceted nature of the issue. The proceedings reflect the February conversations and are based on the backgrounder text developed collaboratively with the working group.

One of the significant contributions of these proceedings is the framework for understanding openness across the AI stack. The framework summarizes previous work on the topic, analyzes the various reasons for pursuing openness, and outlines how openness varies in different parts of the AI stack, both at the model and system levels. This approach provides a common descriptive framework to deepen a more nuanced and rigorous understanding of openness in AI. It also aims to enable further work around definitions of openness and safety in AI.

The proceedings emphasize the importance of recognizing safety safeguards, licenses, and documents as attributes rather than components of the AI stack. This evolution from a model stack to a system stack underscores the dynamic nature of the AI field and the need for adaptable frameworks.

These proceedings are set to be released in time for the upcoming AI Safety Summit in South Korea. This timely release will help maintain momentum ahead of further discussions on openness at the French summit in 2025.

We’re happy to see like-minded individuals collaborating to discuss and solve the varied problems associated with openness in AI.

Categories: FLOSS Research

Open Source AI Definition – Weekly update May 20

Open Source Initiative - Mon, 2024-05-20 10:43

A week loaded with important questions.

Overarching concerns with Draft v.0.0.8 and suggested modifications

A post signed by the AWS Open Source team raised important questions, illustrating a disagreement on the concept of “Data information.”

  • A detailed post signed by the AWS Open Source team raises concerns about the draft concept of Data information in v0.0.8 and other important topics. I suggest reading their post. The major points discussed this week are:
    • The discussion on training data is not settled. The AWS Open Source team argues that for an Open Source AI Definition to be effective, the data used to train the AI system must be included, similar to the requirement for source code in Open Source software. They say the current draft marks the inclusion of datasets as optional, undermining transparency and reproducibility.
    • Their suggestion: use synthetic data where the inclusion of actual datasets poses legal or privacy risks.
      • Valentino Giudice takes issue with the phrase “For AI systems, data is the equivalent of source code,” stating that “equivalent” is used too liberally here. For trained models, the dataset isn’t necessary to understand the model’s operations, which are determined by architecture and frameworks.
        • Ferraioli disagrees, stating that “A trained model cannot be considered open source without the data, processing code, and training code. Comparing a trained model to a software binary, we don’t call binaries open source without the source code being available and licensed as open source.”
      • Zacchiroli adds that they support the suggestion to use “high quality equivalent synthetic datasets” when the original data cannot be released. Although “equivalent” remains undefined and could create loopholes, this issue doesn’t make the OSAID any worse.
    • Other proposed modifications include:
      • Require Release of Dependent Datasets
        • Mandate the release of training, testing, validation, and benchmarking datasets under an open data license, or high-quality synthetic data if legal restrictions apply.
        • Update the “Data Information” section to make dataset release a requirement.
      • Prevent Restrictions on Outputs
        • Prohibit restrictions on the use, modification, or distribution of outputs generated by Open Source AI systems.
      • Eliminate Optional Components
        • Remove optional components from the OSAID to maintain a high standard of openness and transparency.
      • Address Combinatorial Ambiguity
        • Ensure any license applied to the distribution of multiple components in an Open Source AI system is OSD-approved.
Why and how to certify Open Source AI
  • The post from the AWS team contained a comment about a certification process for Open Source AI that deserves a separate thread. There are pending questions to be answered:
    • who exactly needs a certification that an AI system is Open Source AI?
    • who is going to use such certification? Are any of the groups deploying open foundation models today thinking that they could use one? For what purpose?
    • who is going to consume the information carried by the certification, why and how?
  • Zacchiroli adds that the need for certifying AI systems as OSAID compliant arises from inherent ambiguities in the definitions, such as terms like “sufficiently” and “high quality equivalent synthetic dataset.” Disagreements on compliance will require a judging authority, akin to OSI for the OSD. While managing judgments for OSAID might be more complex due to the potential volume, the community is likely to turn to OSI for such decisions.
Can a derivative of non-open-source AI be considered Open Source AI?
  • This question was asked on the draft document and moved to the forum for higher visibility. Is it technically possible to fine-tune a model without knowing the details of its initial training? Are there examples of successfully fine-tuned AI/ML systems where the initial training data and techniques were unknown but the fine-tuning data and methods were fully disclosed?
    • Shuji Sado added that fine-tuning typically involves updating the weights of newly added layers and some layers of the pre-trained model, but not all layers, to maintain the benefits of pre-training.
    • Valentino Giudice raised concerns over this point, as multiple strategies for fine-tuning exist, allowing flexibility to update weights in any number of existing layers without necessarily adding new ones. Even updating the entire network can be beneficial, as it leverages the pre-trained model’s information and can be more efficient than training a new model from scratch. Fine-tuning can slightly adjust the model’s performance or behavior, integrating new data effectively (see the sketch after this list for the two extremes of this spectrum).
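To make those two extremes concrete, here is a minimal PyTorch sketch, assuming a hypothetical backbone and layer sizes chosen purely for illustration; it is not any specific project’s code.

```python
import torch.nn as nn

# Stand-in for a pre-trained backbone (illustrative sizes only).
pretrained = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)
head = nn.Linear(128, 10)  # newly added task-specific layer

# Strategy 1 (closer to what Shuji Sado describes): freeze the
# pre-trained layers and train only the newly added head.
for param in pretrained.parameters():
    param.requires_grad = False
model = nn.Sequential(pretrained, head)

# Strategy 2 (one of the alternatives Valentino Giudice notes): full
# fine-tuning — update every layer, still starting from the
# pre-trained weights rather than from scratch.
for param in model.parameters():
    param.requires_grad = True
```

Many intermediate strategies exist between these two, such as unfreezing only the top few pre-trained layers.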

If you are knowledgeable in this field, we would especially love to hear your thoughts!

Categories: FLOSS Research

Unveiling ClearlyDefined: this free SBOM service gets cleared for takeoff

Open Source Initiative - Thu, 2024-05-16 09:43

With all the buzz around SBOMs and Open Source supply chain compliance and security, a new revolution is igniting at ClearlyDefined. This amazing project has been flying under the radar since its inception six years ago, but now this free service and open source project from the Open Source Initiative (OSI) gets cleared for takeoff with the launch of a new website focused on stellar documentation, excellent engineering, and healthy community growth.

Generating SBOMs at scale for each stage of the supply chain, for every build or release, has proven to be a real challenge for organizations, and fixing the same missing or wrongly identified licensing metadata over and over again has been a redundant pain for everyone. This is where ClearlyDefined shines: it makes it really easy for organizations to fetch a cached copy of the licensing metadata for each component through a simple API, which is always up to date thanks to its crowdsourced database.
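
As a rough illustration of how such a lookup works, here is a minimal sketch in Python. The coordinate pattern (type/provider/namespace/name/revision, with “-” standing in for a missing namespace) and the example component are assumptions based on the project’s public documentation:

    import json
    import urllib.request

    # Hypothetical example: look up the npm package lodash at a given revision.
    coordinates = "npm/npmjs/-/lodash/4.17.21"
    url = f"https://api.clearlydefined.io/definitions/{coordinates}"

    with urllib.request.urlopen(url) as response:
        definition = json.load(response)

    # The "licensed" section carries the curated licensing metadata,
    # e.g. a declared SPDX expression such as "MIT".
    print(definition.get("licensed", {}).get("declared"))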

The all-new ClearlyDefined website was completely revamped to welcome community members and foster collaboration united by a shared vision of Open Source excellence. The website is divided into three sections: Docs, Resources, and Community.

Under Docs, both new and existing community members will find several comprehensive guides and tutorials. The main guide is “Getting involved,” where members will embark on a journey to learn how to use the data, curate the data, contribute data, contribute code, add a harvest and adopt practices. The “Roles” guide provides a detailed description of how different roles can master ClearlyDefined, from data consumer and data curator to data contributor and code contributor. Other guides that will expand in the coming months include the “Curation” and “Harvest” guides. Curation is the process of fixing or identifying missing licensing metadata and sharing that with the community, while harvest is the process of fetching licensing metadata directly from the source (package managers like npm and PyPI), processing the license definitions, and making them available through an API.

Under Resources, members will find a rich collection of content: Blog, FAQ, Glossary, Providers, Architecture and Roadmap. The roadmap was created in collaboration with members of the community, who provided input into what they would like to see in 2024 and how they would be able to contribute towards these goals.

Under Community, members will find links to various channels where they can engage with others online or in-person: GitHub, Forum, Events and Meetings. They’ll also find a list of other community members with whom they can forge connections, as well as the Code of Conduct and the project Charter.

We would like to extend a heartfelt thank you to our existing community members, who have been instrumental in the launch of the new website, and to welcome new ones who are learning about the project. Besides expanding the “Curation” and “Harvest” guides, next steps include enhancing the user experience by implementing sitewide search and adding case studies filled with rich media. Come and join the ClearlyDefined community here and get ready to take off together with us. Let’s define the future of Open Source, one definition at a time!

Categories: FLOSS Research

The Open Source AI Definition gets closer to reality with a global workshop series

Open Source Initiative - Wed, 2024-05-15 08:05

The OSI community is traveling to five continents seeking diverse input on how to guarantee the freedoms to use, study, share and modify Open Source AI systems.

SAN FRANCISCO – May 14, 2024 – Open Source Initiative (OSI), globally recognized by individuals, companies and public institutions as the authority that defines Open Source, is driving a global multi-stakeholder process to define “Open Source AI.” This definition will provide a framework to help AI developers and users determine if an AI system is Open Source or not, meaning that it’s available under terms that allow unrestricted rights to use, study, modify and share. There are currently no accepted means by which openness can be validated for AI, yet many organizations are claiming their AI to be “Open Source.” Just as the Open Source Definition serves as the globally accepted standard for Open Source software, so will the Open Source AI Definition act as a standard for openness in AI systems and their components.

In 2022 the OSI started an in-depth global initiative to engage key players, including corporations, academia, the legal community and organizations and nonprofits representing wider civil society, in a collaborative effort to draft a definition of Open Source AI that ensures that society at large can retain agency and control over the technology. The project has increased in importance as legislators around the world started regulating AI, asking for feedback as guardrails are defined.

This open process has resulted in a massive body of work including podcasts, panel discussions, webinars, published reports, and a plethora of town halls, workshops and conference sessions around the world. A big emphasis was placed on making the process as inclusive and representative as possible: 53% of the working groups were composed of people of color. Women and femmes, including transgender women, accounted for 28% of the total, and 63% of those individuals were women of color.

After months of weekly town hall meetings, draft releases and reviews the OSI is nearing a stable version of the Open Source AI Definition. Now, the OSI is embarking on a roadshow of workshops to be held on five continents to solicit input from diverse stakeholders on the draft definition. The goal is to present a stable version of the definition in October at the All Things Open event in Raleigh, North Carolina. This “Open Source AI Definition Roadshow” is sponsored by the Alfred P. Sloan Foundation, and OSI’s sponsors and donors.

“AI is different from regular software and forces all stakeholders to review how the Open Source principles apply to this space,” said Stefano Maffulli, executive director of the OSI. “OSI believes that everybody deserves to maintain agency and control of the technology. We also recognize that markets flourish when clear definitions promote transparency, collaboration and permissionless innovation. After spending almost two years gathering voices from all over the world to identify the principles of Open Source suitable for AI systems, we’re embarking on a worldwide roadshow to refine and validate the release candidate version of the Open Source AI Definition.”

The schedule of workshops is as follows: 

  • North America
  • Europe
  • Africa
    • Nigeria, Lagos, August (tentative)
  • Asia Pacific
    • Hong Kong, AI_Dev (August 23)
    • Asia – details TBD, DPGA members meeting (November 12 – 14)
  • Latin America
    • Argentina, Buenos Aires, Nerdearla (September 24 – 28)

For weekly updates, town hall recordings and access to all the previously published material, visit opensource.org/deepdive.

Supporters of the Open Source AI Definition Process

The Deep Dive: Defining Open Source AI co-design process is made possible thanks to grant 2024-22486 from Alfred P. Sloan Foundation, donations from Google Open Source, Cisco, Amazon and others, and donations by individual members. The media partner is OpenSource.net.
Others interested in offering support can contact OSI at sponsors@opensource.org.

Categories: FLOSS Research

Open Source AI Definition – Weekly update May 13

Open Source Initiative - Tue, 2024-05-14 11:08

Early thoughts on “Apple sample code license”?
  • Apple has released a license to distribute its new model, OpenELM. The license looks BSD/MIT-like, with the exclusion of patents. In your view, does it seem OSD compliant?
    • Initial thoughts:
    • @pchestek added that the license appears to be similar to open source but raises concerns about potential limitations on rights, particularly regarding patents. She highlighted Apple’s approach of granting only a copyright license, which might not be sufficient for ensuring all necessary freedoms, especially in the context of AI models.
    • @shujisado agreed, saying that the terms related to trademarks and patents need to be scrutinized
Question regarding the 0.0.8 version 
  • @Aspie96 asks clarifying questions regarding the list of open components and points out that, unlike “traditional” software, which can be released as open source just as easily as proprietary software, this definition seems to require a lot more components to be open.
    • Stefano points out that “The ‘classic’ Open Source Definition is applied to licenses, not to the software” and “if a program is shipped with a license approved by the OSI then the software is considered Open Source.”
    • He further states that “Through the co-design process of the Open Source AI Definition we learned that to use, study, share and modify an ML system one needs a complex combo of multiple components each following diverse legal regimes (not just the usual copyright+patents.) Therefore we must describe in more details what is required to grant users the agency and control expected.”
The FAQ page is being developed
  • The frequently asked questions page is starting to take form.
  • We are adding relevant questions that have arisen from the forums so far; if you have any contributions in mind, please leave a comment!
Open Source Initiative at PyCon!

This week, Stefano, Mer and the OSI team are visiting Pittsburgh, PA, hosting the first workshop of our Open Source AI Definition Roadshow! We are starting to get more in-person feedback on our draft definition.

If you are at PyCon, come visit us on the 17th from 11 a.m. to 1 p.m. in the Open Space area!

Categories: FLOSS Research

Why datasets built on public domain might not be enough for AI

Open Source Initiative - Tue, 2024-05-07 06:00

There is tension between copyright laws and the large datasets needed to train large language models. Common Corpus is a dataset that only uses text from copyright-expired sources to bypass the legal issues. It’s a useful achievement, paving the path to research without immediate risk of lawsuits. But I fear that this approach may lead to bad policies, reinforcing the power of copyright holders: not the small creators, but large corporations.

A dataset built on public domain sources

In March 2024 Common Corpus was released as an open access dataset for training large language models (LLMs). Announcing the release, lead developer Pierre-Carl Langlais said “Common Corpus shows it is possible to train fully open LLMs on sources without copyright concerns.” The dataset contains 500 billion words in multiple European languages and from different cultural heritages. It is a project coordinated by the French startup Pleias and supported by organizations committed to open science such as Occiglot, Eleuther AI and Nomic AI, as well as being partly funded by the French government. The stated intention of Common Corpus is to democratize access to large, high-quality datasets. It has many other positive characteristics, highlighted also by Open Future’s summary of a talk given by Langlais.

The commons needs more data

The debates sparked by the Deep Dive: AI process on the role of training data highlighted that AI practitioners encounter many obstacles assembling datasets. At the same time, we discovered that tech giants have an incredible advantage over researchers and startups. They’ve been slurping data for decades, have the financial means to go to court and can enter into bilateral agreements to license data. These strategies are inaccessible to small competitors and academics. Accepting that the only path to creating large open datasets suitable to train Open Source AI systems is to use sources in the public domain risks cementing the dominant positions of existing large corporations.

The open landscape already faces issues with big tech and their ability to influence legislation. The big corporations have lobbied to extend the duration of copyright, introduced the DMCA, are opposing the right to repair, and have the resources to continue lobbying and to sue any new entrant whom they deem to be getting too close. There are plenty of examples showing an unequal advantage in protecting what they think is theirs. The non-profit Fairly Trained certifies companies “willing to prove that they’ve trained their AI models on data that they own, have licensed, or that is in the public domain,” respecting copyright law: who’s going to benefit from this approach?

Unsuitable for public policies

Initiatives like Common Corpus and The Stack (used to train Starcoder2) are important achievements as they allow researchers to develop new AI systems while mitigating the risk of being sued. They also push the technical boundaries of what can be achieved with smaller datasets that don’t require a nuclear power plant to train new models. But I think they mask the underlying issue: AI needs data and limiting open datasets to only public domain sources will never give them a chance to match the size of the proprietary ones. The lobby for copyright maximalists is always looking for ways to expand scope and extend terms for copyright laws, and when they succeed it is a one-way ratchet. It would be a tragedy for society if legislators listened to their sophistry and made new laws doing this based on the apparent consensus that creators need protection from AI.

The role of data for training machine learning systems is a divisive topic and a complex one. Having datasets like Common Corpus is a very useful way for the science of AI to progress with better sources. For policies, we’d be better off pushing for something like the proposal advanced by Open Future and Creative Commons in their paper Towards a Books Data Commons for AI Training.

Categories: FLOSS Research

Open Source AI Definition – Weekly update May 6

Open Source Initiative - Mon, 2024-05-06 12:02
Definition validation: Seeking volunteers

The process has entered a new phase: we are now seeking volunteers to validate the Open Source AI Definition by using it to review existing AI systems. The objective of this phase is to confirm that the Definition works as intended and to understand where it fails.

  • A spreadsheet is provided where you locate and link to the license, research paper, or other document that grants rights or provides information for each required component.
  • Systems include, but are not limited to:
    • Arctic
    • BLOOM
    • Falcon
    • Grok
    • Llama 2
    • Mistral
    • OLMo
    • OpenCV
    • Phi-2
    • Pythia
    • T5
  • To volunteer by May 20th, please contact Mer on the forum
Summary of comments received on the Definition draft
  • Grammatical and wording corrections 
    • Some minor grammatical suggestions were made. These slightly change the wording and ordering of the layout, though the overall message remains the same.
    • One user suggested explaining what Open Source is under the “preamble” and “Why we need open source AI” sections. Instead of speaking about why Open Source is important, the section should rather be an introduction to what it is and why it matters for AI.
    • Under “Preferred form to make modifications to machine-learning systems” and “data information”, clarification is needed regarding “the training data set used”. It is not clear whether this means that all training data must be open source for the whole model to be considered Open Source.
      • Stefano Maffulli added here that the intention is to know what dataset was used, not to necessarily have it made available, and that it indeed seems to need clarification
  • Technical points
    • Under “Preferred form to make modifications to machine-learning systems” the release of checkpoints is mentioned as an example of required components, under “model parameters”. An objection was raised, arguing that this poses an unnecessary burden: It’d be like requiring that for software to be Open Source, it should include past versions of the program.
      • Maffulli reiterated that this was merely an example but that this might need to be a submission to the FAQ page
    • Under “Preferred form to make modifications to machine-learning systems” and “data information”, a “skilled person” is mentioned in the context of requiring sufficient information about the training data used to create a model. A question was raised about what skill has to do with acquiring data.
      • Clarification was given by Maffulli, pointing out that this is in the context of getting information about the data so that a “skilled person” can use, study, share and modify the AI system.
      • A user suggested that this confusion could be resolved by changing the wording after “a skilled person can recreate” from “using the same or similar data” to “if able to gain access to the same or similar data”.
      • A user points out that “skilled person” as a legal term used in patent law might not be appropriate as it has different legal connotations and precedence in different countries.
  • Discussion on why specifically we focus on machine learning (ML) as an AI system
    • A question was raised regarding why we explicitly mention ML systems under “preferred form to make modification to an ML system” and subsequently the “checklist”, pointing out that not all AI systems are ML.
      • Maffulli replied that we address ML systems explicitly because they need special and urgent attention, while rule-based AI systems can already fit under the Open Source Definition. This needs to be addressed in the FAQ.
Town hall announcement 
  • The 9th town hall meeting was held on the 3rd of May. Access the recording here if you missed it!
Categories: FLOSS Research

CRA standards request draft published

Open Source Initiative - Thu, 2024-05-02 08:19

The European Commission recently published a public draft of the standards request associated with the Cyber Resilience Act (CRA). Anyone who wants to comment on it has until May 16, after which comments will be considered and a final request to the European Standards Organizations (ESOs) will be issued. This process is all governed by regulation 2012/1025, which will be discussed in a future post.

The publication of this draft is important for every entity that will have duties under the CRA, namely “manufacturers” and “software stewards.” Conformance with the harmonized standards that emerge from this process will allow manufacturers to CE-mark their software on the presumption it complies with the requirements of the CRA, without taking further steps.

For those who depend on incorporating or creating Open Source software, there is an encouraging new development found here. For the first time in a European standards request, there is an express requirement to respect the needs of Open Source developers and users. Recital 10 tells each standards organization the following:

“where relevant, particular account should be given to the needs of the free and open source software community”

That is made concrete in Article 2 which specifies:

“The work programme shall also include the actions to be undertaken to ensure effective participation of relevant stakeholders, such as small and medium enterprises and civil society organizations, including specifically the open source community where relevant”

Article 3 requires proof that effective participation has been facilitated. The community is going to have to step up to help the ESOs satisfy these requirements—or corporations claiming to speak for the community will do it instead.

OSI applauds the Commission’s steps to include the Open Source community and will be pleased to work with the European standards organizations towards that initial goal of effective representation and consultation. Additionally, the OSI will:

  • Work with our Affiliates to identify additional suitable participants with relevant skills and experience, and make connections between them and the ESOs.
  • Assist the Commission in validating responses to Article 3.

Our goal is to ensure that the development and use of Open Source software is at best facilitated and at worst not obstructed by any aspect of the standards development process, the resulting harmonized standards, and the access and IPR terms of those standards.

This post may be discussed on our forum

Categories: FLOSS Research

Open Source AI Definition – Weekly update April 29

Open Source Initiative - Mon, 2024-04-29 07:59
New draft of the Open Source AI Definition v.0.0.8 is live!
  • The draft is ready for feedback
  • The changelog: 
    • incorporated feedback from legal review in Gothenburg and 0.0.7
      • transformed Data transparency to Data information following feedback
      • separated the Out of scope section into a FAQ document
      • added mention of frictionless in the preamble
      • moved the definition of preferred form to make modifications to ML above the checklist
    • updated language to follow the latest version of the Model Openness Framework
    • added the legal requirements for optional components
    • the first incarnation of the FAQ added
  • The next steps now include:
    • Widen the call for reviewers in the next couple of weeks
    • Test the Definition with more AI systems (OLMo, Phi, Mistral, etc.)
    • Run a review workshop at PyCon US
Initial reactions 
  • A question was raised about why the mention of model weights was removed under “Preferred form to make modifications to machine-learning systems” and “model”.
Vote on how to describe the acceptable terms to receive documentation?
  • As part of the next steps, we are continuing to review legal documents from different AI systems to test our definition. When describing the terms listed in the 0.0.8 draft under “checklist to evaluate machine learning systems”, should we consider them OSD Compliant or OSD Compatible?
    • This matters as it has different implications for the documentation of the components in the class of Data transparency: there is no formal definition of “open documentation”, and the OSI hasn’t reviewed licenses used for documentation.
  • A user has concerns with both, stating that:
    • “OSD-compliant” would mean that documentation needs to be under a license that fulfills all ten OSD criteria, many of which are quite software-specific. This could be tricky; there is a reason why OSI hasn’t approved (m)any non-software licenses thus far.
    • “OSD-compatible” risks losing its meaning, since many proprietary licenses are compatible with many (non-copyleft) OSD-compliant licenses.
  • Maffulli replies stating that:
    • The main difference he sees lies in their perceived legal strictness, where “Compatible suggests a lightweight review that anyone can do”.
    • He further suggests that OSI could create a special category of licenses for documentation only. When stating that documentation of Open Source AI needs to be available with OSD-compliant terms, do we need to create a special category of OSI Approved Licenses for documentation?
    • He further adds that he reads “compliant” not in terms of existing licenses but rather in terms of the checklist.
  • Regarding creating a “special category of licenses for documentation only”, a user adds:
    • “We need that the documentation is free from restrictions that would limit its circulation, including by requiring seeking additional permission or requiring royalties or requiring audited distribution or the likes,” and that its scope therefore is quite limited.
FAQ document has been created 
  • An FAQ needs to be written to address concerns heard often during the drafting process. The document is a work in progress and is waiting for contributions.
See if OSI is coming near you to host a workshop
  • The Open Source AI Definition is going on tour to get a wide array of reviews. This is important to ensure thorough reviews and secure global significance. Check the dates of the roadshow.
Categories: FLOSS Research

Openly Shared

Open Source Initiative - Fri, 2024-04-26 08:02

The definition of “open source” in the most recent version (article 2(48)) of the Cyber Resilience Act (CRA) goes beyond the Open Source Definition (OSD) managed by OSI. It says:

“Free and open-source software is understood as software the source code of which is openly shared and the license of which provides for all rights to make it freely accessible, usable, modifiable and redistributable.”

The phrase “openly shared” was a considered and intentional addition by the co-legislators – they even checked with community members that it did not cause unintended effects before adding it. While open source communities all “openly share” the source code of their projects, the same is not true of some companies, especially those with “open core” business models.

For historical reasons, it is not a requirement of either the OSD or the FSF’s Free Software Definition (FSD), and the most popular open source licenses do not require it. Notably, the GPL does not insist that source code be made public – only that those receiving the binaries must be able to request the corresponding source code and enjoy it however they wish (including making it public).

For most open source projects and their uses, the CRA’s extra requirement will make no difference. But it complicates matters for companies that restrict source availability to paying customers (such as Red Hat), make little distinction between available and non-available source (such as ForgeRock), or withhold source for certain premium elements.

A similar construct{1} is used in the AI Act (recital 102) and I anticipate this trend will continue through other future legislation. Personally I welcome this additional impetus to openness.

This post may be discussed on our forum

{1} The mention in the AI Act has a different character to that in the CRA. In the AI Act it is more narrative, restricted to a recital, and is a subset of attributes of the license; in that form it actually refers to virtually no OSI-approved licenses. In the CRA the wording is part of the formal definition in an Article, so it is much more impactful, and it adds an additional requirement over the basic requirements of licensing.

Categories: FLOSS Research

Open Source AI Definition on the road: Looking back and forward

Open Source Initiative - Tue, 2024-04-23 13:15

With version 0.0.7 of the Open Source AI Definition just published, we are getting very close to a release candidate version in June, as planned. We’ve covered a lot of ground since FOSDEM 2024, where we presented draft 0.0.4. This month we presented at Open Source Summit North America (OSS NA 24) and ran a co-design workshop at the Legal and Licensing Workshop (LLW) in Gothenburg. We’re very close to “feature complete”: below are the next steps and ideas on how you might get involved.

Opportunities to meet in person

We are taking the draft definition on the road and coming to a town near you! Or, kind of, that is if you live in any of the following cities or happen to be there on the given dates:

  1. North America 
    1. USA, Pittsburgh, PyCon US (May 17)
    2. USA, NYC OSPOs for Good (July 9-11)
    3. USA, Raleigh, All Things Open (October 27-29)
  2. Europe
    1. France, Paris, OW2 (June)
    2. France, Paris, data governance event (September)
  3. Africa
    1. Nigeria, Lagos, Sustain Africa (June)
  4. Latin America
    1. Argentina, Buenos Aires, Nerdearla (September 24-28)
  5. Asia Pacific
    1. Hong Kong, AI_Dev (August 23)

We hope you’ll catch up with us at one of these stops.

Draft v.0.0.5 at FOSDEM 2024

The talk “Moving a Step Closer to Defining Open Source AI” (click here to watch the recorded live stream) by Stefano Maffulli presented draft v.0.0.5, released a few days before. The process at the time was focusing on finding the required components to “use, study, share and modify” an AI system. 

Maffulli quickly summarized why OSI started the Deep Dive: AI process after Copilot not only demonstrated machines’ ability to write functioning code but also highlighted the new role of data as input to the machine learning system. Recognizing that there is no simple answer to the question “what is the source code of Copilot?”, Maffulli focused OSI’s attention on finding the Open Source principles applicable to AI together with stakeholders from academia, legal communities, tech companies, and civil rights groups.

Building the framework

OSI defined a process to co-design the Open Source AI Definition in public. This framework encompasses a clear definition of AI systems, a preamble outlining the rationale behind open source AI, a concise articulation of the freedoms users should enjoy, and a checklist for evaluating AI components and associated legal documents.

He highlighted the rapid progress and policy decisions that shaped the trajectory of software development, emphasizing the need to compress decades of evolution into a few months in the realm of AI. Stefano also stressed the importance of community feedback and collaboration in refining the definition of Open Source AI. With monthly draft releases, bi-weekly town halls, and an active forum, we gather diverse perspectives and insights to craft a robust definition.

OSS North America 2024 and next steps

Since FOSDEM, the Definition has reached version 0.0.7. First, working groups analyzed Pythia, OpenCV, Llama 2 and BLOOM to find the preferred form of making modifications to the AI system, the fundamental unit for users to exercise their freedoms. Later, the groups shifted focus to reviewing the legal frameworks covering the components of Pythia, OpenCV, Llama 2 and BLOOM. Together with the definition of AI system provided by the OECD, the preamble, out-of-scope issues and four freedoms, this draft looks very close to a full document. A new version is expected to be released very soon. On the 16th of April, Ofer Hermoni of the Linux Foundation and Mer Joyce (OSI/DoBigGood) presented the work at the OSS NA 24 meeting in Seattle. A huge part of our job currently is getting this definition reviewed by as many stakeholders as possible. A far-reaching and diverse perspective is necessary as we aim for global impact.

To participate in shaping the definition of Open Source AI and stay updated on the latest developments, visit opensource.org/deepdive, engage with the ongoing discussions, and participate in or watch previous town hall meetings and draft releases. Go to discuss.opensource.org to participate in our forum.

Categories: FLOSS Research

Open Source AI Definition – Weekly update April 22

Open Source Initiative - Mon, 2024-04-22 10:42
Comments on the forum
  • A user added in the forum that there is an issue, as traditional copyright protection might not apply to model weights because they are essentially mathematical calculations: “licensing them through any kind of copyright license will not be enforceable !! and this means that anybody can use them without any copyright restriction (assuming that they have been made public) and this means that you cannot enforce any kind of provisions such as attribution, no warranty or copyleft”. They suggest using contractual terms instead of relying on copyright as a workaround, acknowledging that this will trigger a larger conversation.
Comments left on the definition text
  • Clarification needed under “What is Open Source AI”
  1. Discussion on whether “made available” should be changed to “released” or “distributed”
    1. One user pointed out that “made available” is the most appropriate, as the suggested wordings would be antagonistic and limiting
  2. Continuation of last week’s issue regarding defining who these four freedoms are for: deployers, users or someone else.
    1. A user added that they understand it as “We need essential freedoms to enable users…”
    2. But then, who are we defining as “users”? Is it the person deploying the AI or the one writing the prompts?
    3. Another wording is suggested: “Open Source AI is an AI system that is made available under terms that grant, without conditions or restrictions, the rights to…”
  • Clarification is needed under “Preferred form to make modifications to a machine learning system”:
  1. Specifically to the claim: (The following components are not required,) but their inclusion in releases is appreciated.
    1. Clarification was requested on whether this means best practice or is merely a suggestion.
    2. Suggestion to change the sentence to “The following components are not required to meet the Open Source AI definition and may be provided for convenience.” This will also “consider if those components are provided, can they be provided under different terms that don’t meet the Open Source AI definition, or do they fall under the same OSI compliant license automatically. “
  2. Question regarding the addition of “may” under data transparency in the 0.0.7 draft definition, which was not included in the 0.0.6 one, considering that the components are described as “required” in the checklist below
    1. (Context: “Sufficiently detailed information on how the system was trained. This may include the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics; how the data was obtained and selected, the labelling procedures and data cleaning methodologies.”)
    2. Another user seconds this and further adds that it should be changed to “must”, or something else which is definitive.
Town Hall meeting was held on April 19th

In case you missed it, the town hall was held last Friday. Access the recordings and the slides used here.

Categories: FLOSS Research
