The Future of AI is Open-source as Dolly 2.0, RedPajama Released
- By Paul Mah
- April 19, 2023
AI might be having a watershed moment, as startups, collectives, and academics work together to shift the current trend of closed, proprietary large language models (LLMs) in favor of open-source AI.
To date, some of the most powerful AI models, such as GPT-4, remain commercially exclusive, while others have been released under restrictive licenses and only to a select, vetted group of academic researchers.
Even Meta’s LLaMA (Large Language Model Meta AI) model, released in February and touted as furthering open science, did not offer enough detail for the model to be independently recreated.
All this is set to change with the release of new open-source models.
RedPajama
“In many ways, AI is having its Linux moment,” wrote the firm Together in a blog post announcing the release of RedPajama, which it bills as an effort to produce a reproducible, fully-open, leading language model.
RedPajama is based on Meta’s 7-billion-parameter LLaMA model, which was trained on a very large dataset of 1.2 trillion tokens carefully filtered for quality.
Speaking to VentureBeat, Vipul Ved Prakash, founder and CEO of Together, explained how the team – consisting of other AI firms, researchers, and academics – put RedPajama together.
“All of the data LLaMA was trained on is openly available data, but the challenge was that they didn’t provide the actual data set – there’s a lot of work to go from the overview to the actual data set,” he said.
However, while the paper might describe, say, how the best 10,000 documents were picked from a million, the details themselves were not provided. Prakash said: “So we followed the recipe to repeat all that work to create an equivalent dataset.”
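To give a flavor of what “repeating the recipe” involves, here is a toy sketch of heuristic document quality filtering. The heuristics below (alphabetic-character ratio and document length) are illustrative assumptions only, not the actual RedPajama filters, which are published in the project’s GitHub repository:

```python
# Toy illustration of document quality filtering. NOT the actual
# RedPajama filters; the heuristics here are assumptions chosen
# purely for illustration.

def quality_score(doc: str) -> float:
    """Score a document with simple heuristics: longer documents with
    a higher proportion of alphabetic characters score higher."""
    if not doc:
        return 0.0
    alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)
    length_bonus = min(len(doc.split()), 100) / 100  # cap at 100 words
    return alpha_ratio * length_bonus

def select_top_documents(docs: list[str], k: int) -> list[str]:
    """Keep the k highest-scoring documents, mimicking the idea of
    picking the 'best 10,000 documents from a million'."""
    return sorted(docs, key=quality_score, reverse=True)[:k]

docs = [
    "A well-formed paragraph of natural language text about language models.",
    "@@## 1234 $$%% 5678 ^^&&",  # mostly symbols: scores zero
    "short",                     # too little content: scores low
]
best = select_top_documents(docs, k=1)
```

Real pipelines apply many such filters at scale, alongside deduplication and per-source rules; the point is simply that each filtering decision left out of a paper must be reverse-engineered to reproduce the dataset.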
Prakash noted that broader access will open the door for “a lot of brilliant people” around the world to further explore LLM architectures and training algorithms, and to research AI safety.
RedPajama is licensed under Apache 2.0, and all of its data pre-processing and quality filters are available on GitHub.
Dolly 2.0
Databricks, which recently released the open-source Dolly LLM, launched Dolly 2.0 just weeks later, billing it as the “world’s first open-source, instruction-following large language model (LLM), fine-tuned on a human-generated instruction dataset licensed for commercial use”.
Dolly 2.0 is a 12B parameter language model based on the EleutherAI Pythia model family and fine-tuned with a high-quality human-generated instruction-following dataset crowdsourced from Databricks employees.
Specifically, it is the fruit of the efforts of more than 5,000 Databricks employees over the last two months. As a result, the databricks-dolly-15k dataset contains 15,000 high-quality, human-generated prompt/response pairs designed specifically for instruction tuning large language models.
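Each record in databricks-dolly-15k pairs an instruction with a human-written response, plus an optional context passage and a category label. A minimal sketch of assembling such a record into a training prompt is shown below; the field names match the published dataset, but the exact prompt template is an assumption for illustration (see Databricks’ repository for the format actually used in fine-tuning):

```python
# Sketch: turning a databricks-dolly-15k-style record into an
# instruction-tuning prompt. The field names (instruction, context,
# response) match the published dataset; the prompt template itself
# is an illustrative assumption.

def build_prompt(record: dict) -> str:
    parts = ["Below is an instruction that describes a task."]
    parts.append(f"### Instruction:\n{record['instruction']}")
    if record.get("context"):  # context is optional in the dataset
        parts.append(f"### Context:\n{record['context']}")
    parts.append(f"### Response:\n{record['response']}")
    return "\n\n".join(parts)

record = {
    "instruction": "Summarize the article in one sentence.",
    "context": "",
    "response": "Open-source LLMs like Dolly 2.0 are now available.",
    "category": "summarization",
}
prompt = build_prompt(record)
```

During fine-tuning, many thousands of such prompts teach the base model to follow instructions rather than merely continue text.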
According to Databricks, the licensing terms for Dolly 2.0 allow anyone to use, modify, or extend the dataset for any purpose, including commercial applications.
“Dolly 2.0 is a game changer as it enables all organizations around the world to build their own bespoke models for their particular use cases to automate things and make processes much more productive in the field they’re in,” said Ali Ghodsi, the chief executive officer of Databricks.
“With Dolly 2.0, any organization can create, own, and customize a powerful LLM to create a competitive advantage for their business,” he said.
The model weights, the databricks-dolly-15k dataset, and helpful code samples for Dolly 2.0 can be accessed from Databricks’ Hugging Face page.
Onwards through open-source
The current generation of generative AI models did not happen overnight but is the result of many years of behind-the-scenes research. As these efforts converged in recent months into models with jaw-dropping instruction-following capabilities, however, many new players, from technology giants to AI startups, have jumped into the game.
But while many foundational AI tools are open source, I wrote previously that the race to develop more capable AI may see organizations holding back. The release of LLMs such as Dolly 2.0 and RedPajama will likely tilt the scale back towards the democratization of AI for the enterprise.
I think Together summed it up well when it wrote in its blog: “The most capable foundation models today are closed behind commercial APIs, which limits research, customization, and their use with sensitive data. Fully open-source models hold the promise of removing these limitations.”
Of course, that will only happen “if the open community can close the quality gap between open and closed models”.
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].
Image credit: Dall-E 2