Dolma: An Open Corpus for Large Language Models
October 9, 2024
The biggest complaint AI developers have is the lack of variety and diversity in the open data available to train large language models (LLMs). According to the computer science paper "Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research," from the Allen Institute for AI and posted to Cornell-hosted arXiv, such open data does now exist.
The paper's abstract succinctly details the difficulties with AI training data:
“Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations.”
Due to this lack of open data, the paper's team curated their own dataset, which they call Dolma. Dolma is a three-trillion-token English corpus. It was built on web content, public domain books, social media, encyclopedias, code, scientific papers, and more. The team thoroughly documented every information source so Dolma would not suffer the same problems as other training sets, such as hoovering up copyrighted material and private user data.
Dolma's documentation also covers how the corpus was built, its design principles, and summaries of its contents. The team shares Dolma's development through analyses and experimental results. They are documenting everything thoroughly so the corpus stays transparent and (hopefully) avoids any problems that are not purely technical. Dolma's curation toolkit is open source, and the team wants developers to use it. This is a great effort on the part of Dolma's creators! They support AI development and data curation, but insist on doing it responsibly.
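For curious readers, here is a minimal sketch of how one might sample the corpus in Python. It assumes Dolma is mirrored on the Hugging Face Hub under the name allenai/dolma and that each record carries "text" and "source" fields; those details are assumptions for illustration, not something confirmed by the paper or this post.

```python
# Minimal sketch: stream a few Dolma documents without downloading
# the whole three-trillion-token corpus. The repository name
# "allenai/dolma" and the "text"/"source" fields are assumptions.
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", split="train", streaming=True)

# Peek at the first five documents and note where each came from.
for i, doc in enumerate(dolma):
    source = doc.get("source", "unknown")
    snippet = doc["text"][:120].replace("\n", " ")
    print(f"[{source}] {snippet!r}")
    if i >= 4:
        break
```

Streaming keeps the example honest about scale: the whole point of Dolma is its size, and no laptop needs the full three trillion tokens on disk.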
Give them a huge round of applause!
Cynthia Murrell, October 10, 2024