Open Source Weights, Code, and Dataset; Performance Surpassing Mistral-7B; Apple’s Small Model is Here
OpenAI launched its small model GPT-4o-mini, officially kicking off the small-model race, and Apple has now joined the competition.
Apple, one of the institutions behind the DataComp-LM (DCLM) project, has released the DCLM-7B open-source model on Hugging Face. Its performance already surpasses Mistral-7B and approaches other leading open-source models, including Llama 3 and Gemma.
A persistent evaluation challenge for large language models (LLMs) is the lack of controlled comparisons: LLM research often compares models that differ in architecture, compute budget, or hyperparameters, making it difficult to isolate the factors that actually determine model quality.
In response, the research team proposed a new benchmark for comparing language-model training data: DCLM. It is the first benchmark for the curation of language-model training data, aimed at improving model performance through the design of higher-quality datasets.
The research team found that model-based filtering, in which machine learning (ML) models automatically score and select high-quality documents from a larger corpus, could be the key to building high-quality training sets.
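The idea can be sketched in a few lines: score every document with a quality classifier and keep only those above a threshold. The scorer below is a toy heuristic standing in for a real trained classifier (the actual DCLM pipeline is more sophisticated); the function names and threshold are illustrative assumptions, not the project's code.

```python
# Minimal sketch of model-based filtering. quality_score() is a toy
# stand-in for a trained quality classifier; in a real pipeline it would
# be a learned model scoring each document.

def quality_score(doc: str) -> float:
    """Toy heuristic: reward longer text that ends in sentence punctuation."""
    words = doc.split()
    if not words:
        return 0.0
    length_signal = min(len(words) / 50.0, 1.0)          # saturate at 50 words
    punct_signal = 1.0 if doc.strip().endswith((".", "!", "?")) else 0.5
    return length_signal * punct_signal

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents the scorer rates above the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "Short fragment",
    "A complete, well-formed paragraph " * 10 + "ends with a period.",
]
kept = filter_corpus(docs)  # only the second document survives
```

Swapping the heuristic for a learned classifier turns this into the kind of data-curation strategy DCLM is designed to evaluate.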
The overall idea behind DCLM is straightforward: hold the experimental setup fixed, including model architectures, training code, hyperparameters, and evaluations, so that experiments isolate which data-curation strategy is best suited to training high-performing models.
Using DCLM, the research team constructed a high-quality dataset called DCLM-BASELINE and used it to train a 7B-parameter model, DCLM-7B, from scratch.
DCLM-7B employs a pre-training scheme based on the OpenLM framework, achieving a 5-shot accuracy of 64% on the MMLU benchmark, comparable to Mistral-7B-v0.3 (63%) and Llama 3 8B (66%). Additionally, its average performance across 53 natural language understanding tasks is also comparable to Mistral-7B-v0.3 and Llama 3 8B, while requiring only 1/6 of the computational resources of Llama 3 8B.
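The "1/6 of the compute" claim can be sanity-checked with the common training-cost approximation FLOPs ≈ 6·N·D, where N is the parameter count and D the number of training tokens. The token counts below are assumptions for illustration (roughly 2.5T for DCLM-7B and roughly 15T for Llama 3 8B), not figures from this article.

```python
# Back-of-envelope check of the compute comparison using FLOPs ~ 6 * N * D.
# Token counts are assumed values for illustration, not from the article.

def train_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs: 6 * parameters * training tokens."""
    return 6 * params * tokens

dclm_flops = train_flops(7e9, 2.5e12)    # DCLM-7B, ~2.5T tokens (assumed)
llama3_flops = train_flops(8e9, 15e12)   # Llama 3 8B, ~15T tokens (assumed)
ratio = dclm_flops / llama3_flops        # ~0.15, i.e. roughly 1/6 to 1/7
```

Under these assumptions the ratio lands near 1/7, consistent with the article's "1/6 of the computational resources" figure.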
By contrast, most other models release open weights but keep their training data closed, which is why Vaishaal Shankar describes the DCLM models as "truly open source."