Guide from MIT reveals how small AI models can predict performance of large LLMs

Researchers collected data from 40 model families and fitted over a thousand candidate scaling laws to derive guidelines for training LLMs efficiently and for weighing compute cost against prediction fidelity.


MIT-IBM Watson AI Lab researchers have developed a universal guide for estimating how large language models will perform based on smaller models in the same family.

Scaling law estimation helps organisations make better decisions about architecture, optimisers and dataset sizes before committing extensive compute budgets.

The team assembled 485 pre-trained models across 40 families (including Pythia, OPT, Bloom, LLaMA and others) and tracked almost 1.9 million performance metrics. Using that dataset, they fitted more than 1,000 scaling laws and assessed how variables such as the number of parameters, the token count, intermediate training checkpoints, and random seeds affect prediction accuracy.
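
As a rough illustration of what "fitting a scaling law" involves, the sketch below fits a Chinchilla-style loss curve to a handful of small-model results and extrapolates it to a larger model. The functional form, the use of scipy's curve_fit and every number in it are assumptions made for illustration; this is not the lab's code or data.

```python
# Illustrative sketch only: fitting a Chinchilla-style scaling law
#   L(N, D) = E + A / N**alpha + B / D**beta
# to losses observed on small models in a family, then extrapolating
# to a larger target model. All numbers are made-up placeholders,
# not data from the MIT-IBM study.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, alpha, B, beta):
    """Predicted loss for parameter count N and training-token count D."""
    N, D = x
    return E + A / N**alpha + B / D**beta

# Hypothetical (parameters, tokens, loss) observations from small models.
N = np.array([70e6, 160e6, 410e6, 1.0e9, 1.4e9, 2.8e9])
D = np.array([10e9, 20e9, 40e9, 100e9, 140e9, 280e9])
loss = np.array([3.90, 3.50, 3.10, 2.80, 2.70, 2.55])

params, _ = curve_fit(
    scaling_law, (N, D), loss,
    p0=[1.5, 400.0, 0.3, 400.0, 0.3],   # rough initial guesses
    maxfev=20000,
)

# Extrapolate to a much larger model before committing compute to it.
predicted = scaling_law((np.array([7e9]), np.array([1.4e12])), *params)
print(f"Predicted loss for a 7B model on 1.4T tokens: {predicted[0]:.2f}")
```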

Practical recommendations include discarding performance data from very early in training (before roughly 10 billion tokens), training several small models of varying sizes rather than relying only on large ones, and using intermediate checkpoints rather than waiting for the final model's loss.

The guide also notes that an absolute relative error (ARE) of about 4 percent is close to the best achievable prediction quality, although errors of up to 20 percent ARE can still be useful depending on the budget.
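
For reference, absolute relative error compares a predicted loss with the loss actually observed after training. The snippet below is a minimal, self-contained definition; the example losses are invented purely to show what a 4 percent error looks like.

```python
# Minimal definition of absolute relative error (ARE); the example
# losses are invented, not results from the study.
def absolute_relative_error(predicted: float, actual: float) -> float:
    """ARE = |predicted - actual| / |actual|, expressed as a percentage."""
    return abs(predicted - actual) / abs(actual) * 100

# A predicted loss of 2.62 against an observed loss of 2.73 gives ~4% ARE,
# roughly the best-case accuracy the guide reports.
print(absolute_relative_error(2.62, 2.73))  # ~4.0
```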

Because training large models can cost millions, these scaling laws also help teams without huge resources to estimate likely outcomes before spending. Scaling laws for model inference, by contrast, are still under development and are flagged as important future work.
