A Small Language Model (SLM) is an LM that can fit onto a common consumer electronic device and perform inference with latency sufficiently low to be practical when serving the agentic requests of one user. […] We note that as of 2025, we would be comfortable with considering most models below 10bn parameters in size to be SLMs.
The (NVIDIA) researchers argue that most agentic applications perform repetitive, specialized tasks that don’t require the full generalist capabilities of LLMs. They propose heterogeneous agentic systems where SLMs handle most tasks while LLMs are used selectively for complex reasoning. They present three main arguments: (1) SLMs are sufficiently powerful for agentic tasks, as demonstrated by recent models like Microsoft’s Phi series, NVIDIA’s Nemotron-H family, and Hugging Face’s SmolLM2 series, which achieve comparable performance to much larger models while being 10-30x more efficient. (2) SLMs are inherently more operationally suitable for agentic systems due to their faster inference, lower latency, and ability to run on edge devices. (3) SLMs are necessarily more economical, offering significant cost savings in inference, fine-tuning, and deployment.
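To make the heterogeneous setup concrete, here is a minimal routing sketch in Python; the model endpoints, the length threshold, and the `tool_call_only` flag are my own illustrative assumptions, not anything specified in the paper.

```python
# Minimal sketch of a heterogeneous agentic setup (hypothetical names,
# thresholds, and helpers; not from the paper): routine, well-specified
# requests stay on a local SLM, open-ended reasoning falls back to an LLM.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    name: str
    call: Callable[[str], str]

def call_local_slm(prompt: str) -> str:
    # Placeholder for an on-device SLM endpoint (e.g. a quantized 7B model).
    return f"[slm] {prompt[:40]}..."

def call_hosted_llm(prompt: str) -> str:
    # Placeholder for a hosted frontier-LLM API.
    return f"[llm] {prompt[:40]}..."

SLM = Route("local-slm", call_local_slm)
LLM = Route("hosted-llm", call_hosted_llm)

def route(prompt: str, tool_call_only: bool) -> Route:
    """Crude difficulty heuristic: short, tool-shaped requests stay local."""
    return SLM if tool_call_only and len(prompt) < 2_000 else LLM

if __name__ == "__main__":
    task = "Extract the invoice date and total from the attached JSON."
    chosen = route(task, tool_call_only=True)
    print(chosen.name, "->", chosen.call(task))
```

In practice the routing signal would come from a learned classifier or the agent framework's own task metadata rather than a length check, but the division of labor is the same.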
The paper addresses counterarguments about LLMs’ superior language understanding and centralization benefits with studies (see Appendix B: LLM-to-SLM Replacement Case Studies) showing that 40-70% of LLM queries in popular open-source agents (MetaGPT, Open Operator, Cradle) could be replaced by specialized SLMs. One comment I read raised important concerns about the paper’s analysis, particularly regarding context windows, arguably the highest technical barrier to SLM adoption in agentic systems. Modern agentic applications require substantial context: Claude 4 Sonnet’s system prompt alone reportedly uses around 25k tokens, and a typical coding agent needs system instructions, tool definitions, file context, and project documentation, totaling 5-10k tokens before any actual work begins. Most SLMs that can run on consumer hardware are architecturally capped at 32k or 128k context, and achieving reasonable inference speeds at those limits already requires gaming-class hardware (roughly 8GB of VRAM for a 7B model at 128k context).
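As a rough sanity check on that VRAM figure, the back-of-envelope sketch below estimates weights plus KV cache for a 7B model at 128k context; the Mistral-7B-like shape (32 layers, 8 KV heads via GQA, head dim 128) and the quantization choices are my assumptions, not numbers from the paper or the comment.

```python
# Back-of-envelope VRAM estimate for a 7B model at long context (assumed
# Mistral-7B-like shape: 32 layers, 8 KV heads via GQA, head_dim 128).
# All figures are illustrative assumptions, not measurements.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # K and V caches: one entry per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def weights_bytes(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8

GiB = 1024 ** 3

weights = weights_bytes(7e9, bits_per_weight=4)       # 4-bit quantized weights
kv_fp16 = kv_cache_bytes(32, 8, 128, 131_072, 2)      # fp16 KV cache at 128k
kv_int8 = kv_cache_bytes(32, 8, 128, 131_072, 1)      # 8-bit KV cache at 128k

print(f"weights (4-bit): {weights / GiB:.1f} GiB")     # ~3.3 GiB
print(f"KV @128k, fp16:  {kv_fp16 / GiB:.1f} GiB")     # ~16 GiB
print(f"KV @128k, int8:  {kv_int8 / GiB:.1f} GiB")     # ~8 GiB
```

Even with 4-bit weights, the KV cache alone at 128k tokens lands in the 8-16 GiB range under these assumptions, which is why long-context inference spills past the 8GB cards common in consumer machines.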
The paper concludes that the shift to SLMs is inevitable due to economic and operational advantages, despite current barriers including infrastructure investment in LLM serving, a generalist benchmark focus, and limited awareness of SLM capabilities. But the economic-efficiency claims also face scrutiny under system-level analysis. In Section 3.2 they present simplistic FLOP comparisons while ignoring critical inefficiencies: reliance on multi-shot prompting, where an SLM might require 3-4 attempts for tasks an LLM completes with a 90% success rate; task-decomposition overhead that multiplies context-setup costs and error rates; and infrastructure-efficiency differences between optimized datacenters (PUE near 1.1, >90% GPU utilization) and consumer hardware (5-10% GPU utilization, residential HVAC, 80-85% power-conversion efficiency). Once failed attempts, orchestration overhead, and infrastructure efficiency are accounted for, many “economical” SLM deployments might actually consume more total energy than centralized LLM inference.
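To see how those factors compound, here is a toy expected-energy-per-completed-task model; the per-attempt energy figures and the effective consumer-side PUE are illustrative assumptions layered onto the success-rate and PUE numbers quoted above, not data from the paper.

```python
# Toy energy-per-completed-task comparison. All figures are illustrative
# assumptions loosely based on the numbers quoted above, not measurements.

def energy_per_completed_task(energy_per_attempt_wh, success_rate, pue):
    # Expected attempts until one success, assuming independent retries.
    expected_attempts = 1.0 / success_rate
    return expected_attempts * energy_per_attempt_wh * pue

# Centralized LLM: higher per-attempt energy, ~90% success, datacenter PUE ~1.1.
llm = energy_per_completed_task(energy_per_attempt_wh=3.0, success_rate=0.9, pue=1.1)

# Consumer SLM: cheaper per attempt, but 3-4 attempts on average, with worse
# facility and power-supply efficiency folded into an effective PUE of ~1.5.
slm = energy_per_completed_task(energy_per_attempt_wh=1.0, success_rate=0.3, pue=1.5)

print(f"LLM: {llm:.2f} Wh per completed task")   # ~3.7 Wh
print(f"SLM: {slm:.2f} Wh per completed task")   # ~5.0 Wh
```

Under these (contestable) assumptions the retried SLM workflow ends up costing more energy per completed task than the centralized LLM, which is exactly the system-level effect a raw FLOP comparison hides.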
(05/07/2025) Update: On the topic of speed, I just came across Ultra-Fast Language Models Based on Diffusion. You can also test it yourself using the free playground link, and it is in fact extremely fast. Try the “Diffusion Effect” toggle in the top right corner, which shows an interesting visualization. I’m not sure how realistic it is: it shows text appearing as random noise before gradually resolving into clear words, whereas the actual process more likely involves tokens evolving from imprecise vectors in a multidimensional space toward more precise representations until they crystallize into specific words.
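For intuition only, here is a toy cartoon of that iterative-refinement idea, where masked positions are committed in parallel over a few steps instead of strictly left to right; it is my own sketch, not the actual algorithm behind either of the linked models.

```python
# Toy cartoon of diffusion-style text decoding: every position starts masked
# ("noisy") and a growing fraction is committed in parallel each step, rather
# than generating one token at a time. Purely illustrative.

import random

TARGET = "small models handle routine agent work".split()
MASK = "░░"

def diffusion_decode(target, n_steps=4, seed=0):
    rng = random.Random(seed)
    state = [MASK] * len(target)
    for step in range(1, n_steps + 1):
        masked = [i for i, tok in enumerate(state) if tok == MASK]
        if not masked:
            break
        # Pretend the model is most confident about a random subset each step
        # and commit a growing fraction of the still-masked positions.
        k = max(1, round(len(masked) * step / n_steps))
        for i in rng.sample(masked, min(k, len(masked))):
            state[i] = target[i]
        print(f"step {step}: {' '.join(state)}")
    return state

diffusion_decode(TARGET)
```

The real models predict all positions jointly and refine them with far more structure, but this "commit in parallel, refine over a few steps" shape is presumably where the speedup over token-by-token autoregressive decoding comes from.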
(06/07/2025) Update II: Apparently there is also a Google DeepMind Gemini Diffusion Model.