Hyperscalers spent over $350 billion on AI infrastructure in 2025 alone, with projections exceeding $500 billion in 2026. The trillion-dollar question is not whether machines can reason, but whether anyone can afford to let them. Hybrid recommender systems sit at the center of this tension. Large Language Models promised to transform how Netflix suggests your next show or how Spotify curates your morning playlist. Instead, the industry has split into two parallel universes, divided not by capability but by cost.
On one side sits what engineers call the “classical stack”: matrix factorization, two-tower embedding models, and contextual bandits. These methods respond in microseconds, scale linearly with users, and run on nothing more complicated than dot products. A query costs a fraction of a cent. On the other side is the “agentic stack”: LLM-based reasoning engines that can handle requests like “find me a sci-fi movie that feels like Blade Runner but was made in the 90s.” This second approach consumes thousands of tokens per recommendation. The cost difference is not incremental; it is orders of magnitude. The economics of LLM inference, more than any algorithmic breakthrough, are now the dominant force shaping recommender architecture.
The 2026 consensus is a hybrid architecture: use the cheap, fast models for candidate generation from millions of items, then invoke the expensive reasoning layer only for the final dozen items a user actually sees. This “funnel” pattern — retrieval, then ranking, then re-ranking — is the only way to make the economics work. The smartest model is reserved for the fewest items.
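The funnel can be sketched in a few lines. Everything here is illustrative — the catalog size, embedding dimension, stage widths, and function names are assumptions, and the LLM re-ranker is stubbed out as a pass-through — but the shape is the point: each stage is more expensive per item and sees fewer items.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical catalog: 200k items with 32-dim embeddings (illustrative sizes).
N_ITEMS, DIM = 200_000, 32
item_embeddings = rng.standard_normal((N_ITEMS, DIM)).astype(np.float32)

def retrieve(user_vec, k=500):
    """Stage 1: cheap dot-product retrieval over the full catalog."""
    scores = item_embeddings @ user_vec
    return np.argpartition(scores, -k)[-k:]  # top-k indices, unsorted

def rank(user_vec, candidates, k=50):
    """Stage 2: a richer scorer, but only over hundreds of candidates."""
    scores = item_embeddings[candidates] @ user_vec
    return candidates[np.argsort(scores)[::-1][:k]]

def rerank_with_llm(candidates, k=12):
    """Stage 3: placeholder for the expensive reasoning layer, invoked
    only for the final shortlist (simulated here as a pass-through)."""
    return candidates[:k]

user = rng.standard_normal(DIM).astype(np.float32)
shortlist = rerank_with_llm(rank(user, retrieve(user)))
print(len(shortlist))  # the reasoning layer only ever sees ~a dozen items
```

If the LLM stage costs 10,000× more per item than the dot-product stage, restricting it to 12 of 200,000 items is what keeps the blended cost per request viable.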
What makes this work in practice goes back to a formalism from 1933: the multi-armed bandit. Imagine a gambler facing a row of slot machines, each with an unknown payout rate. She wants to maximize her winnings over a night of play. If she always pulls the arm with the highest observed payout, she might miss a better machine she never tried. If she explores too much, she wastes money on losers. The mathematics of this exploration–exploitation tradeoff is captured in a single quantity, regret:
$$ R(T) = \mu^* \cdot T - \sum_{t=1}^{T} \mu(a_t) $$

Here μ* is the best possible average reward, and μ(aₜ) is the reward from whatever arm she actually pulled at time t. Total regret is how much she left on the table by not knowing the optimal choice in advance. The goal of every multi-armed bandit algorithm in recommender systems is to drive this quantity sublinear in T — to learn fast enough that the cost of exploration vanishes relative to the horizon.
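A small simulation makes the sublinear behavior concrete. This sketch uses UCB1, a standard bandit algorithm whose regret grows only logarithmically in T; the five arm payout rates are invented for illustration and are not from the Netflix talk.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 5 arms with unknown Bernoulli payout rates.
true_means = np.array([0.10, 0.30, 0.50, 0.65, 0.70])
n_arms, T = len(true_means), 20_000

counts = np.zeros(n_arms)     # pulls per arm
estimates = np.zeros(n_arms)  # running mean reward per arm
regret = 0.0

for t in range(T):
    if t < n_arms:
        arm = t  # pull each arm once to initialize estimates
    else:
        # UCB1: add an optimism bonus that shrinks as an arm is pulled more
        ucb = estimates + np.sqrt(2 * np.log(t + 1) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # update mean
    regret += true_means.max() - true_means[arm]  # accumulate mu* - mu(a_t)

print(f"total regret R(T) after {T} pulls: {regret:.1f}")
```

Because the exploration bonus decays, cumulative regret ends up a small fraction of the μ*·T reward a clairvoyant player would collect — the "cost of exploration vanishes relative to the horizon" in code.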
Netflix’s recommendation algorithm architecture runs this optimization across three computation layers. Offline systems crunch terabytes of viewing history to train deep collaborative filtering models, a process that takes hours and happens on a schedule. Nearline systems update user embeddings seconds after a click, keeping the recommendations fresh without the cost of full retraining. Online systems respond to each page load in milliseconds, combining the precomputed signals with real-time context like time of day and device type. The architecture is a latency-cost tradeoff: deep analysis happens in batch, while the user-facing layer stays fast.
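The three layers can be caricatured in a few functions. This is a toy sketch of the division of labor described above, not Netflix's implementation: the embedding dimension, update rule, and the context boost values are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16

# Offline (batch, hours): precomputed item embeddings from deep models.
item_embeddings = {f"title_{i}": rng.standard_normal(DIM) for i in range(100)}

def nearline_update(user_vec, clicked_vec, lr=0.1):
    """Nearline (seconds): nudge the user embedding toward the last click,
    refreshing recommendations without retraining anything."""
    return (1 - lr) * user_vec + lr * clicked_vec

def online_score(user_vec, item_vec, context_boost=0.0):
    """Online (milliseconds): a dot product over precomputed vectors,
    plus a cheap adjustment from real-time context."""
    return float(user_vec @ item_vec) + context_boost

user = rng.standard_normal(DIM)
user = nearline_update(user, item_embeddings["title_7"])  # user just clicked

# Hypothetical real-time signal: e.g., boost short titles on mobile at night.
context = {"title_3": 0.5, "title_42": 0.25}
scores = {name: online_score(user, vec, context.get(name, 0.0))
          for name, vec in item_embeddings.items()}
top = max(scores, key=scores.get)
```

The expensive work (training, stored in `item_embeddings`) happens offline; the per-request path is one dot product per candidate plus an addition, which is what keeps page loads in milliseconds.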
Not everyone is convinced the algorithms are making us better off. My own analysis of social media success prediction found that sophisticated language models often just memorize temporal patterns rather than learning what actually makes content good. They learn the news cycle, not the news.
The risk is that we build hybrid recommender systems that are technically brilliant but experientially hollow, engineering away the serendipity that made discovery meaningful in the first place. The recommender is becoming a curator, and the curator is becoming an agent. The architecture will keep evolving — foundation models for recommendations, reinforcement learning from human feedback applied to discovery, inference costs that continue their 10× annual decline — but the open question for 2026 is whether we want to be the curators of our own lives, or merely consumers of an optimized feed.
Slides courtesy of “A Multi-Armed Bandit Framework for Recommendations at Netflix” by Jaya Kawale, Netflix.