Hyperscalers spent over $350 billion on AI infrastructure in 2025 alone, with projections exceeding $500 billion in 2026. The trillion-dollar question is not whether machines can reason, but whether anyone can afford to let them. Hybrid recommender systems sit at the center of this tension. Large Language Models promised to transform how Netflix suggests your next show or how Spotify curates your morning playlist. Instead, the industry has split into two parallel universes, divided not by capability but by cost.
On one side sits what engineers call the “classical stack”: matrix factorization, two-tower embedding models, and contextual bandits. These methods respond in microseconds, scale linearly with users, and run on nothing more complicated than dot products. A query costs a fraction of a cent. On the other side is the “agentic stack”: LLM-based reasoning engines that can handle requests like “find me a sci-fi movie that feels like Blade Runner but was made in the 90s.” This second approach consumes thousands of tokens per recommendation. The cost difference is not incremental; it is orders of magnitude. The economics of LLM inference, more than any algorithmic breakthrough, are now the dominant force shaping recommender architecture.
The 2026 consensus is a hybrid architecture: use the cheap, fast models for candidate generation from millions of items, then invoke the expensive reasoning layer only for the final dozen items a user actually sees. This “funnel” pattern — retrieval, then ranking, then re-ranking — is the only way to make the economics work. The smartest model is reserved for the fewest items.
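The funnel can be sketched in a few lines. Everything here is illustrative — the catalog size, embedding dimension, stage widths, and function names are assumptions, and the LLM re-ranker is stubbed out as a pass-through — but the shape is the point: each stage is more expensive per item and sees fewer items.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical catalog: 200k items with 32-dim embeddings (illustrative sizes).
N_ITEMS, DIM = 200_000, 32
item_embeddings = rng.standard_normal((N_ITEMS, DIM)).astype(np.float32)

def retrieve(user_vec, k=500):
    """Stage 1: cheap dot-product retrieval over the full catalog."""
    scores = item_embeddings @ user_vec
    return np.argpartition(scores, -k)[-k:]  # top-k indices, unsorted

def rank(user_vec, candidates, k=50):
    """Stage 2: a richer scorer, but only over hundreds of candidates."""
    scores = item_embeddings[candidates] @ user_vec
    return candidates[np.argsort(scores)[::-1][:k]]

def rerank_with_llm(candidates, k=12):
    """Stage 3: placeholder for the expensive reasoning layer, invoked
    only for the final shortlist (simulated here as a pass-through)."""
    return candidates[:k]

user = rng.standard_normal(DIM).astype(np.float32)
shortlist = rerank_with_llm(rank(user, retrieve(user)))
print(len(shortlist))  # the reasoning layer only ever sees ~a dozen items
```

If the LLM stage costs 10,000× more per item than the dot-product stage, restricting it to 12 of 200,000 items is what keeps the blended cost per request viable.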
What makes this work in practice goes back to a formalism from 1933: the multi-armed bandit. Imagine a gambler facing a row of slot machines, each with an unknown payout rate. She wants to maximize her winnings over a night of play. If she always pulls the arm with the highest observed payout, she might miss a better machine she never tried. If she explores too much, she wastes money on losers. The mathematics of this exploration–exploitation tradeoff is captured in a single quantity, regret:
$$ R(T) = \mu^* \cdot T - \sum_{t=1}^{T} \mu(a_t) $$

Here μ* is the best possible average reward, and μ(aₜ) is the reward from whatever arm she actually pulled at time t. Total regret is how much she left on the table by not knowing the optimal choice in advance. The goal of every multi-armed bandit algorithm in recommender systems is to drive this quantity sublinear in T — to learn fast enough that the cost of exploration vanishes relative to the horizon.
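A small simulation makes the sublinear behavior concrete. This sketch uses UCB1, a standard bandit algorithm whose regret grows only logarithmically in T; the five arm payout rates are invented for illustration and are not from the Netflix talk.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 5 arms with unknown Bernoulli payout rates.
true_means = np.array([0.10, 0.30, 0.50, 0.65, 0.70])
n_arms, T = len(true_means), 20_000

counts = np.zeros(n_arms)     # pulls per arm
estimates = np.zeros(n_arms)  # running mean reward per arm
regret = 0.0

for t in range(T):
    if t < n_arms:
        arm = t  # pull each arm once to initialize estimates
    else:
        # UCB1: add an optimism bonus that shrinks as an arm is pulled more
        ucb = estimates + np.sqrt(2 * np.log(t + 1) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # update mean
    regret += true_means.max() - true_means[arm]  # accumulate mu* - mu(a_t)

print(f"total regret R(T) after {T} pulls: {regret:.1f}")
```

Because the exploration bonus decays, cumulative regret ends up a small fraction of the μ*·T reward a clairvoyant player would collect — the "cost of exploration vanishes relative to the horizon" in code.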
Netflix’s recommendation algorithm architecture runs this optimization across three computation layers. Offline systems crunch terabytes of viewing history to train deep collaborative filtering models, a process that takes hours and happens on a schedule. Nearline systems update user embeddings seconds after a click, keeping the recommendations fresh without the cost of full retraining. Online systems respond to each page load in milliseconds, combining the precomputed signals with real-time context like time of day and device type. The architecture is a latency-cost tradeoff: deep analysis happens in batch, while the user-facing layer stays fast.
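The three layers can be caricatured in a few functions. This is a toy sketch of the division of labor described above, not Netflix's implementation: the embedding dimension, update rule, and the context boost values are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16

# Offline (batch, hours): precomputed item embeddings from deep models.
item_embeddings = {f"title_{i}": rng.standard_normal(DIM) for i in range(100)}

def nearline_update(user_vec, clicked_vec, lr=0.1):
    """Nearline (seconds): nudge the user embedding toward the last click,
    refreshing recommendations without retraining anything."""
    return (1 - lr) * user_vec + lr * clicked_vec

def online_score(user_vec, item_vec, context_boost=0.0):
    """Online (milliseconds): a dot product over precomputed vectors,
    plus a cheap adjustment from real-time context."""
    return float(user_vec @ item_vec) + context_boost

user = rng.standard_normal(DIM)
user = nearline_update(user, item_embeddings["title_7"])  # user just clicked

# Hypothetical real-time signal: e.g., boost short titles on mobile at night.
context = {"title_3": 0.5, "title_42": 0.25}
scores = {name: online_score(user, vec, context.get(name, 0.0))
          for name, vec in item_embeddings.items()}
top = max(scores, key=scores.get)
```

The expensive work (training, stored in `item_embeddings`) happens offline; the per-request path is one dot product per candidate plus an addition, which is what keeps page loads in milliseconds.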
Not everyone is convinced the algorithms are making us better off. My own analysis of social media success prediction found that sophisticated language models often just memorize temporal patterns rather than learning what actually makes content good. They learn the news cycle, not the news.
The risk is that we build hybrid recommender systems that are technically brilliant but experientially hollow, engineering away the serendipity that made discovery meaningful in the first place. The recommender is becoming a curator, and the curator is becoming an agent. The architecture will keep evolving — foundation models for recommendations, reinforcement learning from human feedback applied to discovery, inference costs that continue their 10× annual decline — but the open question for 2026 is whether we want to be the curators of our own lives, or merely consumers of an optimized feed.
Slides courtesy of “A Multi-Armed Bandit Framework for Recommendations at Netflix” by Jaya Kawale, Netflix.