Latest Posts

The Variance Tax

Let’s say your portfolio returned +60% in 2024, then fell 40% in 2025. That’s an average annual return of +10%. Actual return after two years: minus 4% (i.e. $100 × 1.6 × 0.6 = $96).

That 12-point gap between the +10% average and the roughly −2% annualized compound return is what we call the variance tax, also known as variance drain or volatility drag, and it’s one of the most underappreciated forces in investing.

Take any series of returns with arithmetic mean μ and volatility σ. The compound growth rate, the one that actually determines your wealth, is approximately:

$$G \approx \mu - \tfrac{1}{2}\sigma^2$$

This comes from a second-order Taylor expansion of ln(1+r). Take expectations, and the mean log return equals the arithmetic mean minus half the variance. Everything else drops out. Half the variance: that is the tax. The same correction term appears when you solve geometric Brownian motion via Itô’s lemma (the drift of log(S) is μ − σ²/2, not μ), so whether you come at it from discrete compounding or continuous-time stochastic calculus, you land in the same place. And because the tax is quadratic, doubling volatility does not double the cost; it quadruples it. If covid taught us anything at all, it is that we generally have a hard time reasoning about exponential growth.

Chart: the variance tax as a quadratic curve ½σ², with labeled data points for Bonds (5% vol, 0.1% drain), S&P 500 (16%, 1.3%), Nasdaq (22%, 2.4%), Emerging Markets (25%, 3.1%), 2x Leveraged S&P (32%, 5.1%), 3x Leveraged S&P (48%, 11.5%), and Bitcoin (60%, 18%).

Treasury bonds at 5% vol pay about 0.1% per year in variance drain. Barely noticeable. The S&P 500 at 16% vol pays 1.3%. A 3x leveraged ETF at 48% vol pays 11.5%. Jacquier, Kane, and Marcus (2003) studied S&P 500 returns from 1926 to 2001: arithmetic mean 12.49%, geometric mean 10.51%, a gap of 1.98 percentage points. The formula predicts ½ × 0.203² = 2.06%.
Variance drain by asset class:

| Asset class | Volatility | Variance drain |
| --- | --- | --- |
| US Bonds | 5% | 0.1% |
| S&P 500 | 16% | 1.3% |
| Nasdaq | 22% | 2.4% |
| Emerging Markets | 25% | 3.1% |
| 2x Leveraged S&P | 32% | 5.1% |
| 3x Leveraged S&P | 48% | 11.5% |

Looking at the last row, tripling leverage triples the arithmetic return but delivers nearly the same compound return as 2x. The linear gain gets eaten by the quadratic penalty.

Chart: $100 invested at a 10% arithmetic return over 30 years at four volatility levels: 0% vol reaches $1,745, 15% vol reaches $1,280, 30% vol reaches $498, and 50% vol loses most of the original investment.

Same 10% arithmetic return, different volatility. After 30 years, the zero-volatility path reaches $1,745. At 15% vol, $1,280. At 30%, $498. At 50% vol you have lost more than half your money despite averaging +10% per year.
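To see the approximation in action, here is a minimal Python sketch (the `growth_stats` helper is mine, not from the post) that computes the arithmetic mean, the exact compound growth rate, and the μ − ½σ² approximation for any return series, applied to the opening +60%/−40% example:

```python
import numpy as np

def growth_stats(returns):
    """Arithmetic mean, exact compound growth per period, and the variance-tax approximation."""
    r = np.asarray(returns, dtype=float)
    mu = r.mean()
    var = r.var()                              # population variance
    exact = np.exp(np.log1p(r).mean()) - 1.0   # true geometric growth rate
    approx = mu - 0.5 * var                    # G ~ mu - sigma^2 / 2
    return mu, exact, approx

# The opening example: +60% then -40%
mu, exact, approx = growth_stats([0.60, -0.40])
print(f"arithmetic {mu:+.1%}  compound {exact:+.2%}  approx {approx:+.2%}")
# arithmetic +10.0%  compound -2.02%  approx -2.50%
```

The second-order approximation overstates the drag slightly here because ±50% swings are far outside the small-return regime where the Taylor expansion is tight; for typical annual equity volatilities the two numbers are much closer.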

Now apply leverage. If you lever an asset by factor L, the arithmetic return scales linearly (Lμ) but the variance drain scales quadratically (½L²σ²). The compound return becomes:

$$G(L) \approx r + L(\mu - r) - \tfrac{1}{2}L^2\sigma^2$$

Take the derivative, set to zero. The leverage that maximizes compound wealth:

$$L^{\ast} = (μ − r) / σ²$$

For the S&P 500 with roughly 7% excess return and 16% vol, L* comes out to about 2.7x.

Chart: the leverage curve for S&P 500 parameters, with compound return peaking at the Kelly-optimal leverage L* = 2.7x and labeled points at 1x, 2x, and 3x. Returns decline beyond the Kelly optimum and eventually turn negative.

This is the Kelly criterion. You might know it from utility theory or gambling heuristics, but as we see here it falls straight out of the variance tax formula. Beyond Kelly, every dollar of additional leverage costs more in variance drain than it earns in expected return. The curve bends over and eventually goes negative. In practice, most practitioners use “half-Kelly”, sizing positions at L*/2, because the formula assumes you know μ and σ precisely, and you don’t. Estimation error in either parameter can push you past the peak and onto the losing side of the curve. Half-Kelly sacrifices roughly 25% of the theoretical growth rate but dramatically reduces drawdown risk.

Figure: extract of the ProShares UltraPro S&P 500 factsheet (total return).

You can see this play out in practice. ProShares UPRO, the 3x S&P 500 ETF, has returned roughly 28% annualized over the past decade during one of the strongest bull markets in history. The S&P 500 compounded at about 10% over the same period, so linear 3x leverage would imply roughly 30%. Variance drain accounts for the gap, and that was in a favorable environment. In 2022, when the S&P fell about 19%, UPRO dropped 70%.
The effect is even starker in higher-volatility underlyings: ProShares TQQQ, the 3x Nasdaq-100 ETF, sat roughly flat from its 2021 highs through early 2025 while the unlevered QQQ had long since recovered — a textbook case of variance drain overwhelming the leverage premium in a choppy market.
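A quick numerical check of the leverage math, taking the cash rate r as 0 for simplicity (an assumption; the post doesn’t specify r):

```python
def growth(L, mu_excess, sigma, r=0.0):
    """Compound growth under leverage: G(L) = r + L*(mu - r) - 0.5 * L^2 * sigma^2."""
    return r + L * mu_excess - 0.5 * L**2 * sigma**2

mu_excess, sigma = 0.07, 0.16      # S&P parameters from the text
L_star = mu_excess / sigma**2      # Kelly-optimal leverage
print(f"L* = {L_star:.2f}x")       # L* = 2.73x

# Half-Kelly keeps exactly 75% of the optimal excess growth rate (with r = 0)
full = growth(L_star, mu_excess, sigma)
half = growth(L_star / 2, mu_excess, sigma)
print(f"half-Kelly / full-Kelly growth: {half / full:.2f}")   # 0.75
```

The 0.75 ratio is where the “half-Kelly sacrifices roughly 25% of the theoretical growth rate” figure comes from: with r = 0, G(L*/2) = ¾ · G(L*) exactly, because the growth curve is a parabola in L.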

The same half-sigma-squared shows up across finance. It is why stock prices follow log-normal distributions, not normal ones. Why put options cost more than equidistant calls. Why the Black-Scholes d₁ and d₂ terms carry a ½σ²t adjustment. Why a $100 stock’s true geometric midpoint between $150 and $50 is not $100 but $86.60: √(150 × 50) = 86.60, the point equidistant in log space, since ln(150/86.60) = ln(86.60/50). Wherever returns compound and volatility is nonzero, the variance tax is being collected.
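The $86.60 figure is just the geometric mean of the two prices, and the equal log distances are easy to verify:

```python
import math

lo, hi = 50.0, 150.0
mid = math.sqrt(lo * hi)        # geometric midpoint
print(f"{mid:.2f}")             # 86.60
# Equidistant in log space: the up-move and the down-move are the same log return
print(math.isclose(math.log(hi / mid), math.log(mid / lo)))   # True
```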

Claude Opus 4.6: Anthropic's New Flagship AI Model for Agentic Coding

Anthropic just released Claude Opus 4.6, the latest frontier AI model in the Claude family. It’s a significant upgrade over Opus 4.5 and arguably the most agentic-focused LLM release we’ve seen from any lab this year.

Key upgrades: better agentic AI coding capabilities (plans more carefully, sustains longer tasks, catches its own mistakes), a 1M token context window (a first for Opus-class models), and 128K output tokens. Pricing holds at $5/$25 per million tokens.

Image: Claude Opus 4.6 release announcement on claude.ai.

LLM Benchmark Results: How Claude Opus 4.6 Compares

The benchmark numbers are strong across the board. Opus 4.6 hits state-of-the-art on Terminal-Bench 2.0 (65.4%, agentic coding in the terminal), Humanity’s Last Exam (complex multidisciplinary reasoning), and BrowseComp (agentic web search). It beats GPT-5.2 by roughly 144 Elo points on GDPval-AA, the benchmark that measures real-world knowledge work across 44 professional occupations.

Chart: GDPval-AA Elo comparison for real-world knowledge work: Claude Opus 4.6 at 1,606 vs GPT-5.2 at 1,462 vs Claude Opus 4.5 at 1,416.

The standout is ARC-AGI-2, which tests abstract reasoning on problems easy for humans but hard for AI. Opus 4.6 scores 68.8%, a dramatic leap from Opus 4.5’s 37.6%. For comparison, GPT-5.2 scores 54.2% and Gemini 3 Pro hits 45.1%. That gap matters because ARC-AGI-2 resists memorization: it measures whether models can actually generalize.

On coding-specific evaluations, Terminal-Bench 2.0 rises to 65.4% (from 59.8% for Opus 4.5), and OSWorld for agentic computer use jumps from 66.3% to 72.7%, putting Opus ahead of both GPT-5.2 and Gemini 3 Pro on those particular tests. SWE-bench Verified shows a small regression, worth watching, though the model excels on the benchmarks that better reflect real production work.

Chart: Claude Opus 4.6 benchmark comparison: SOTA on Terminal-Bench 2.0, Humanity’s Last Exam, BrowseComp, and GDPval-AA, with 90.2% on BigLaw Bench.

What Can You Do With a 1 Million Token Context Window?

The 1M context window paired with the new context compaction feature is the most practically interesting upgrade. To put it in perspective: 1M tokens is on the order of 750,000 words, enough to hold several novels, an enterprise codebase of several thousand files, or a substantial legal discovery set in a single prompt.

Compaction automatically summarizes older context when approaching limits, which means agents can theoretically run indefinitely without hitting the wall that’s plagued long-running AI tasks. Combined with the model’s improved ability to catch its own mistakes through better code review and debugging, you’re looking at agents that can actually finish what they start.

The long-context retrieval jump tells the story. On MRCR v2, which tests whether a model can find and reason over specific facts buried in massive prompts, Opus 4.6 scores 76% compared to Sonnet 4.5’s 18.5%. That’s not an incremental improvement; it’s a different capability class.

Chart: long-context retrieval on the MRCR v2 needle-in-a-haystack reasoning test: Claude Opus 4.6 at 76% vs Claude Sonnet 4.5 at 18.5%.

That said, bigger context doesn’t automatically mean better. Research from Factory.ai and others shows attention degrades across very long sequences, and prefill latency at 1M tokens can exceed two minutes before you get your first output token. The premium pricing tier for prompts exceeding 200K tokens ($10/$37.50) reflects this cost: Anthropic isn’t subsidizing power users anymore. The real question for enterprise deployments is whether stuffing your entire codebase into context beats a well-designed RAG pipeline. The answer, as usual, depends on the use case.

Agentic AI Coding: Agent Teams and Claude Code Updates

The headline numbers impress, but the real story is the agentic focus. Anthropic isn’t just making Claude smarter. They’re making it more useful for the actual work people want AI to do: sustained, multi-step tasks in large codebases.

New API features reinforce this direction: adaptive thinking lets the model decide when to reason deeper based on contextual cues, effort controls give developers fine-grained tradeoffs between intelligence, speed, and cost (low/medium/high/max), and context compaction keeps long-running agents within limits without manual intervention.

Claude Code gets the headline feature: Agent Teams that work in parallel. Multiple subagents can coordinate autonomously on read-heavy work like codebase reviews, with each agent handling a different branch via git worktrees before merging back. This ships as a research preview, but it’s clearly aimed at the production workflows where agentic coding tools like Cursor, GitHub Copilot, and OpenAI’s Codex are competing hard. The timing isn’t accidental — Apple just announced Xcode 26.3 with native support for Claude Agent and OpenAI’s Codex via MCP (Model Context Protocol), making agentic coding a standard part of the developer toolchain rather than an experiment.

Enterprise Deployment: Why GDPval-AA Matters

The GDPval-AA benchmark deserves special attention because it measures performance on real-world knowledge work — not toy problems or academic puzzles. Beating GPT-5.2 by 144 Elo points (and Opus 4.5 by 190) suggests meaningful improvements in the tasks that matter for enterprise AI adoption: financial analysis, legal reasoning, and multi-step professional workflows.

The product expansions signal where Anthropic sees the market going. Claude in Excel now handles long-running tasks and unstructured data. Claude in PowerPoint reads layouts and slide masters for brand consistency. These aren’t research demos — they’re enterprise-ready integrations designed for knowledge workers who need AI that fits into existing toolchains.

For teams evaluating which frontier model to standardize on, the picture is nuanced. Claude Opus 4.6 leads on agentic coding, enterprise knowledge work, and, as of this release, abstract reasoning on ARC-AGI-2. GPT-5.2 still holds advantages in math. Gemini 3 Pro offers the best cost efficiency and multimodal processing with its own 1M context window. The multi-model workflow trend is real: the smartest enterprise teams aren’t picking one model; they’re routing tasks to whichever model handles them best.

Safety Profile and the Zero-Day Question

One detail worth noting: the safety profile. Anthropic claims Opus 4.6 is “just as well-aligned as Opus 4.5, which was the most-aligned frontier model to date.” Given the enhanced cybersecurity capabilities — Opus 4.6 independently discovered over 500 previously unknown zero-day vulnerabilities in open-source code during Anthropic’s pre-release testing — they developed six new detection probes specifically for this release.

Whether that’s reassuring or concerning depends on your priors about AI capabilities research. The vulnerabilities ranged from system-crashing bugs to memory corruption flaws in widely-used tools like GhostScript and OpenSC. As Logan Graham, head of Anthropic’s frontier red team, put it: it’s a race between defenders and attackers, and Anthropic wants defenders to have the tools first.

What This Means for the Competitive Landscape

The competitive landscape just got more interesting. GPT-5.2 and Gemini 3 Pro now have a new benchmark to chase, and Anthropic has clearly staked its claim on agentic coding as the primary battleground. With pricing unchanged at $5/$25 per million tokens — significantly more expensive than GPT-5.2 at $2/$10 but competitive for the performance tier — the value proposition comes down to whether the agentic improvements translate to fewer retries, less hand-holding, and faster task completion in your specific workflow.

For developers, the move is straightforward: swap in claude-opus-4-6 via the API and test it on your hardest tasks. For enterprise decision makers, the GDPval-AA results and Agent Teams feature are worth a serious evaluation cycle. The model is available now on claude.ai, the API, and all major cloud platforms (AWS Bedrock, Azure Foundry, GCP Vertex AI).

Buying the Haystack Might Not Work This Year

I’ve been reading the January 2026 state of markets reports from Andreessen Horowitz and AQR, and their conclusions on the AI bubble question in 2026 are almost impossible to reconcile.

The a16z view is straightforward: AI fundamentals are real, and current prices reflect that reality. Their evidence is compelling. The top 50 private AI companies now generate $40.6 billion in annual revenue. Companies like ElevenLabs and Cursor are hitting $100 million ARR faster than Slack or Twilio ever did. GPUs are running at 80% utilization, compared to the 7% utilization rate for fiber optic cables during the dotcom bubble. This isn’t speculation, they argue. It’s demand exceeding supply.

Chart: GPU utilization at 80% in AI datacenters versus roughly 7% fiber-optic utilization during the early-2000s dotcom bubble.

AQR looks at the same market and sees something else entirely. Their capital market assumptions put the U.S. CAPE ratio at the 96th percentile since 1980. Expected real returns for U.S. large cap equities over the next 5-10 years? 3.9%. For a global 60/40 portfolio, just 3.4%, well below the long-term average of roughly 5% since 1900. Risk premia, in their framework, are compressed across nearly every asset class. The narrative doesn’t enter their models.

Chart: AQR medium-term expected real returns: U.S. equities at 3.9%, non-U.S. developed at 5.3%, and a global 60/40 at 3.4%.

a16z points to earnings growth. The market rally hasn’t been driven by multiple expansion, they note, but by actual EPS growth. Tech P/E multiples sit around 30-35x, elevated but nowhere near the 70-80x of 2000. Tech margins have “lapped the field” at 25%+ compared to 5-8% for the rest of the S&P 500. The fundamentals, they insist, are doing the work.
Chart: earnings multiples are high but nowhere near dotcom levels: large cap tech trailing P/E around 30-35x today versus 70-80x in 2000.

Chart: tech margins have lapped the field: Tech and Interactive Media at 25%+ compared to 5-8% for the rest of the S&P 500.

AQR’s response would be that fundamentals always look good near peaks. Their research shows a 50% probability that realized equity returns will miss estimates by more than 3 percentage points annually over the next decade. Compressed premia don’t announce themselves with blaring headlines. They just quietly erode returns until investors notice they’ve been running in place.

Cumulative hyperscaler capex is projected to reach $4.8 trillion by 2030. To achieve a 10% hurdle rate on that investment, AI revenue needs to hit roughly $1 trillion annually by 2030, about 1% of global GDP excluding China. Goldman Sachs estimates that $9 trillion in revenue could flow from the AI buildout, which at 20% margins and a 22x P/E multiple would create $35 trillion in new market cap. Only about $24 trillion has been pulled forward so far, leaving $11 trillion “on the table.”

Chart: required AI-enabled revenue to meet return-on-capital targets: cumulative AI investment reaching $4.8 trillion by 2030 requires roughly $1 trillion in annual AI revenue.

Or not. AQR would point out that the expected return for U.S. buyouts, private equity’s bread and butter, is now 4.2%. That’s barely above the 3.9% for public large caps. The illiquidity premium has essentially vanished. If sophisticated PE firms can’t find excess returns, why should AI capex be different?

I find myself uncertain, which feels like the more honest position. Neither source is disinterested. a16z manages billions in venture capital and growth equity; bullish AI narratives support their portfolio valuations and fundraising. AQR runs systematic strategies that benefit when investors diversify away from concentrated U.S. tech exposure toward international equities and alternatives. Both are talking their book, which doesn’t make either wrong, but it’s worth noting.

The a16z data on utilization and revenue growth is hard to dismiss. 80% GPU utilization isn’t vaporware. Harvey users nearly tripled their time on the platform in nine months. Navan’s AI handles half of all customer interactions at satisfaction levels matching human agents. These are real products generating real engagement. But AQR’s valuation work has a longer track record. Their models don’t care about narratives, and historically that discipline has been valuable. When they say U.S. equities offer the lowest expected returns among major markets, that’s not pessimism. It’s arithmetic.

The reconciliation might be this: AI winners could thrive spectacularly while broad market indices disappoint. a16z’s portfolio companies operate in a different universe than the average S&P 500 constituent. Compressed risk premia can coexist with individual companies generating enormous returns. The question is whether you’re buying the index or picking the winners.

Non-U.S. developed markets, by the way, offer expected returns of around 5%, versus 3.9% for U.S. large caps. The valuation gap is real even if the AI story is true.

Chart: AQR expected local real returns for equities: U.S. Large at 3.9%, Eurozone at 5.0%, UK at 4.9%, Japan at 4.9%.

Bandits and Agents: Netflix and Spotify Recommender Stacks in 2026

Hyperscalers spent over $350 billion on AI infrastructure in 2025 alone, with projections exceeding $500 billion in 2026. The trillion-dollar question is not whether machines can reason, but whether anyone can afford to let them. Hybrid recommender systems sit at the center of this tension. Large Language Models promised to transform how Netflix suggests your next show or how Spotify curates your morning playlist. Instead, the industry has split into two parallel universes, divided not by capability but by cost.

On one side sits what engineers call the “classical stack”: matrix factorization, two-tower embedding models, and contextual bandits. These methods respond in microseconds, scale linearly with users, and run on nothing more complicated than dot products. A query costs a fraction of a cent. On the other side is the “agentic stack”: LLM-based reasoning engines that can handle requests like “find me a sci-fi movie that feels like Blade Runner but was made in the 90s.” This second approach consumes thousands of tokens per recommendation. The cost difference is not incremental; it is orders of magnitude. LLM inference cost economics, more than any algorithmic breakthrough, is now the dominant force shaping recommender architecture.

The 2026 consensus is a hybrid architecture: use the cheap, fast models for candidate generation from millions of items, then invoke the expensive reasoning layer only for the final dozen items a user actually sees. This “funnel” pattern — retrieval, then ranking, then re-ranking — is the only way to make the economics work. The smartest model is reserved for the fewest items.
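The funnel economics can be sketched with a toy cost model. Every per-item cost and stage size below is an illustrative assumption of mine, not a published figure from any provider:

```python
# Toy cost model for the retrieval -> ranking -> re-ranking funnel.
# (stage name, items scored at that stage, assumed $ cost per item)
stages = [
    ("retrieval (two-tower dot products)", 1_000_000, 1e-7),
    ("ranking (lightweight scorer)",           1_000, 1e-5),
    ("re-ranking (LLM reasoning)",                12, 1e-2),
]

total = 0.0
for name, items, cost_per_item in stages:
    stage_cost = items * cost_per_item
    total += stage_cost
    print(f"{name}: {items:,} items -> ${stage_cost:.2f}")
print(f"total per request: ${total:.2f}")   # total per request: $0.23
```

Even with the LLM touching only the final dozen items, it still accounts for over half the per-request cost in this sketch, which is exactly why the smartest model is reserved for the fewest items.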

What makes this work in practice goes back to a formalism from 1933: the multi-armed bandit. Imagine a gambler facing a row of slot machines, each with an unknown payout rate. She wants to maximize her winnings over a night of play. If she always pulls the arm with the highest observed payout, she might miss a better machine she never tried. If she explores too much, she wastes money on losers. The mathematics of this exploration–exploitation tradeoff define regret:

$$ R(T) = \mu^* \cdot T - \sum_{t=1}^{T} \mu(a_t) $$

Here μ* is the best possible average reward, and μ(aₜ) is the reward from whatever arm she actually pulled at time t. Total regret is how much she left on the table by not knowing the optimal choice in advance. The goal of every multi-armed bandit algorithm in recommender systems is to drive this quantity sublinear in T, to learn fast enough that the cost of exploration vanishes relative to the horizon.

Diagram: a multi-armed bandit recommender system: a Learner taking Actions and receiving Rewards from an Environment, with the goal of maximizing cumulative reward or minimizing cumulative regret.

The three main exploration strategies each take a different approach: epsilon-greedy adds random noise to avoid getting stuck; Upper Confidence Bound (UCB) prefers actions with uncertain values; Thompson Sampling selects actions according to the probability they are optimal. In practice, Thompson Sampling tends to outperform the others because its exploration is guided by posterior uncertainty rather than arbitrary randomness: it explores where it matters most.

Diagram: principles of exploration in recommender systems: naive exploration (ε-greedy), optimism in the face of uncertainty (UCB), and probability matching (Thompson Sampling).

Every recommendation you see on Netflix’s homepage is the output of an algorithm trying to minimize exactly this quantity, whether it realizes it or not.
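To make the regret objective concrete, here is a self-contained sketch of Beta-Bernoulli Thompson Sampling on three hypothetical “titles”. The arm probabilities and horizon are made up for illustration; this is the textbook algorithm, not any production system’s code:

```python
import numpy as np

def thompson_sampling(true_p, T, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns cumulative (pseudo-)regret after T rounds."""
    rng = np.random.default_rng(seed)
    k = len(true_p)
    alpha = np.ones(k)    # Beta posterior per arm, starting from the uniform Beta(1, 1)
    beta = np.ones(k)
    best = max(true_p)
    regret = 0.0
    for _ in range(T):
        # Sample a plausible mean for each arm from its posterior, play the argmax
        a = int(np.argmax(rng.beta(alpha, beta)))
        reward = float(rng.random() < true_p[a])   # Bernoulli play-through
        alpha[a] += reward
        beta[a] += 1.0 - reward
        regret += best - true_p[a]                 # gap to the best arm
    return regret

arms = [0.05, 0.10, 0.20]   # hypothetical play-through rates
T = 20_000
r = thompson_sampling(arms, T)
# Sublinear growth: far below the ~1,667 expected regret of uniformly random arm choice
print(f"cumulative regret: {r:.0f} over {T} rounds")
```

Swapping the posterior sampling for the argmax of empirical means gives the greedy policy; adding an ε chance of a random arm gives ε-greedy. Both can be dropped into the same loop for comparison.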

Netflix’s recommendation architecture runs this optimization across three computation layers. Offline systems crunch terabytes of viewing history to train deep collaborative filtering models, a process that takes hours and happens on a schedule. Nearline systems update user embeddings seconds after a click, keeping recommendations fresh without the cost of full retraining. Online systems respond to each page load in milliseconds, combining the precomputed signals with real-time context like time of day and device type. The architecture is a latency-cost tradeoff: deep analysis happens in batch, while the user-facing layer stays fast.

Diagram: Netflix recommendation architecture: Member Activity and Contextual Information flow through an Offline System for model training, then to an Online System where the multi-armed bandit produces recommendations.

What Netflix learned from a decade of experimentation is counterintuitive. The goal is not to recommend what users will definitely watch, but what they would not have found on their own. They call this “incrementality.” A greedy algorithm that always surfaces the highest-probability titles just confirms what users already knew; it exploits without exploring, and in doing so collapses the discovery space. A better approach is to measure the causal effect of the recommendation: how much does showing this thumbnail increase the probability of a play compared to not showing it? Some titles have low baseline interest but high incrementality. Those are the ones worth featuring. This is the exploration–exploitation tradeoff made concrete: the value of a recommendation is not its predicted rating, but its marginal contribution to discovery.
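The incrementality idea reduces to a difference in play probabilities between members shown a title and a randomized holdout who were not. A toy sketch with made-up counts, mirroring the “Title A vs Title C” pattern from the scatter plot (none of these numbers are real data):

```python
# Hypothetical impression logs per title: (plays, impressions) when the title was
# featured vs. a randomized holdout where it was not. All counts are invented.
titles = {
    "A": {"shown": (120, 1000), "holdout": (30, 1000)},   # low baseline, big lift
    "C": {"shown": (400, 1000), "holdout": (380, 1000)},  # high baseline, tiny lift
}

def incremental_lift(t):
    """P(play | featured) - P(play | not featured): the causal value of featuring."""
    (sp, sn), (hp, hn) = t["shown"], t["holdout"]
    return sp / sn - hp / hn

for name, t in titles.items():
    baseline = t["holdout"][0] / t["holdout"][1]
    print(f"Title {name}: baseline {baseline:.0%}, lift {incremental_lift(t):+.0%}")
# Title A: baseline 3%, lift +9%
# Title C: baseline 38%, lift +2%
```

Ranking by predicted play probability features Title C; ranking by incremental lift features Title A, the title viewers would not have found on their own. That inversion is the whole point of the metric.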
Chart: Netflix incrementality analysis: a scatter plot of incremental probability vs baseline probability, where Title A has low baseline but high incremental lift, while Title C has high baseline but benefits less from featuring.

Spotify’s AI DJ takes a different approach to the same problem, using what engineers internally call the “agentic router.” When you ask for “music for a rainy reading session in 1990s Seattle,” the router decides whether to invoke the expensive LLM reasoning layer or just fall back to keyword matching against collaborative filtering embeddings. Complex queries get the big model; simple ones get the fast path. This router is the economic governor of the entire system, an inference cost optimizer disguised as a product feature. Underneath the DJ’s personality, built on Spotify’s Sonantic voice synthesis and LLM-generated contextual narratives, sits a bandit framework called BaRT (Bandits for Recommendations as Treatments) that quietly balances what you know you like against what you might not yet know you need.

Not everyone is convinced the algorithms are making us better off. My own analysis of social media success prediction found that sophisticated language models often just memorize temporal patterns rather than learning what actually makes content good. They learn the news cycle, not the news.

The risk is that we build hybrid recommender systems that are technically brilliant but experientially hollow, engineering away the serendipity that made discovery meaningful in the first place. The recommender is becoming a curator, and the curator is becoming an agent. The architecture will keep evolving — foundation models for recommendations, reinforcement learning from human feedback applied to discovery, inference costs that continue their 10× annual decline — but the open question for 2026 is whether we want to be the curators of our own lives, or merely consumers of an optimized feed.

Slides courtesy of “A Multi-Armed Bandit Framework for Recommendations at Netflix” by Jaya Kawale, Netflix.

Is Private Equity Just Beta With a Lockup?

The pitch used to be simple: accept illiquidity, get rewarded. Lock up your capital for seven years, tolerate capital calls and J-curves, and in exchange you’d earn returns that public markets couldn’t touch. It was the defining bargain of institutional investing for two decades.

AQR’s latest capital market assumptions make for uncomfortable reading if you’re an allocator to private markets. Their expected real return for U.S. buyouts over the next 5-10 years is 4.2%. For U.S. large cap public equities, it’s 3.9%. That’s a 30 basis point premium for accepting years of lockup, unpredictable capital calls, limited transparency, and the very real risk of picking the wrong manager.

Chart: AQR Exhibit 6, expected real returns for private assets: U.S. Buyouts at 4.2%, U.S. Real Estate at 3.1%, and U.S. Private Credit at 2.6%.

Private credit looks even worse. Expected returns dropped 0.5 percentage points year over year as spreads narrowed and base rates came down. The asset class that was supposed to be the sensible alternative to stretched equity valuations now offers less compensation than it did twelve months ago.

This isn’t a temporary dislocation. It’s the logical endpoint of too much capital chasing the same opportunities. When every pension fund, endowment, and sovereign wealth fund decides they need 20-30% allocation to alternatives, the returns that made alternatives attractive get arbitraged away. The money didn’t find alpha. It became beta (with a lockup).

The a16z State of the Markets 2026 report is no less interesting, and its dispersion numbers tell a revealing story. In venture capital, top decile managers generate 31.7% IRR while bottom decile managers return negative 7%. The spread between winners and losers is enormous. But that spread is precisely why average returns have compressed. Access to top-tier funds has always been limited, and everyone else is fighting over what’s left.

Chart: net IRR dispersion by strategy for 2002-2019 vintages, with venture capital’s top decile at 31.7% and bottom decile at negative 7%.

AQR’s framework suggests something that few allocators want to hear: the illiquidity premium might be negative for most investors. If you’re not in the top quartile of manager selection, you’re accepting lockup risk for returns you could approximate in public markets with better liquidity and lower fees.

The counterargument, and it’s a reasonable one, is that private markets offer exposure to companies you simply can’t access in public markets anymore. This part is true. 87% of U.S. companies with more than $100 million in revenue are now private. The top 10 private companies represent 38% of total unicorn valuation, and that share has nearly doubled since 2020. SpaceX, OpenAI, Anthropic, Databricks, Stripe: these are category-defining businesses, and they’re not on any exchange. Share of U.S. companies with annual revenue greater than $100M showing private companies dominate Top 10 private companies represent 38% of total unicorn valuation in 2025, including SpaceX, OpenAI, Anthropic, Databricks, and Stripe But access isn’t the same as returns. You can have exposure to the most exciting companies in the world and still underperform a boring index fund if you pay too much or pick the wrong vintage. The S&P 500 minimum market cap eligibility has tripled since 2019 to $22.7 billion. Companies are staying private longer, which means more value creation happens before public investors get a chance. It also means private investors are paying up for that privilege.

Value creation has moved earlier in the company lifecycle. For IPOs between 2014-2019, only 12% of median value was created in private markets. For 2020-2023 IPOs, that number jumped to 55%. If you want to capture returns from the next generation of important companies, you probably need private market exposure. Return potential has shifted to private markets: median value created in private markets went from 12% for 2014-2019 IPOs to 55% for 2020-2023 IPOs The question really is what you’re paying for it. At 4.2% expected returns versus 3.9% for public equities, you’re paying in liquidity and flexibility for almost nothing in expected return. The premium that justified the allocation model has been competed away. If you’re in the top 5% of venture funds earning 60%+ IRR, none of this applies. For everyone else, the world has moved on.

Britain's Strategic Limbo

The UK is the country with no bloc.

At Davos, Britain refused to join Trump’s Board of Peace, citing commitment to international law and rejection of the “pay-to-play” model. France, Germany, Sweden, Norway made the same choice. The difference is that those countries have somewhere else to go. Britain doesn’t.

The SAFE instrument, the EU’s €150 billion fund for joint defense procurement, is designed explicitly for strategic autonomy. Strict “Buy European” provisions limit non-EU subcontractors to 15-35% of contract value, phased out within two years. Canada, remarkably, negotiated access and now has preferential treatment on par with EU firms. The UK remains excluded.

Talks broke down in late 2025. London viewed the EU’s requirements for third-country participation as an infringement on sovereignty. The same sovereignty concerns that drove Brexit now lock Britain out of the emerging European defense architecture. The “mid-Atlantic bridge” was always a metaphor. Britain positioned itself as the hinge between American power and European integration, useful to both, dependent on neither. That positioning assumed both poles wanted a bridge. Now the US treats allies as protection rackets and the EU is building walls around its defense industrial base. The bridge has nowhere to land.

What does the Starmer government do? The choices were supposed to be theoretical. Align with Washington and accept the transactional terms of the Donroe Doctrine. Align with Brussels and accept the sovereignty constraints of SAFE participation. Or go it alone, with a defense budget that can’t sustain independent capability against peer competitors.

The IISS analysis of SAFE’s implications for non-EU suppliers is blunt: firms outside the bloc face structural disadvantages that compound over time. Procurement cycles last decades. If British defense firms are locked out of European contracts now, the gap widens with each passing year. The industrial base erodes.

“Global Britain” was the slogan after Brexit, a vision of nimble bilateral relationships unconstrained by Brussels bureaucracy. The reality is that global influence requires either hard power or bloc membership. Britain has neither the military budget for the former nor the political will for the latter.

Canada’s pivot is instructive. Facing similar pressure from Washington, Carney diversified, joining SAFE, negotiating with Beijing, building horizontal coalitions with other middle powers. Britain has done none of this. It refused the Board of Peace on principle but hasn’t found an alternative structure to join on pragmatism.

Principles without alternatives is just isolation. The UK is learning what it means to be a middle power without a coalition, morally opposed to the new American order but structurally excluded from the European one.

The Rise of Middle Power Realism

At Davos 2026, Canadian Prime Minister Mark Carney delivered a speech that received something rare at these gatherings: a standing ovation. Carney told the assembled elites what they already knew but hadn’t said aloud: the world is not in a “transition” but a “rupture.”

The speech drew on Václav Havel’s 1978 essay The Power of the Powerless, specifically the parable of the greengrocer who displays the slogan “Workers of the World, Unite!” in his shop window. The grocer doesn’t believe the slogan. He displays it to signal submission, to live in harmony with the regime. Carney’s application was pointed: for years, US allies have displayed the signs of the liberal international order, pretending the partnership was mutual, that rules mattered, that values were shared. Even as reality diverged.

“It is time for companies and countries to take their signs down.”

What followed the speech was more interesting than the speech itself. Days later, Canada became the first non-European G7 nation to join the EU’s SAFE defense initiative, a €150 billion fund for joint European defense procurement. Canadian firms now have preferential access to the European defense market, treated on par with EU companies. Days before Davos, Carney had traveled to Beijing to secure a preliminary trade agreement on electric vehicles, 49,000 units at 6.1% tariff, compared to the 100% tariff the US imposes.

The intellectual framework Carney articulated has a name now: “middle power realism.” It’s built on three observations.

(1) The US is no longer a reliable partner. Not because of Trump specifically, but because American politics has shifted in ways that make transactional unilateralism the new baseline. The “Donroe Doctrine”, a portmanteau of “Donald” and “Monroe”, asserts American hegemony over the Western Hemisphere with a resource-driven, security-focused twist. It treats allies as protection rackets and international law as an impediment.

(2) Nostalgia is dangerous. The pre-2016 order isn’t coming back. Waiting for “normal” to return is a strategy for decline. Middle powers that don’t build domestic strength and horizontal coalitions will find themselves, as Carney put it invoking Thucydides, “on the menu.”

(3) Sovereignty requires the capacity to say no. That means diversified partnerships, even with rivals. Canada’s China deal infuriated Trump, who accused Carney of allowing a “Trojan Horse” into the continent. But from Ottawa’s perspective, the ability to trade with Beijing is precisely what makes Canadian sovereignty credible. You can’t negotiate from strength if you have no alternatives.

The European response follows similar logic. During the Greenland crisis, when Trump threatened tariffs on eight European nations and refused to rule out military force to “secure” the island, the EU threatened to deploy its Anti-Coercion Instrument against the United States. For the first time, the bloc signaled willingness to engage in a trade war with its primary security guarantor to protect the sovereignty of a member state.

The SAFE instrument itself is designed for strategic autonomy. Strict “Buy European” provisions limit subcontractors from non-EU countries to 15-35% of contract value, phased out within two years. The explicit goal is ITAR-free supply chains, defense procurement that doesn’t depend on American permission. Meanwhile, the UK, which refused Trump’s Board of Peace but remains excluded from SAFE due to post-Brexit negotiating failures, finds itself in strategic limbo. Alienated from Washington, locked out of European defense architecture, the “mid-Atlantic bridge” is collapsing.

There’s a strange inversion happening in the international system. At Davos, China positioned itself as the defender of the UN Charter, rejecting Trump’s “Board of Peace” as a parallel structure that undermines international law. The authoritarian superpower defending liberal institutions while the democratic superpower seeks to dismantle them. China benefits from a multipolar system with weak enforcement mechanisms. The US benefits from a unipolar system where it makes the rules. Middle powers benefit from rules that constrain the strong, which is why the Global South found validation in Carney’s speech. The admission that the “Rules-Based Order” was often cover for Western interests resonated with nations that experienced that hypocrisy firsthand.

The term “middle power” has always been slightly embarrassing, an admission of limits, a confession that you’re not at the top table. But there’s a realism emerging in these countries that the great powers lack. They can’t afford illusions about the international system because they don’t control it. They have to see clearly or get crushed.

Carney’s greengrocer metaphor cuts both ways. Yes, taking down the sign exposes the illusion. But it also means operating without the protection the illusion provided. The grocer who removes the slogan faces consequences. So do countries. Canada is betting it can navigate between giants, trading with China, defending alongside Europe, maintaining what leverage it has with Washington. The EU is betting it can build autonomous defense capacity fast enough to matter. Japan, Australia, and others are making similar calculations, hedging relationships that used to be taken for granted.

The Most Expensive Assumption in AI

Sara Hooker’s paper arrived with impeccable timing. On the slow death of scaling dropped just as hyperscalers are committing another $500 billion to GPU infrastructure, bringing the industry’s total bet on the scaling thesis to somewhere north of a trillion dollars. I’ve been tracking these capital flows for my own portfolio. Either Hooker is early to a generational insight or she’s about to be very publicly wrong. Hyperscaler AI capital expenditure 2019-2025 The core argument is very simple: bigger is not always better. Llama-3 8B outperforms Falcon 180B. Aya 23 8B beats BLOOM 176B despite having only 4.5% of the parameters. These are not isolated flukes. Hooker plots submissions to the Open LLM Leaderboard over two years and finds a systematic trend where compact models consistently outperform their bloated predecessors. The bitter lesson, as Rich Sutton framed it, was that brute force compute always wins. Hooker’s counter is that maybe we’ve been held hostage to “a painfully simple formula” that’s now breaking down. Model size vs benchmark performance showing smaller models outperforming larger ones Scaling laws, she notes, only reliably predict pre-training test loss. When you look at actual downstream performance, the results are “murky or inconsistent.” The term “emergent properties” gets thrown around to describe capabilities that appear suddenly at scale, but Hooker points out this is really just a fancy way of admitting we have no idea what’s coming. If your scaling law can’t predict emergence, it’s not much of a law.
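For context on what scaling laws do predict: the Chinchilla fit (Hoffmann et al., 2022) models pre-training loss as L(N, D) = E + A/N^α + B/D^β, a smooth function of parameters N and tokens D that says nothing about which downstream capabilities appear where. A minimal sketch using the paper’s published constants, purely for illustration:

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted pre-training loss (nats/token) under the Chinchilla fit.

    Constants are the fitted values reported by Hoffmann et al. (2022);
    treat them as illustrative, not as a statement about any current model.
    """
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss falls smoothly and predictably with parameter count...
for n in (8e9, 70e9, 180e9):
    print(f"{n/1e9:>5.0f}B params, 1.4T tokens: "
          f"loss ≈ {chinchilla_loss(n, 1.4e12):.3f}")
# ...but nothing in this curve predicts which capabilities "emerge"
# at which loss value, which is exactly Hooker's complaint.
```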

Gary Marcus has been making a related argument from a different angle. The cognitive scientist, whose 2001 book predicted hallucination problems, calls LLMs “glorified memorization machines” that work because the internet contains answers to most common queries. His framing is less academic and more market-oriented: the jump from GPT-1 to GPT-4 showed obvious qualitative leaps requiring no benchmarks. The jump from GPT-4 to GPT-5? Marginal improvements requiring careful measurement. The textbook definition of diminishing returns.

The market signals are worth watching. According to Goldman Sachs data, hedge fund short interest in utilities now sits at the 99th percentile relative to the past five years. Utilities. The bet appears to be that AI data center demand, the premise on which American Electric Power trades at $65 billion, may not materialize as expected. Meanwhile, names like Bloom Energy, Oracle, and various AI-adjacent plays are showing up on heavily-shorted lists. Hedge funds aren’t yet betting against Nvidia directly, but they’re circling the weaker members of the herd.

There’s a certain irony here that Hooker captures well. Academia was effectively priced out of meaningful AI research by the compute arms race. The explosion in necessary compute “marginalized academia from meaningfully participating in AI progress.” Industry labs stopped publishing to preserve commercial advantage. Now, as scaling hits diminishing returns, the skills that matter shift back toward algorithmic cleverness, data quality, and architectural innovation. Things that don’t require a billion-dollar data center. If you got priced out of the game, the game may be coming back to you. Hooker writes,

The less reliable gains from compute makes our purview as computer scientists interesting again

The quiet tell is how frontier labs are actually behaving. Major players are now incorporating classical symbolic tools, things like Python interpreters and code execution, into LLM pipelines. These symbolic components run on CPUs, not GPUs. Ilya Sutskever, coauthor of the 2012 ImageNet paper and OpenAI cofounder, publicly stated that

We need to go back to the age of research

Shorting the scaling thesis has been a widow-maker trade for the better part of three years. Nvidia is up roughly 800% since 2022. As I’ve written before, the market can remain irrational longer than you can remain solvent, and that applies to both directions. OpenAI reportedly burns around $3 billion monthly with a $40 billion funding round implying perhaps 13 months of runway. If the next mega-round prices down or requires distressed terms, that’s your signal. Until then, the thesis may be directionally correct on the technical limitations while the timing remains treacherous.

We can only see a short distance ahead, but we can see plenty there that needs to be done.

As Alan Turing put it, and Hooker quotes approvingly. The scaling era produced real capabilities alongside real capital misallocation. What comes next is genuinely uncertain. That uncertainty cuts both ways.

Against All Odds: The Mathematics of 'Provably Fair' Casino Games PROJECT


Gambling can be harmful and lead to significant losses. Participation is subject to local laws and age restrictions. Always gamble responsibly. Need help? Visit BeGambleAware.org


Crash games represent a category of online gambling where players place bets on an increasing multiplier that can ‘crash’ at any moment. The fundamental mechanic requires players to cash out before the crash occurs; successful cash-outs yield the bet amount multiplied by the current multiplier, while failure results in total loss of the wager.

Crash game showing an airplane flying with increasing multiplier until it crashes Crash game showing an airplane flying with increasing multiplier until it crashes

The specific game I came across is a variant that employs an aircraft flight metaphor. Let’s call it Plane Game. What intrigued me wasn’t the game itself but that it said “provably fair” on the startup screen, which I assumed to be a typo at first. I stand corrected:

A provably fair gambling system uses cryptography to let players verify that each outcome was generated from fixed inputs, rather than chosen or altered by the operator after a bet is placed. The casino commits to a hidden “server seed” via a public hash, combines it with a player-controlled “client seed” and a per-bet nonce, and later reveals the server seed so anyone can recompute and confirm the result.
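The exact derivation differs by operator, but the commit-reveal pattern is easy to sketch. The seeds below are made up, and the digest-to-multiplier mapping (first 52 hex-derived bits of an HMAC, turned into a uniform draw) is one common construction, not necessarily the one this game uses:

```python
import hashlib
import hmac

# Hypothetical inputs; a real operator publishes the server-seed hash
# before any bets are placed.
server_seed = "f0e1d2c3b4a5968778695a4b3c2d1e0f"
client_seed = "my-client-seed"
nonce = 0

# 1. Commitment: the operator publishes H(server_seed) up front.
commitment = hashlib.sha256(server_seed.encode()).hexdigest()

# 2. Outcome: derive the round from HMAC-SHA256(server_seed, client_seed:nonce).
digest = hmac.new(server_seed.encode(),
                  f"{client_seed}:{nonce}".encode(),
                  hashlib.sha256).hexdigest()

# 3. Map the first 13 hex chars (52 bits) to u in [0, 1) and convert it
#    into a crash multiplier with survival P(M >= m) = r/m.
r = 0.97                                   # stated RTP
u = int(digest[:13], 16) / 16**13
multiplier = max(1.0, r / max(u, 1e-12))   # guard against u == 0

# 4. Verification: once server_seed is revealed, anyone can recompute
#    both the commitment and the multiplier.
assert hashlib.sha256(server_seed.encode()).hexdigest() == commitment
print(f"commit {commitment[:16]}..., crash at {multiplier:.2f}x")
```

The point of the scheme is that the outcome is fixed at commit time: the operator cannot change the multiplier after seeing your bet without breaking the hash.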

The stated Return-to-Player (RTP) of that specific game is 97%, implying a 3% house edge. After watching a few rounds, the perceived probability felt off. And if there’s something that gets my attention, it’s the combination of games and statistics. So I did what any reasonable person would do: I watched another 20,000 rounds over six days (112 hours total) and wrote a paper about it. Script recording 20000 rounds over six days (112 hours total)

The distribution below shows the classic heavy tail: most rounds crash quickly at low multipliers, while rare events produce 100x or even 1000x payouts. The maximum I observed was 10,000x. This extreme variance creates the illusion of big wins just around the corner while the house edge operates relentlessly over time. Heavy-tailed distribution of crash multipliers on log-log scale showing most rounds end at low multipliers while rare events exceed 100x or 1000x, with maximum observed at 10,000x For a crash game with RTP = r (where 0 < r < 1), the crash multiplier M follows a specific probability distribution. The survival function is particularly relevant:

$$P(M \geq m) = \frac{r}{m}$$

This means the probability of reaching at least multiplier m before crashing equals r/m. For any cash-out target, the expected value of a unit bet works out to:

$$E[\text{Profit}] = P(M \geq m) \times m - 1 = \frac{r}{m} \times m - 1 = r - 1 = -0.03$$

This mathematical property makes crash games theoretically “strategy-proof” in expectation. No cash-out timing strategy should yield better long-term results than another. Survival probability curve on log-log scale showing probability of reaching target multiplier: 2x succeeds 48.5% of the time, 5x at 19.6%, 10x at 9.7%, 50x at 2.0%, and 100x at just 1.1% The empirical data matches theory almost perfectly. A 2x target succeeds about 48.5% of the time. Aiming for 10x? That works only 9.7% of rounds. The close fit between my observations and the theoretical line confirms the stated 97% RTP.
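The strategy-proof property is easy to check numerically. A minimal Monte Carlo sketch, sampling crash points from the survival function P(M ≥ m) = r/m via inverse transform (the instant-bust mass at 1.00x carries the remaining 1 − r probability):

```python
import random

def crash_multiplier(r=0.97, rng=random):
    """Sample a crash point M with survival P(M >= m) = r/m for m > 1.
    With probability 1 - r the round busts instantly at 1.00x."""
    u = max(rng.random(), 1e-12)   # guard against u == 0
    return max(1.0, r / u)

def expected_profit(target, n=200_000, r=0.97, seed=42):
    """Monte Carlo estimate of profit on a 1-unit bet cashed out at `target`."""
    rng = random.Random(seed)
    wins = sum(crash_multiplier(r, rng) >= target for _ in range(n))
    return (wins / n) * target - 1.0

for m in (1.5, 2.0, 5.0, 10.0):
    print(f"target {m:>4}x: E[profit] ≈ {expected_profit(m):+.3f}")
```

Every target hovers around r − 1 = −0.03: the cash-out point changes the variance, never the expectation.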

So is the game fair? My analysis says yes. Using three different statistical methods (log-log regression, maximum likelihood, and the Hill estimator), I estimated the probability density function exponent at α ≈ 1.98, within 2.2% of the theoretical value of 2.0. This contrasts with Wang and Pleimling’s 2019 research that found exponents of 1.4 to 1.9 for player cashout distributions. The key distinction: their deviations reflect player behavioral biases (probability weighting), not game manipulation. The random number generator produces fair outcomes. Q-Q plot comparing empirical vs theoretical quantiles with perfect fit line and 10% confidence band, showing close alignment confirming fair random number generation I then ran Monte Carlo simulations of 10,000 betting sessions under four different strategies: conservative 1.5x cashouts, moderate 2.0x, aggressive 3.0x, and high-risk 5.0x targets. Strategy comparison boxplot showing session returns for 100 rounds: 1.5x Conservative averages -2.9%, 2.0x Moderate -2.4%, 3.0x Aggressive -3.3%, and 5.0x High Risk -3.5%, all negative Every single strategy produces negative expected returns. The conservative approach has lower variance but still loses. The aggressive strategies lose faster with higher variance. Simulated player sessions using 1.5x strategy over 200 rounds showing multiple trajectories trending toward expected loss line of -3% per round The consumer protection angle is what concerns me most.
My data revealed 179 rounds per hour with 16-second median intervals. At that pace, a flat bettor pays the 3% edge 179 times an hour: an expected loss of more than five times a single round’s stake for every hour of play. The manual cashout mechanic creates an illusion of control, masking the deterministic nature of losses.
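The session-level picture can be reproduced in a few lines. A sketch of the kind of simulation behind the trajectories above (flat 1-unit stakes, 1.5x target, parameters are illustrative):

```python
import random

def session(target=1.5, rounds=200, stake=1.0, r=0.97, rng=None):
    """P&L of one flat-stake session: bet `stake` each round, cash out at `target`."""
    rng = rng or random.Random()
    pnl = 0.0
    for _ in range(rounds):
        u = max(rng.random(), 1e-12)
        m = max(1.0, r / u)              # crash point, survival P(M >= x) = r/x
        pnl += stake * (target - 1.0) if m >= target else -stake
    return pnl

rng = random.Random(7)
results = [session(rng=rng) for _ in range(5_000)]
mean = sum(results) / len(results)
print(f"mean P&L over 200-round sessions: {mean:+.2f} units")
```

The mean sits near 200 × (r − 1) = −6 units; individual sessions scatter widely around that line, which is exactly what makes the game feel winnable round to round.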

The game is provably fair in the cryptographic sense. The mathematics check out. But mathematical fairness doesn’t ensure consumer safety. The house always wins, and it wins fast.

The only winning strategy is not to play

The full paper preprint with methodology and statistical details is available on SSRN. Code and data are on GitHub.

Enterprise AI Strategy is Backwards

That’s the claim made by LinkedIn co-founder Reid Hoffman. It’s a bold assertion, so I set out to investigate whether the data supports it. Report Header Overview The result is a comprehensive report, backed by more than 30 sources. You can download the full report and the accompanying presentation for free.


Global AI spending hit $13.8 billion, a six-fold increase since late 2023. Yet 85% of AI projects never reach production. Only 26% of companies can translate pilots into outcomes. The gap between ambition and execution has become so predictable that Gartner now officially places generative AI in the “trough of disillusionment.”

There’s an economic concept called Jevons paradox (yes, I referenced this before). When efficiency improves for a resource, consumption increases, not decreases. Coal-efficient steam engines didn’t reduce coal usage, they made coal so useful that demand exploded. The same logic applies to organizational communication. Email was supposed to reduce meetings. Slack was supposed to reduce email. AI was supposed to reduce everything.

Instead, the average employee now spends 57% of their workday on coordination: communicating, updating, aligning. Meetings alone cost the US economy $532 billion per year. This is the coordination layer, where organizations actually run, and where organizations quietly bleed.

Three observations:

(1) Only 26% of companies have the maturity to translate AI pilots into outcomes. The rest are layering AI on legacy workflows instead of redesigning them.
(2) Language models bridge the gap between messy human communication and structured data. Transcripts to CRM fields. Teams using these tools report 30% higher win rates and 80% less manual work.
(3) AI gains compound when shareable. A summary helps one person. A system that captures and distributes knowledge helps everyone downstream.

The coordination layer isn’t glamorous. It’s transcripts, status updates, action items, CRM entries. It’s the administrative exhaust of getting anything done with other people. And it’s almost entirely composed of language. We have language models now, models that extract structured data from messy transcripts and convert meeting notes into CRM fields with 99% accuracy.

Yet most enterprise AI strategies ignore this entirely. They’re focused on chatbots and demos for board presentations. Meanwhile, the language processing that constitutes the primary workload of any modern business remains stuck in the same recursive loops. The winners won’t be companies with great AI announcements. They’ll be the ones building daily habits early enough for the gains to stack.