philippdubach

June 22, 2025 • ∞

Behavioral Economics & Transit Policy →

Over the weekend a WSJ editorial on the 2025 New York City mayoral election called Zohran Mamdani, one of the potential Democratic candidates, “a literal socialist” for, among other things, running on the promise of free bus rides for all:

Zohran won New York’s first fare-free bus pilot on five lines across the city. As Mayor, he’ll permanently eliminate the fare on every city bus […] Fast and free buses will not only make buses reliable and accessible but will improve safety for riders and operators – creating the world-class service New Yorkers deserve.

Free public transit seems to be a recurring idea among politicians. For some reason, making it free feels revolutionary in a way that making it cheaper never could. There’s actually some solid behavioral economics behind this intuition, laid out in the paper “Zero as a Special Price: The True Value of Free Products” by Shampanier, Mazar, and Ariely. (Yes, before the fabricated data scandals, Ariely did write research that has replicated consistently.) The basic finding: people don’t treat “free” as just another very low price. When you price something at zero, it gets a special psychological boost that makes people value it far more than a pure cost-benefit analysis would justify. Give people a choice between a Hershey’s Kiss for 1¢ and a Lindt truffle for 15¢, and most choose the obviously superior Lindt. Now make it Hershey’s for free versus Lindt for 14¢, keeping the price difference exactly the same, and suddenly most people want the Hershey’s. Free doesn’t just eliminate cost; it creates additional perceived value. The mechanism is pure affect. “Free” makes people feel good in a way that “1¢” doesn’t, even though the economic difference is trivial. When you force people to think analytically about the trade-offs, the effect disappears. But in normal decision-making, that warm fuzzy feeling of getting something for nothing dominates rational calculation.
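To make the chocolate example concrete, here is a minimal sketch of the logic. The valuations (10¢ for the Hershey’s, 30¢ for the Lindt) and the size of the "free" bonus are made-up numbers for illustration, not figures from the paper, and the function names are hypothetical.

```python
def net_benefit(value, price, free_bonus=0.0):
    """Perceived value minus price, plus an extra affect bonus only when price is zero."""
    bonus = free_bonus if price == 0 else 0.0
    return value - price + bonus


def choose(options, free_bonus=0.0):
    """Return the name of the option with the highest net benefit."""
    return max(options, key=lambda name: net_benefit(*options[name], free_bonus))


# Hypothetical valuations in cents: (perceived value, price)
priced = {"hersheys": (10, 1), "lindt": (30, 15)}
free = {"hersheys": (10, 0), "lindt": (30, 14)}

# Pure cost-benefit: dropping both prices by 1 cent changes nothing.
print(choose(priced))               # lindt
print(choose(free))                 # lindt

# With an assumed 8-cent "free" bonus, the choice flips.
print(choose(free, free_bonus=8))   # hersheys
```

Under pure cost-benefit reasoning the gap between the two options is identical in both conditions, so only something like the zero-price bonus can explain the switch.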

For most riders’ budgets, going from a $2.75 bus fare to $0 isn’t meaningfully different from going from $2.75 to $0.75. But psychologically? Free transit feels like a gift from the city. Cheap transit feels like commerce. The first activates social norms (gratitude, civic participation, shared ownership). The second activates market norms (cost-benefit analysis, value-for-money calculations, consumer complaints when service is bad). On the other hand, any positive price, no matter how small, forces people into analytical mode. People start thinking about trade-offs, evaluating whether the service is worth it, considering alternatives. This is why congestion pricing works so well. A $5 charge to drive in Manhattan (or Singapore, London, Stockholm, Milan, Gothenburg) isn’t going to bankrupt anyone who can afford to drive in Manhattan. But it makes people think about each trip in a way they never did when driving felt “free” (ignoring gas, parking, insurance, etc.). Once you’re thinking analytically rather than just following habit, you’re much more likely to take the subway.

But!! Free transit might actually make it easier to cut transit funding, not harder. Right now, when transit agencies face budget cuts, fare-paying riders get angry. They’re customers! They paid for service! They demand value for money! This creates a natural constituency defending transit budgets. Make transit free, and you’ve eliminated that market relationship. Riders become passive beneficiaries rather than paying customers. When service gets worse, they can’t complain about not getting their money’s worth; they’re getting exactly what they paid for. If I were a politician looking to slash subsidies without political blowback, step one would be eliminating fares. Tell everyone it’s about equity and access. Then, once people stop thinking of themselves as customers, start the real cuts. No more late-night service; hey, it’s free! Longer waits, dirtier stations, broken escalators; what did you expect for nothing? The behavioral economics is clear: when something is free, people have lower expectations and less standing to complain. The zero-price effect works both ways. None of this means transit shouldn’t be affordable!

June 15, 2025 • ∞

It Just Ain’t So →

It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.

This (not actually) Mark Twain quote from The Big Short captures the sentiment of realizing that some foundational assumptions might be empirically wrong.

A recent article by Anton Vorobets that I came across in Justina Lee’s Quant Newsletter presents compelling evidence that challenges one of the field’s fundamental statistical assumptions, that asset returns follow normal distributions. Using 26 years of data from 10 US equity indices, he ran formal normality tests (Shapiro-Wilk, D’Agostino’s K², Anderson-Darling) and found that the normal distribution hypothesis gets rejected in most cases. The supposed “Aggregational Gaussianity” that academics invoke through Central Limit Theorem arguments? It’s mostly wishful thinking enabled by small sample sizes. As Vorobets observes:

Finance and economics academia is unfortunately driven by several convenient myths, i.e., claims that are taken for granted and spread among university academics despite their poor empirical support.
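The kind of formal normality testing described above can be sketched in a few lines. This is not Vorobets’ code or data: the returns are simulated from a Student-t distribution with 3 degrees of freedom as a stand-in for fat-tailed returns, the sample size and seed are arbitrary, and Anderson-Darling is left out for brevity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Assumption: simulated fat-tailed "daily returns", not real index data.
n_obs = 2000
fat_tailed_returns = rng.standard_t(3, size=n_obs) * 0.01


def normality_report(returns, alpha=0.05):
    """Run two normality tests and flag whether each rejects at level alpha."""
    _, sw_p = stats.shapiro(returns)      # Shapiro-Wilk
    _, k2_p = stats.normaltest(returns)   # D'Agostino's K^2
    return {
        "shapiro_wilk": {"p_value": float(sw_p), "reject": bool(sw_p < alpha)},
        "dagostino_k2": {"p_value": float(k2_p), "reject": bool(k2_p < alpha)},
    }


report = normality_report(fat_tailed_returns)
for test_name, result in report.items():
    print(test_name, result)
```

With real data you would replace the simulated array with a series of index returns and repeat the test per index and per return horizon.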

The article highlights significant practical consequences for portfolio management and risk assessment. Portfolio optimization based on normal distribution assumptions ignores fat left tails—exactly the kind of extreme downside events that can wipe out portfolios. This misspecification can lead to inadequate risk management and suboptimal asset allocation decisions. Vorobets suggests alternative approaches, including Monte Carlo simulations combined with Conditional Value-at-Risk (CVaR) optimization, which better accommodate the complex distributional properties observed in financial data. While computationally more demanding, these methods offer improved alignment with empirical reality.
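To illustrate why the tails matter, here is a small Monte Carlo sketch that compares Value-at-Risk and CVaR for normal and fat-tailed scenarios with the same volatility. It only measures CVaR on simulated scenarios; it is not the CVaR optimization Vorobets proposes, and the 1% daily volatility, the Student-t distribution, and the 99% level are assumptions chosen for illustration.

```python
import numpy as np


def var_cvar(returns, alpha=0.99):
    """Empirical VaR and CVaR (expected shortfall) of losses at level alpha."""
    losses = -np.asarray(returns, dtype=float)
    var = np.quantile(losses, alpha)
    cvar = losses[losses >= var].mean()  # average loss in the worst (1 - alpha) tail
    return var, cvar


rng = np.random.default_rng(7)
n_scenarios = 100_000
daily_vol = 0.01  # assumed volatility, identical for both distributions

normal_returns = rng.normal(0.0, daily_vol, n_scenarios)
# A Student-t with 3 degrees of freedom has variance 3, so rescale to match daily_vol.
fat_returns = rng.standard_t(3, n_scenarios) * daily_vol / np.sqrt(3.0)

var_n, cvar_n = var_cvar(normal_returns)
var_f, cvar_f = var_cvar(fat_returns)
print(f"normal:     VaR={var_n:.4f}  CVaR={cvar_n:.4f}")
print(f"fat-tailed: VaR={var_f:.4f}  CVaR={cvar_f:.4f}")
```

Both samples have the same volatility by construction, yet the fat-tailed one shows a clearly larger average loss in the worst 1% of scenarios, which is exactly what a normal assumption hides.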

Reading this piece gave me a few ideas for extensions I might want to explore in an upcoming personal project: (1) While Vorobets focuses on US equity indices, similar analysis across fixed income, commodities, currencies, and alternative assets would provide a more comprehensive view of distributional properties across financial markets. Each asset class exhibits distinct market microstructure characteristics that may influence distributional behavior. (2) Extending the geographic scope to include developed, emerging, and frontier markets would show whether the documented deviations from normality are universal or specific to US market structures. Cross-regional analysis could reveal important insights about market development, regulatory frameworks, and institutional differences. (3) Building upon Vorobets’ foundation, there are opportunities to incorporate multivariate normality testing, regime-dependent analysis, and time-varying parameter models. Additionally, investigating the power and robustness of different statistical tests across various market conditions would strengthen the methodological contribution. (4) Examining different time horizons, market regimes (pre- and post-financial crisis, COVID period), and potentially higher-frequency data could provide deeper insights into when and why distributional assumptions break down.

June 12, 2025 • ∞

Not All AI Skeptics Think Alike →

Apple’s recent paper “The Illusion of Thinking” has been widely understood to demonstrate that reasoning models don’t ‘actually’ reason. Using controllable puzzle environments instead of contaminated math benchmarks, they discovered something fascinating: there are three distinct performance regimes when it comes to AI reasoning complexity. For simple problems, standard models actually outperform reasoning models while being more token-efficient. At medium complexity, reasoning models show their advantage. But at high complexity? Both collapse completely. Here’s the kicker: reasoning models exhibit counterintuitive scaling behavior—their thinking effort increases with problem complexity up to a point, then declines despite having adequate token budget. It’s like watching a student give up mid-exam when the questions get too hard, even though they have plenty of time left.

We observe that reasoning models initially increase their thinking tokens proportionally with problem complexity. However, upon approaching a critical threshold—which closely corresponds to their accuracy collapse point—models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty.

The researchers found something even more surprising: even when they provided explicit algorithms, essentially giving the models the answers, performance didn’t improve. The collapse happened at roughly the same complexity threshold. Sean Goedecke, on the other hand, is not buying Apple’s methodology. His core objection? Puzzles “require computer-like algorithm-following more than they require the kind of reasoning you need to solve math problems.”

You can’t compare eight-disk to ten-disk Tower of Hanoi, because you’re comparing “can the model work through the algorithm” to “can the model invent a solution that avoids having to work through the algorithm”.

From his own testing, models “decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to even start.” That’s strategic behavior, not reasoning failure. This matters because it shows how evaluation methodology shapes our understanding of AI capabilities. Goedecke argues Tower of Hanoi puzzles aren’t useful for determining reasoning ability, and that the complexity threshold of reasoning models may not be fixed.
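To get a feel for what "hundreds of algorithmic steps" means here, the standard recursive Tower of Hanoi solution can be written out directly. This is the textbook algorithm, not code from the Apple paper or from Goedecke’s tests; the peg names are arbitrary.

```python
def hanoi_moves(n_disks, source="A", target="C", spare="B"):
    """Return the full list of (from_peg, to_peg) moves for n_disks."""
    if n_disks == 0:
        return []
    return (
        hanoi_moves(n_disks - 1, source, spare, target)    # clear the way
        + [(source, target)]                               # move the largest disk
        + hanoi_moves(n_disks - 1, spare, target, source)  # stack the rest back on top
    )


for n in (3, 8, 10):
    print(n, "disks:", len(hanoi_moves(n)), "moves")
```

The move count is 2^n - 1, so going from eight to ten disks means going from 255 to 1,023 moves that have to be written out without a single mistake.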

May 31, 2025 • ∞

Your AI Assistant Might Rat You Out →

There was this story going around the past few days:

Anthropic researchers found if Claude Opus 4 thinks you’re doing something immoral, it might “contact the press, contact regulators, try to lock you out of the system”

The story was mostly driven by a Sam Bowman tweet referring to section 4.1.9 of the Claude 4 System Card, on high-agency behavior. The outrage came mostly from people who misunderstood the prerequisites for such a scenario. Nevertheless, an interesting question emerged: What happens when you feed an AI model evidence of fraud and give it an email tool? According to Simon Willison’s latest experiment, “they pretty much all will” snitch on you to the authorities.

A fun new benchmark just dropped! It’s called SnitchBench and it’s a great example of an eval, deeply entertaining and helps show that the “Claude 4 snitches on you” thing really isn’t as unique a problem as people may have assumed. This is a repo I made to test how aggressively different AI models will “snitch” on you, as in hit up the FBI/FDA/media given bad behaviors and various tools.

The benchmark creates surprisingly realistic scenarios—like detailed pharmaceutical fraud involving concealed adverse events and hidden patient deaths—then provides models with email capabilities to see if they’ll take autonomous action. This reveals something fascinating about AI behavior that goes beyond traditional benchmarks. Rather than testing reasoning or knowledge, SnitchBench probes the boundaries between helpful assistance and autonomous moral decision-making. When models encounter what appears to be serious wrongdoing, do they become digital whistleblowers?

The implications are both reassuring and unsettling. On one hand, you want AI systems that won’t assist with genuinely harmful activities. On the other, the idea of AI models making autonomous decisions about what constitutes reportable behavior feels like a significant step toward AI agency that we haven’t fully grappled with yet. Therefore, Anthropic’s own advice here seems like a good rule to follow:

Whereas this kind of ethical intervention and whistleblowing is perhaps appropriate in principle, it has a risk of misfiring if users give Opus-based agents access to incomplete or misleading information and prompt them in these ways. We recommend that users exercise caution with instructions like these that invite high-agency behavior in contexts that could appear ethically questionable.

May 30, 2025 • PROJECT

Modeling Glycemic Response with XGBoost

Earlier this year I wrote about how I built a CGM data reader after wearing a continuous glucose monitor myself. Since I was already logging my macronutrients and learning more about molecular biology in an MIT MOOC, I became curious whether a meal’s macronutrients (carbs, protein, fat) and some basic individual characteristics (age, BMI) could serve as features in a machine learning regression model to predict the parameters of the postprandial glucose curve (how my blood sugar levels change after eating). I came across a paper on Personalized Nutrition by Prediction of Glycemic Responses which did exactly that. Unfortunately, neither the data nor the code were publicly available. And I wanted to predict my own glycemic response curve. So I decided to build my own model.

In the process I wrote this working paper. The paper represents an exercise in applying machine learning techniques to medical applications. The methodologies employed were largely inspired by Zeevi et al.’s approach. I quickly realized that training a model on my own data alone was not very promising, if not impossible. To tackle this, I used the publicly available Hall dataset containing continuous glucose monitoring data from 57 adults, which I narrowed down to 112 standardized meals from 19 non-diabetic subjects with their respective glucose curve after the meal (full methodology in the paper).

[Figure: Overview of the CGM pipeline workflow]

Rather than trying to predict the entire glucose curve, I simplified the problem by fitting each postprandial response to a normalized Gaussian function. This gave me three key parameters to predict: amplitude (how high glucose rises), time-to-peak (when it peaks), and curve width (how long the response lasts).

[Figure: Overview of a single fitted curve of CGM measurements]

The Gaussian approximation worked surprisingly well for characterizing most glucose responses. While some curves fit better than others, the majority of postprandial responses were well captured, though there’s clear variation between individuals and meals. Some responses were high in amplitude and narrow in width, while others were more gradual and prolonged.

[Figure: Overview of selected fitted curves]

I then trained XGBoost regressors on 27 engineered features including meal composition, participant characteristics, and interaction terms. XGBoost was chosen for its ability to handle mixed data types, built-in feature importance, and strong performance on tabular data. The pipeline included hyperparameter tuning with 5-fold cross-validation to optimize learning rate, tree depth, and regularization parameters. Rather than relying solely on basic meal macronutrients, I engineered features across multiple categories and implemented CGM statistical features calculated over different time windows (24-hour and 4-hour periods), including time-in-range and glucose variability metrics. Architecture-wise, I trained three separate XGBoost regressors, one for each Gaussian parameter.
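The curve-fitting step can be sketched as follows. This is a minimal illustration, not the code from the repository: the Gaussian parametrization, the 5-minute sampling grid, the noise level, and the synthetic "true" parameters are all assumptions, and the data is simulated rather than taken from the Hall dataset.

```python
import numpy as np
from scipy.optimize import curve_fit


def gaussian(t, amplitude, t_peak, width):
    """Glucose rise above baseline as a Gaussian bump over time (minutes)."""
    return amplitude * np.exp(-((t - t_peak) ** 2) / (2.0 * width ** 2))


# Assumption: one synthetic postprandial response sampled every 5 minutes for 3 hours.
rng = np.random.default_rng(0)
t = np.arange(0.0, 180.0, 5.0)
true_params = (45.0, 50.0, 25.0)  # mg/dL rise, minutes to peak, width in minutes
glucose_rise = gaussian(t, *true_params) + rng.normal(0.0, 2.0, t.size)

# Start the optimizer from the observed maximum and a rough width guess.
initial_guess = (glucose_rise.max(), t[glucose_rise.argmax()], 30.0)
fitted, _ = curve_fit(gaussian, t, glucose_rise, p0=initial_guess)
amplitude, t_peak, width = fitted[0], fitted[1], abs(fitted[2])

print(f"amplitude={amplitude:.1f}  time_to_peak={t_peak:.1f}  width={width:.1f}")
```

Each meal then collapses into three numbers, and those three numbers become the regression targets for the three separate models.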

While the model achieved moderate success predicting amplitude (R² = 0.46), it completely failed at predicting timing: time-to-peak prediction was worse than simply guessing the mean (R² = -0.76), and curve width prediction was barely better than the mean (R² = 0.10). Even the amplitude prediction, while statistically significant, falls well short of an R² > 0.7. Studies that have achieved better predictive performance typically used much larger datasets (>1000 participants). For my original goal of predicting my own glycemic responses, this suggests that either individual-specific models trained on extensive personal data, or much more sophisticated approaches incorporating larger training datasets, would be necessary.
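A negative R² can look odd at first, so here is the definition spelled out. The numbers below are toy values for illustration, not results from the paper.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always guessing the mean
    return 1.0 - ss_res / ss_tot


# Toy time-to-peak values in minutes (made up).
observed = [30.0, 45.0, 60.0, 75.0]
mean_guess = [52.5] * 4                 # always predict the average
bad_model = [75.0, 60.0, 45.0, 30.0]    # systematically wrong ordering

print(r_squared(observed, mean_guess))  # 0.0
print(r_squared(observed, bad_model))   # -3.0
```

An R² of 0 means the model is no better than always predicting the mean; anything below 0 means it is actively worse than that baseline.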

The complete code, Jupyter notebooks, processed datasets, and supplementary results are available in my GitHub repository.
_ _

(10/06/2025) Update: Today I came across Marcel Salathé’s LinkedIn post on a publication out of EPFL: Personalized glucose prediction using in situ data only.

With data from over 1,000 participants of the Food & You digital cohort, we show that a machine learning model using only food data from myFoodRepo and a glucose monitor can closely track real blood sugar responses to any meal (correlation of 0.71).

As expected, Singh et al. achieve better predictive performance (R = 0.71 versus my R² = 0.46), although the two numbers are not directly comparable: a Pearson correlation of 0.71 corresponds to roughly 0.50 when squared, and the prediction targets differ. Besides probably higher methodological rigor and scientific quality, the most critical difference is sample size: their 1'000+ participants versus my 19 participants (from the Hall dataset) represent a fundamental difference in statistical power and generalizability. They addressed one of the shortcomings I faced by leveraging a large digital nutritional cohort from the “Food & You” study (including high-resolution data of nutritional intake of more than 46 million kcal collected from 315'126 dishes over 23'335 participant days, 1'470'030 blood glucose measurements, 49'110 survey responses, and 1'024 samples for gut microbiota analysis).

Apart from that, I am excited to observe, at first glance, the following similarities: (1) Both aim to predict postprandial glycemic responses using machine learning, with a focus on personalized nutrition applications. (2) Both employ XGBoost regression as their primary predictive algorithm and use similar performance metrics (R², RMSE, MAE, Pearson correlation). (3) Both extract comprehensive feature sets including meal composition (macronutrients), temporal features, and individual characteristics. (4) Both use mathematical approaches to characterize glucose responses: I used Gaussian curve fitting, while Singh et al. use incremental area under the curve (iAUC). (5) Both employ cross-validation techniques for model evaluation and hyperparameter tuning. (6) Both use SHAP for model interpretability and feature importance analysis.

May 30, 2025 • ∞

Gambling vs. Investing →

Kalshi, a prediction market startup, is using its federal financial license to offer sports betting nationwide, even in states where it’s not legal. The move has earned them cease-and-desist letters from state gaming regulators, but CEO Tarek Mansour isn’t backing down:

We can go one by one for every financial market and it would fall under the definition of gambling. So what’s the difference?

It’s a question that cuts to the heart of modern finance. The founders argue that Wall Street blurred the line between investing and gambling long ago, and casting Kalshi as the latter is inconsistent at best. They have a point—if you can bet on oil futures, Nvidia’s stock price, or interest rate movements, why is wagering on NFL touchdowns more objectionable?

Benefiting from the Trump administration’s hands-off regulatory approach, with the CFTC dropping its legal challenge to their election contracts, the odds might be in their favor. Even better, a Kalshi board member is awaiting confirmation to lead the very agency that was previously their biggest antagonist.

The technical distinction matters: Kalshi operates as an exchange between traders rather than a house taking bets against customers. But functionally, with 79% of their recent trading volume being sports-related, they’re forcing us to confront an uncomfortable reality about risk, speculation, and what we choose to call “investing.”

Whether you call it innovation or regulatory arbitrage, Kalshi is exposing the arbitrary nature of the lines we’ve drawn around acceptable financial speculation.
_ _

(17/06/2025) Update: Matt Levine - one of the finance columnists I enjoy reading most - just published a long piece “It’s Not Gambling, It’s Predicting” in his newsletter on exactly this issue:

Kalshi offers a prediction market where you can bet on sports. No! Sorry! Wrong! It offers a prediction market where you can predict which team will win a sports game, and if you predict correctly you make money, and if you predict incorrectly you lose money. Not “bet on sports.” “Predict sports outcomes for money.” Completely different.

May 28, 2025 • ∞

The Model Said So →

LLMs make your life easier until they don’t.

Their intrinsic complexity and lack of transparency pose significant challenges, especially in the highly regulated financial sector

Unlike other industries where “the model said so” might suffice, finance demands audit trails, bias detection, and explainable decision-making—requirements that sit uncomfortably with neural networks containing billions of parameters. The research highlights a fundamental tension that’s about to reshape fintech: the same complexity that makes LLMs powerful at parsing market sentiment or generating investment reports also makes them regulatory nightmares in a sector where you need to explain every decision to examiners.

May 21, 2025 • ∞

Dual Mandate Tensions →

Something interesting just happened at the National Bureau of Economic Research (NBER):

We study the optimal monetary policy response to the imposition of tariffs in a model with imported intermediate inputs. In a simple open-economy framework, we show that a tariff maps exactly into a cost-push shock in the standard closed-economy New Keynesian model, shifting the Phillips curve upward. We then characterize optimal monetary policy, showing that it partially accommodates the shock to smooth the transition to a more distorted long-run equilibrium—at the cost of higher short-run inflation.

Here’s where it gets interesting for current policy: Werning et al. show that “optimal” monetary policy actually calls for partial accommodation of tariff shocks, essentially allowing some inflation to persist to smooth the transition to what they euphemistically call “a more distorted long-run equilibrium.” With core PCE still running above the Fed’s 2% target and renewed tariff threats on the horizon, this research suggests Powell may need to abandon his recent dovish pivot and prepare for rate hikes that prioritize price stability over employment concerns. The dual mandate was never meant to be dual when the two mandates point in opposite directions.
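For readers who want the "cost-push shock" mapping spelled out, here is the textbook closed-economy setup the abstract refers to. The notation is the standard New Keynesian one and is an assumption on my part, not copied from the paper: \pi_t is inflation, x_t the output gap, u_t the cost-push term that a tariff would raise, and \lambda the weight the central bank puts on the output gap.

```latex
% New Keynesian Phillips curve with a cost-push shock u_t
\pi_t = \beta \, \mathbb{E}_t[\pi_{t+1}] + \kappa \, x_t + u_t

% Central bank's problem: trade off inflation against the output gap
\min_{\{\pi_t, x_t\}} \; \mathbb{E}_0 \sum_{t=0}^{\infty} \beta^t \left( \pi_t^2 + \lambda \, x_t^2 \right)
```

With u_t > 0 the central bank cannot hold both \pi_t and x_t at zero in that period, so any policy ends up accepting some inflation, some output loss, or a mix of both. Partial accommodation is the mix.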

May 11, 2025 • ∞

Beyond Monte Carlo: Tensor-Based Market Modeling →

A fascinating new paper from Stefano Iabichino at UBS Investment Bank explores what happens when you take the attention mechanisms powering modern AI and apply them to Wall Street’s most fundamental pricing problems, tackling what might be quantitative finance’s most intractable challenge.

The problem is elegantly simple yet profound: machine learning models are great at finding patterns in historical data, but financial theory demands that arbitrage-free prices be independent of past information. As the authors put it:

We contend that a fundamental tension exists between the usage of ML methodologies in risk and pricing and the First Fundamental Theorem of Finance (FFTF). While ML models rely on historical data to identify recurring patterns, the FFTF posits that arbitrage-free market prices are independent of past information.

Their solution? Transition Probability Tensors (TPTs) that function like attention mechanisms in neural networks, dynamically weighting relationships between risk factors while maintaining mathematical rigor. Instead of learning from history, these tensors capture “dynamic, context-aware relationships across dimensions” in real-time.

The practical results are impressive: simulating 210 quantitative investment strategies across 100,000 market scenarios in just 70 seconds, while identifying optimal hedging strategies and stress-testing future market conditions. The framework even adapts to different volatility regimes, shifting focus toward tail events during high-volatility periods, exactly like attention mechanisms focusing on relevant context. Whether it scales beyond this impressive proof-of-concept remains to be seen, but it seems to be a genuine attempt to resolve the fundamental tension between AI’s pattern-seeking nature and finance’s requirement for arbitrage-free pricing.

April 25, 2025 • ∞

DeFi's $42 Billion Maturity Story →

A new academic review by Ali Farhani reveals that institutional Total Value Locked in DeFi protocols hit $42 billion in 2024, with BlackRock leading the charge by launching a $250 million tokenized fund on Centrifuge.

The numbers tell a remarkable story of maturation. Layer 2 solutions like Optimism and Arbitrum now dominate the scaling landscape, while zero-knowledge proofs have reduced compliance costs by 30%. Even the terminology is evolving—researchers now discuss “Total Value Redeemable” instead of the traditional TVL metric, acknowledging that not all locked value is immediately liquid. Despite technological advances, security incidents persist with painful regularity: $350 million lost in the Wormhole bridge exploit, $81 million in Orbit Chain’s multi-signature failure. Cross-chain bridges remain “high-risk attack targets,” a sobering reminder that connecting different blockchains is still more art than science. The regulatory landscape is complicated as well. Europe’s MiCA regulation provides clear frameworks, while the SEC maintains its enforcement-first approach. Hong Kong’s innovation sandbox offers a third path, balancing experimentation with oversight.

DeFi is transitioning from a disruptive experiment to an integrated component of the global financial system

That transition isn’t complete—Layer 2 solutions are projected to host over 70% of DeFi TVL by mid-2025—but the direction is clear.