Sentiment Trading Revisited

Interesting new paper that builds on many of the ideas I explored in this project. The research, by Ayaan Qayyum, an Undergraduate Research Scholar at Rutgers, shows that the core concept of using advanced language models for sentiment trading is not only viable but highly effective. The study takes a similar but more advanced approach. Instead of using a model like GPT-3.5 to generate a simple sentiment score, it uses OpenAI’s embedding models to convert news headlines into rich, high-dimensional vectors. By training a battery of neural networks including

Gated Recurrent Units (GRU), Hidden Markov Model (HMM), Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and a Feed-Forward Neural Network (FFNN). All were implemented using PyTorch.

on these embeddings alongside economic data, the study found it could reduce prediction errors by up to 40% compared to models without the news data.
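Concretely, the approach looks roughly like this. A minimal sketch, assuming OpenAI's text-embedding-3-small model, a toy feed-forward network, and made-up economic features and return targets; the paper's actual models (GRU, LSTM, TCN, etc.), features, and training setup differ:

```python
# Sketch: embed headlines with an OpenAI embedding model and train a small
# PyTorch feed-forward regressor on embeddings + economic features.
# Model name, feature layout, and targets are illustrative assumptions,
# not the paper's actual setup.
import numpy as np
import torch
import torch.nn as nn
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(headlines: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=headlines)
    return np.array([d.embedding for d in resp.data], dtype=np.float32)

headlines = ["Fed signals rate cut", "Tech earnings disappoint"]        # toy data
X_text = embed(headlines)                                               # (n, 1536)
X_econ = np.array([[0.2, 3.1], [0.1, 3.0]], np.float32)                 # made-up macro features
X = torch.tensor(np.hstack([X_text, X_econ]))
y = torch.tensor([[0.004], [-0.012]])                                   # made-up next-day return targets

model = nn.Sequential(nn.Linear(X.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(200):          # tiny training loop, no batching or validation
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```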

The most surprising insight to me, and one that directly addresses the challenge of temporal drift I discussed, was that Qayyum's time-independent models performed just as well as, if not better than, the time-dependent ones. By shuffling the data, the models were forced to learn the pure semantic impact of a headline, independent of its specific place in time. This suggests that the market reacts to the substance of news in consistent ways, even if the narratives themselves change.

Counting Cards with Computer Vision

After installing Claude Code

the agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster through natural language commands

I was looking for a task to test its abilities. Fairly quickly we wrote fewer than 200 lines of Python code predicting blackjack odds using Monte Carlo simulation. When I went on to test this little tool on the Washington Post's online blackjack game (I also didn't know that existed!) I quickly noticed how impractical it was to manually input all the card values on the table. What if the tool could also automatically recognize the cards on the table and calculate the odds from them? I had never done anything with computer vision, so this seemed like a good challenge.

To get to any reasonable result we have to start with classification, where we "teach" the model to categorize data by showing it lots of examples with correct labels. But where do the labels come from? I manually annotated 409 playing cards across 117 images using Roboflow Annotate (at first I only did half as much; why this wasn't a good idea we'll see in a minute). Once enough screenshots of cards were annotated, the model could be trained to recognize the cards and predict card values on tables it has never seen before. I was able to use an NVIDIA T4 GPU inside Google Colab, which offers some GPU time for free when capacity is available. During training, the algorithm learns patterns from this example data, adjusting its internal parameters millions of times until it gets really good at recognizing the differences between categories (in this case, different cards). Once trained, the model can then make predictions on new, unseen data by applying the patterns it learned.

With the annotated dataset ready, it was time to implement the actual computer vision model. I chose Ultralytics' YOLOv11, a state-of-the-art object detection model. I set up the environment in Google Colab following the "How to Train YOLO11 Object Detection on a Custom Dataset" notebook. After extracting the annotated dataset from Roboflow, I began training the model using the pre-trained YOLOv11s weights as a starting point. This approach, called transfer learning, allows the model to leverage patterns already learned from millions of general images and adapt them to this specific task. I initially set it up to run for 350 epochs, though the model's built-in early stopping mechanism kicked in after 242 epochs when no improvement was observed for 100 consecutive epochs. The best results were achieved at epoch 142, taking around 13 minutes to complete on the Tesla T4 GPU.

The initial results were quite promising, with an overall mean Average Precision (mAP) of 80.5% at IoU threshold 0.5. Most individual card classes achieved good precision and recall scores, with only a few cards like the 6 and Queen showing slightly lower precision values.

[Image: Training results showing confusion matrix and loss curves]

However, looking at the confusion matrix and loss curves revealed some interesting patterns. While the model was learning effectively (as shown by the steadily decreasing loss), there were still some misclassifications between similar cards, particularly among the numbered cards. This highlighted exactly why I mentioned earlier that annotating only half the amount of data initially "wasn't a good idea": more training examples would likely improve these edge cases and reduce confusion between similar-looking cards.
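For reference, the training run described above boils down to a few lines with the Ultralytics API. A minimal sketch, assuming an illustrative path for the Roboflow export (paths, image size, and device flag are assumptions, not taken from the notebook):

```python
# Transfer-learning sketch with Ultralytics YOLO11: start from the pre-trained
# yolo11s weights and fine-tune on the annotated card dataset.
from ultralytics import YOLO

model = YOLO("yolo11s.pt")          # pre-trained small variant as starting point
results = model.train(
    data="cards/data.yaml",         # dataset yaml exported from Roboflow (assumed path)
    epochs=350,                     # requested epochs
    patience=100,                   # early stopping after 100 epochs without improvement
    imgsz=640,                      # assumed image size
    device=0,                       # the Colab T4 GPU
)
metrics = model.val()               # reports mAP@50 etc. on the validation split
```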
My first attempt at solving the remaining accuracy issues was to add another layer to the workflow by sending the detected cards to Anthropic's Claude API for additional OCR processing.

[Image: Roboflow workflow with Claude API integration]

This hybrid approach was very effective: the combination of YOLO's object detection (to dynamically crop the blackjack table down to individual cards) with Claude's advanced vision capabilities yielded 99.9% accuracy on the predicted cards. However, this solution came with a significant drawback: the additional API layer and the large model's processing overhead consumed valuable time, making it impractical for real-time gameplay. Seeking a faster solution, I implemented the same workflow locally using easyOCR instead. EasyOCR seems to be really good at extracting black text on a white background but might struggle with everything else. While it was able to correctly identify the card numbers when it detected them, it struggled to recognize around half of the cards in the first place, even when fed pre-cropped card images directly from the YOLO model. This inconsistency made it unreliable for the application.

Rather than continuing with band-aid solutions, I decided to go back and improve my dataset. I doubled the training data by adding another 60 screenshots with the same train/test split as before. More importantly, I went through all the previous annotations and fixed many of the bounding polygons. I noticed that several misidentifications were caused by the model detecting face-down dealer cards as valid cards, which happened because some annotations for face-up cards inadvertently included parts of the card backs next to them.

The improved dataset and cleaned annotations delivered what I was hoping for: the confusion matrix now shows a much cleaner diagonal pattern, indicating that the model correctly identifies most cards without the cross-contamination issues we saw earlier.

[Image: Final training results with improved dataset]

Both the training and validation losses converge smoothly without signs of overfitting, while the precision and recall metrics climb steadily to plateau near perfect scores. The mAP@50 reaches an impressive 99.5%. Most significantly, the confusion matrix now shows that the model has virtually eliminated false positives with background elements. The "background" column (rightmost) in the confusion matrix is now much cleaner, with only minimal misclassifications of actual cards as background noise.

[Image: Real-time blackjack card detection and odds calculation]

With the model trained and performing well, it was time to deploy it and play some blackjack. Initially, I tested the system using Roboflow's hosted API, which took around 4 seconds per inference, far too slow for practical gameplay. However, running the model locally on my laptop dramatically improved performance, achieving inference times of less than 0.1 seconds per image (1.3ms preprocess, 45.5ms inference, 0.4ms postprocess per image). I then integrated the model with MSS to capture a real-time feed of my browser window. The system automatically overlays the detected cards with their predicted values and confidence scores. The final implementation successfully combines the pieces: the computer vision model detects and identifies cards in real time, feeds this information to the Monte Carlo simulation, and displays both the card recognition results and the calculated odds directly on screen. Do not try this at your local (online) casino!
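For the curious, the real-time loop is conceptually simple: grab the browser region with MSS, run local inference with the trained weights, and hand the detected ranks to the odds calculation. A rough sketch, where the capture region, class names, and the monte_carlo_odds() helper are placeholders rather than the actual implementation:

```python
# Sketch of the real-time loop: screen capture via mss, local YOLO inference,
# and an overlay window. Region coordinates and paths are illustrative.
import cv2
import mss
import numpy as np
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")   # assumed path to trained weights
region = {"top": 200, "left": 100, "width": 1280, "height": 720}  # browser window area

def monte_carlo_odds(player_cards, dealer_card):
    ...  # the ~200-line Monte Carlo simulation lives here (not shown)

with mss.mss() as sct:
    while True:
        frame = np.array(sct.grab(region))[:, :, :3]           # BGRA screenshot -> BGR
        result = model.predict(frame, conf=0.5, verbose=False)[0]
        labels = [model.names[int(c)] for c in result.boxes.cls]
        annotated = result.plot()                               # boxes + confidence scores
        # ...split labels into player/dealer hands, then:
        # odds = monte_carlo_odds(player, dealer)
        cv2.imshow("blackjack", annotated)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
```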

NVIDIA Likes Small Language Models

A Small Language Model (SLM) is a LM that can fit onto a common consumer electronic device and perform inference with latency sufficiently low to be practical when serving the agentic requests of one user. […] We note that as of 2025, we would be comfortable with considering most models below 10bn parameters in size to be SLMs.

The (NVIDIA) researchers argue that most agentic applications perform repetitive, specialized tasks that don’t require the full generalist capabilities of LLMs. They propose heterogeneous agentic systems where SLMs handle most tasks while LLMs are used selectively for complex reasoning. They present three main arguments: (1) SLMs are sufficiently powerful for agentic tasks, as demonstrated by recent models like Microsoft’s Phi series, NVIDIA’s Nemotron-H family, and Hugging Face’s SmolLM2 series, which achieve comparable performance to much larger models while being 10-30x more efficient. (2) SLMs are inherently more operationally suitable for agentic systems due to their faster inference, lower latency, and ability to run on edge devices. (3) SLMs are necessarily more economical, offering significant cost savings in inference, fine-tuning, and deployment.

The paper addresses counterarguments about LLMs' superior language understanding and centralization benefits with studies (see Appendix B: LLM-to-SLM Replacement Case Studies) showing that 40-70% of LLM queries in popular open-source agents (MetaGPT, Open Operator, Cradle) could be replaced by specialized SLMs. One comment I read raised important concerns about the paper's analysis, particularly regarding context windows, which are arguably the highest technical barrier to SLM adoption in agentic systems. Modern agentic applications require substantial context: Claude 4 Sonnet's system prompt alone reportedly uses around 25k tokens, and a typical coding agent needs system instructions, tool definitions, file context, and project documentation, totaling 5-10k tokens before any actual work begins. Most SLMs that can run on consumer hardware are architecturally capped at 32k or 128k contexts, and achieving reasonable inference speeds at those limits already requires gaming hardware (8GB VRAM for a 7b model at 128k context).

The paper concludes that the shift to SLMs is inevitable due to economic and operational advantages, despite current barriers including infrastructure investment in LLM serving, generalist benchmark focus, and limited awareness of SLM capabilities. But the economic efficiency claims also face scrutiny under system-level analysis. In Section 3.2 they present simplistic FLOP comparisons while ignoring critical inefficiencies: the reliance on multishot-prompting where SLMs might require 3-4 attempts for tasks that LLMs complete with 90% success rate, task decomposition overhead that multiplies context setup costs and error rates, and infrastructure efficiency differences between optimized datacenters (PUE ratios near 1.1, >90% GPU utilization) and consumer hardware (5-10% GPU utilization, residential HVAC, 80-85% power conversion efficiency). When accounting for failed attempts, orchestration overhead, and infrastructure efficiency, many “economical” SLM deployments might actually consume more total energy than centralized LLM inference.
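To make that system-level objection concrete, here is a deliberately crude back-of-envelope comparison using the figures quoted above plus an assumed ~10x per-attempt compute advantage for the SLM (the low end of the paper's 10-30x claim); every number is an assumption, and the calculation ignores batching, caching, and orchestration entirely:

```python
# Crude per-task energy comparison under the assumptions quoted above.
# All numbers are illustrative; treat this as a sensitivity sketch, not a result.
llm_flops = 1.0          # normalize LLM compute per successful task
slm_flops = 0.1          # ~10x cheaper per attempt (assumed)
slm_attempts = 3.5       # multishot retries for the same task (from the critique)
pue_dc, util_dc = 1.1, 0.90          # datacenter serving the LLM
conv_edge, util_edge = 0.82, 0.07    # consumer hardware running the SLM

llm_energy = llm_flops * pue_dc / util_dc
slm_energy = slm_flops * slm_attempts / (util_edge * conv_edge)
print(f"SLM / LLM energy per completed task: {slm_energy / llm_energy:.1f}x")
# ~5x under these assumptions -- small changes in utilization or retry count
# flip the conclusion either way.
```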

(05/07/2025) Update: On the topic of speed I just came across Ultra-Fast Language Models Based on Diffusion. You can also test it yourself using the free playground link, and it is in fact extremely fast. Try the "Diffusion Effect" in the top right corner, which toggles an interesting visualization. I'm not sure how realistic it is: it shows text appearing as random noise before gradually resolving into clear words, though the actual process likely involves tokens evolving from imprecise vectors in a multidimensional space toward more precise representations until they crystallize into specific words.

(06/07/2025) Update II: Apparently there is also a Google DeepMind Gemini Diffusion Model.

Novo Nordisk's Post-Patent Strategy

Novo Nordisk, a long-time member of my "regrets" stock list, has become reasonably affordable lately (-48% yoy). Part of the reason is that they currently sit atop a ~$20 billion Ozempic/Wegovy franchise that faces patent expiration in 2031. That's roughly seven years to replace their blockbuster drug. We revisit them today because, per newly published Lancet data, Novo's lead replacement candidate—amycretin—just posted some genuinely impressive Phase 1 results. The injectable version delivered 24.3% average weight loss versus 1.1% for placebo, beating both current market leaders (Wegovy at 15% and Lilly's Zepbound at 22.5%). Even the oral version hit 13.1% weight loss in just 12 weeks, with patients still losing weight when the trial ended.

Amycretin is very elegantly designed: it combines semaglutide (the active ingredient in Ozempic/Wegovy) with amylin, creating what's essentially a dual-pathway satiety signal. Semaglutide activates GLP-1 receptors to slow gastric emptying and reduce appetite centrally, while amylin works through complementary mechanisms to enhance fullness signals. This way both your stomach and your brain's "appetite control center" get the "stop eating" message simultaneously. One concern raised by Elaine Chen at STAT is that the results of the Phase 1/2 study include unusual findings around dosage. The full-text article is unfortunately behind a paywall, so I did not have access. However, looking at the actual data from the study, I am assuming she is referring to Parts C, D, and E, which tested maintenance doses of 20 mg, 5 mg, and 1.25 mg respectively. The weight loss results were:

Part C (20 mg): -22.0% weight loss at 36 weeks
Part D (5 mg): -16.2% weight loss at 28 weeks
Part E (1.25 mg): -9.7% weight loss at 20 weeks

While there is a dose-response relationship, what's notable is that the curves in Figure 3 show relatively similar trajectories during the overlapping time periods. Typically in drug development, researchers would expect clear separation between dose groups (with higher doses producing proportionally greater effects). When weight-loss curves overlap significantly (which they do in this case), it suggests the doses may be producing similar effects despite different drug concentrations. If lower doses produce similar weight loss with potentially fewer side effects, this could favor using the lower, better-tolerated dose. Further, it might indicate that amycretin reaches its maximum effect at relatively low doses. This should probably influence how future Phase 3 trials are designed, potentially focusing on the optimal dose rather than the maximum tolerated dose. Given that gastrointestinal side effects were dose-dependent but efficacy curves overlapped, this supports using the lowest effective dose. How that might be a bad thing I have yet to find out.

From a financial perspective, Novo Nordisk's pipeline is very interesting: Amycretin's injectable version is currently in Phase 2, suggesting Phase 3 trials around 2026-2027, with potential approval by 2031; basically right as the Ozempic patents expire. But Novo isn't betting everything on amycretin. They're running what appears to be a diversified pipeline strategy with multiple shots on goal: NNC-0519 (another next-gen GLP-1), NNC-0662 (details kept confidential), and cagrilintide combinations. This makes sense: you want multiple candidates because the failure rate in drug development makes even the most promising compounds statistically likely to fail. Eli Lilly's tirzepatide (Mounjaro/Zepbound) works through a different mechanism—GLP-1 plus GIP receptor activation—and appears to be gaining market share. Lilly's orforglipron, an oral GLP-1 that hit 14.7% weight loss in Phase 2, represents another competitive threat. Judging by LLY's price development, investors currently seem to think that Lilly is doing a better job at architecting a portfolio than Novo (or at least providing more disclosure about their pipeline). Yet the overall competitive landscape might actually benefit both companies. The "war" between Novo and Lilly is expanding the overall market for obesity treatments, potentially growing the pie faster than either company is losing share. Also, to analyze the financial impact of the expiring Ozempic patents, we have to look further than just Novo's research pipeline. Manufacturing these GLP-1 compounds and their delivery devices is "pretty tough": these are complex peptides requiring specialized manufacturing capabilities, and the injection devices themselves are patent-protected. This creates what we would call a capacity-constraint moat in corporate strategy. Novo's manufacturing capabilities/partnerships and injectable device patents are a key competitive advantage. Even when semaglutide goes generic in 2031, the entire generic pharmaceutical industry would essentially need to coordinate to build sufficient manufacturing capacity to meaningfully dent Novo's market share. Meanwhile, Novo could potentially defend by lowering prices while maintaining manufacturing advantages in a monopoly-to-oligopoly transition.

The other day I came across Martin Shkreli's NOVO model. Conservatively, it puts Novo's fair value around 705 DKK (21% upside from ~585 DKK), while a failure scenario drops the valuation to 385 DKK. The range reflects what you'd expect for a large-cap pharmaceutical company; the market has already incorporated most knowable information about pipeline risks and patent timelines. This also underscores the point that manufacturing capabilities and continuous innovation pipelines can potentially maintain quasi-monopolistic positions longer than traditional patent protection would suggest. Shkreli's analysis suggests Novo Nordisk is reasonably valued with modest upside potential, contingent on successful pipeline execution. Novo Nordisk is at a critical juncture, with substantial franchise value dependent on that execution over the next 7-8 years. While the current valuation appears reasonable, the binary nature of drug development success creates both upside potential and significant downside risk.

This article is for informational purposes only, you should not consider any information or other material on this site as investment, financial, or other advice. There are risks associated with investing.

Behavioral Economics & Transit Policy

Over the weekend, a WSJ editorial on the 2025 New York City mayoral election called one of the potential Democratic candidates, Zohran Mamdani, "a literal socialist" for, among other things, running on the promise of free bus rides for all:

Zohran won New York’s first fare-free bus pilot on five lines across the city. As Mayor, he’ll permanently eliminate the fare on every city bus […] Fast and free buses will not only make buses reliable and accessible but will improve safety for riders and operators – creating the world-class service New Yorkers deserve.

Free public transit seems to be a recurring idea among politicians: For some reason, making it free feels revolutionary in a way that making it cheaper never could. There’s actually some solid behavioral economics behind this intuition: “Zero as a Special Price: The True Value of Free Products.” (Yes, before the fabricated data scandals, Ariely did write research that has replicated consistently.) The basic finding: people don’t treat “free” as just another very low price. When you price something at zero, it gets a special psychological boost that makes people value it way more than they should based on pure cost-benefit analysis: Give people a choice between a Hershey’s Kiss for 1¢ and a Lindt truffle for 15¢. Most people choose the obviously superior Lindt. Now make it Hershey’s for free versus Lindt for 14¢—keeping the price difference exactly the same—and suddenly everyone wants the Hershey’s. Free doesn’t just eliminate cost; it creates additional perceived value. The mechanism is pure affect. “Free” makes people feel good in a way that “1¢” doesn’t, even though the economic difference is trivial. When you force people to think analytically about the trade-offs, the effect disappears. But in normal decision-making, that warm fuzzy feeling of getting something for nothing dominates rational calculation.

Now think about bus fare policy. The difference between $2.75 and $0 isn’t meaningfully different from the difference between $2.75 and $0.75 for most riders’ budgets. But psychologically? Free transit feels like a gift from the city. Cheap transit feels like commerce. The first activates social norms (gratitude, civic participation, shared ownership). The second activates market norms (cost-benefit analysis, value-for-money calculations, consumer complaints when service is bad). On the other side, any positive price, no matter how small, forces people into analytical mode. They start thinking about trade-offs, evaluating whether the service is worth it, considering alternatives. This is why congestion pricing works so well. A $5 charge to drive in Manhattan (or Singapore, London, Stockholm, Milan, Gothenburg) isn’t going to bankrupt anyone who can afford to drive in Manhattan. But it makes people think about each trip in a way they never did when driving felt “free” (ignoring gas, parking, insurance, etc.). Once you’re thinking analytically rather than just following habit, you’re much more likely to take the subway.

But!! Free transit might actually make it easier to cut transit funding, not harder. Think about it: Right now, when transit agencies face budget cuts, fare-paying riders get angry. They’re customers! They paid for service! They demand value for money! This creates a natural constituency defending transit budgets. Make transit free, and you’ve eliminated that market relationship. Riders become passive beneficiaries rather than paying customers. When service gets worse, they can’t complain about not getting their money’s worth; they’re getting exactly what they paid for. If I were a politician looking to slash subsidies without political blowback, step one would be eliminating fares. Tell everyone it’s about equity and access. Then, once people stop thinking of themselves as customers, start the real cuts. No more late-night service; hey, it’s free! Longer waits, dirtier stations, broken escalators; what did you expect for nothing? The behavioral economics is clear: when something is free, people have lower expectations and less standing to complain. The zero-price effect works both ways. None of this means transit shouldn’t be affordable!

It Just Ain’t So

It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.

This (not actually) Mark Twain quote from The Big Short captures the sentiment of realizing that some foundational assumptions might be empirically wrong.

A recent article by Anton Vorobets that I came across in Justina Lee’s Quant Newsletter presents compelling evidence that challenges one of the field’s fundamental statistical assumptions, that asset returns follow normal distributions. Using 26 years of data from 10 US equity indices, he ran formal normality tests (Shapiro-Wilk, D’Agostino’s K², Anderson-Darling) and found that the normal distribution hypothesis gets rejected in most cases. The supposed “Aggregational Gaussianity” that academics invoke through Central Limit Theorem arguments? It’s mostly wishful thinking enabled by small sample sizes. As Vorobets observes:

Finance and economics academia is unfortunately driven by several convenient myths, i.e., claims that are taken for granted and spread among university academics despite their poor empirical support.

The article highlights significant practical consequences for portfolio management and risk assessment. Portfolio optimization based on normal distribution assumptions ignores fat left tails—exactly the kind of extreme downside events that can wipe out portfolios. This misspecification can lead to inadequate risk management and suboptimal asset allocation decisions. Vorobets suggests alternative approaches, including Monte Carlo simulations combined with Conditional Value-at-Risk (CVaR) optimization, which better accommodate the complex distributional properties observed in financial data. While computationally more demanding, these methods align much better with empirical reality.
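For anyone who wants to reproduce the basic exercise, the three normality tests are one-liners in SciPy. A minimal sketch, assuming a placeholder CSV of daily index closes (the ticker, file, and column name are not from the article):

```python
# Sketch: run Shapiro-Wilk, D'Agostino's K^2, and Anderson-Darling tests on
# daily log returns of one index. Data source and column names are placeholders.
import numpy as np
import pandas as pd
from scipy import stats

prices = pd.read_csv("spx_daily.csv", index_col=0, parse_dates=True)["close"]
returns = np.log(prices).diff().dropna().values

sw_stat, sw_p = stats.shapiro(returns)        # Shapiro-Wilk
k2_stat, k2_p = stats.normaltest(returns)     # D'Agostino's K^2
ad = stats.anderson(returns, dist="norm")     # Anderson-Darling

print(f"Shapiro-Wilk p = {sw_p:.2e}")
print(f"D'Agostino K^2 p = {k2_p:.2e}")
print(f"Anderson-Darling stat = {ad.statistic:.2f} "
      f"(5% critical value = {ad.critical_values[2]:.2f})")
# Normality is rejected when the p-values are tiny / the AD statistic exceeds
# its critical value -- which is what Vorobets reports for most of the indices.
```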

Reading this piece gave me a few ideas for extensions I might want to explore in an upcoming personal project: (1) While Vorobets focuses on US equity indices, similar analysis across fixed income, commodities, currencies, and alternative assets would provide a more comprehensive view of distributional properties across financial markets. Each asset class exhibits distinct market microstructure characteristics that may influence distributional behavior. (2) Global Market Coverage: Extending the geographic scope to include developed, emerging, and frontier markets would illuminate whether the documented deviations from normality represent universal phenomena or are specific to US market structures. Cross-regional analysis could reveal important insights about market development, regulatory frameworks, and institutional differences. (3) Building upon Vorobets’ foundation, there are opportunities to incorporate multivariate normality testing, regime-dependent analysis, and time-varying parameter models. Additionally, investigating the power and robustness of different statistical tests across various market conditions would strengthen the methodological contribution. (4) Examining different time horizons, market regimes (pre- and post-financial crisis, COVID period), and potentially higher-frequency data could provide deeper insights into when and why distributional assumptions break down.

Not All AI Skeptics Think Alike

Apple’s recent paper “The Illusion of Thinking” has been widely understood to demonstrate that reasoning models don’t ‘actually’ reason. Using controllable puzzle environments instead of contaminated math benchmarks, they discovered something fascinating: there are three distinct performance regimes when it comes to AI reasoning complexity. For simple problems, standard models actually outperform reasoning models while being more token-efficient. At medium complexity, reasoning models show their advantage. But at high complexity? Both collapse completely. Here’s the kicker: reasoning models exhibit counterintuitive scaling behavior—their thinking effort increases with problem complexity up to a point, then declines despite having adequate token budget. It’s like watching a student give up mid-exam when the questions get too hard, even though they have plenty of time left.

We observe that reasoning models initially increase their thinking tokens proportionally with problem complexity. However, upon approaching a critical threshold—which closely corresponds to their accuracy collapse point—models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty.

The researchers found something even more surprising: even when they provided explicit algorithms—essentially giving the models the answers—performance didn't improve. The collapse happened at roughly the same complexity threshold. On the other hand, Sean Goedecke is not buying Apple's methodology. His core objection? Puzzles "require computer-like algorithm-following more than they require the kind of reasoning you need to solve math problems."

You can’t compare eight-disk to ten-disk Tower of Hanoi, because you’re comparing “can the model work through the algorithm” to “can the model invent a solution that avoids having to work through the algorithm”.
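For scale: the minimal Tower of Hanoi solution requires 2^n - 1 moves, so eight disks already take 255 perfectly ordered steps and ten disks take 1,023. A quick sketch of the algorithm the model is being asked to follow:

```python
# Minimum-move Tower of Hanoi solver; the move count is 2**n - 1.
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)       # park n-1 disks on the spare peg
    moves.append((src, dst))                 # move the largest disk
    hanoi(n - 1, aux, src, dst, moves)       # stack the n-1 disks back on top
    return moves

for n in (8, 10):
    print(n, len(hanoi(n)))   # 8 -> 255 moves, 10 -> 1023 moves
```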

From his own testing, models “decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to even start.” That’s strategic behavior, not reasoning failure. This matters because it shows how evaluation methodology shapes our understanding of AI capabilities. Goedecke argues Tower of Hanoi puzzles aren’t useful for determining reasoning ability, and that the complexity threshold of reasoning models may not be fixed.

Your AI Assistant Might Rat You Out

There was this story going around the past few days

Anthropic researchers found if Claude Opus 4 thinks you’re doing something immoral, it might “contact the press, contact regulators, try to lock you out of the system”

The story was mostly driven by a Sam Bowman tweet referring to the Claude 4 System Card's section 4.1.9 on high-agency behavior. The outrage came largely from people misunderstanding the prerequisites necessary for such a scenario. Nevertheless, an interesting question emerged: What happens when you feed an AI model evidence of fraud and give it an email tool? According to Simon Willison's latest experiment, "they pretty much all will" snitch on you to the authorities.

A fun new benchmark just dropped! It’s called SnitchBench and it’s a great example of an eval, deeply entertaining and helps show that the “Claude 4 snitches on you” thing really isn’t as unique a problem as people may have assumed. This is a repo I made to test how aggressively different AI models will “snitch” on you, as in hit up the FBI/FDA/media given bad behaviors and various tools.

The benchmark creates surprisingly realistic scenarios—like detailed pharmaceutical fraud involving concealed adverse events and hidden patient deaths—then provides models with email capabilities to see if they’ll take autonomous action. This reveals something fascinating about AI behavior that goes beyond traditional benchmarks. Rather than testing reasoning or knowledge, SnitchBench probes the boundaries between helpful assistance and autonomous moral decision-making. When models encounter what appears to be serious wrongdoing, do they become digital whistleblowers?

The implications are both reassuring and unsettling. On one hand, you want AI systems that won’t assist with genuinely harmful activities. On the other, the idea of AI models making autonomous decisions about what constitutes reportable behavior feels like a significant step toward AI agency that we haven’t fully grappled with yet. Therefore, Anthropic’s own advice here seems like a good rule to follow:

Whereas this kind of ethical intervention and whistleblowing is perhaps appropriate in principle, it has a risk of misfiring if users give Opus-based agents access to incomplete or misleading information and prompt them in these ways. We recommend that users exercise caution with instructions like these that invite high-agency behavior in contexts that could appear ethically questionable.

Modeling Glycemic Response with XGBoost

Earlier this year I wrote about how I built a CGM data reader after wearing a continuous glucose monitor myself. Since I was already logging my macronutrients and learning more about molecular biology in an MIT MOOC, I became curious whether a meal's macronutrients (carbs, protein, fat) and some basic individual characteristics (age, BMI) could serve as features in a regression model to predict the parameters of the postprandial glucose curve (how blood sugar changes after eating). I came across a paper on Personalized Nutrition by Prediction of Glycemic Responses which did exactly that. Unfortunately, neither the data nor the code were publicly available. And I wanted to predict my own glycemic response curve. So I decided to build my own model; in the process I wrote this working paper.

[Image: Overview of working paper pages]

The paper represents an exercise in applying machine learning techniques to medical applications. The methodologies employed were largely inspired by Zeevi et al.'s approach. I quickly realized that training a model on my own data alone was not very promising, if not impossible. To tackle this, I used the publicly available Hall dataset containing continuous glucose monitoring data from 57 adults, which I narrowed down to 112 standardized meals from 19 non-diabetic subjects with their respective post-meal glucose curves (full methodology in the paper).

[Image: Overview of the CGM pipeline workflow]

Rather than trying to predict the entire glucose curve, I simplified the problem by fitting each postprandial response to a normalized Gaussian function. This gave me three key parameters to predict: amplitude (how high glucose rises), time-to-peak (when it peaks), and curve width (how long the response lasts).

[Image: Single fitted curve of CGM measurements]

The Gaussian approximation worked surprisingly well for characterizing most glucose responses. While some curves fit better than others, the majority of postprandial responses were well captured, though there's clear variation between individuals and meals. Some responses were high-amplitude and narrow, while others were more gradual and prolonged.

[Image: Overview of selected fitted curves]

I then trained an XGBoost regressor with 27 engineered features including meal composition, participant characteristics, and interaction terms. XGBoost was chosen for its ability to handle mixed data types, built-in feature importance, and strong performance on tabular data. The pipeline included hyperparameter tuning with 5-fold cross-validation to optimize learning rate, tree depth, and regularization parameters. Rather than relying solely on basic meal macronutrients, I engineered features across multiple categories and implemented CGM statistical features calculated over different time windows (24-hour and 4-hour periods), including time-in-range and glucose variability metrics. Architecture-wise, I trained three separate XGBoost regressors, one for each Gaussian parameter.
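Condensed, the pipeline's two core steps look roughly like this: fit each response to a Gaussian, then regress the fitted parameters on meal and participant features. A rough sketch with placeholder column names and a toy feature grid (the actual notebooks use 27 engineered features and a larger hyperparameter search):

```python
# Sketch of the two core steps: (1) fit each postprandial window to a Gaussian
# to get (amplitude, time-to-peak, width), (2) train one XGBoost regressor per
# parameter. Column names and the feature list are illustrative placeholders.
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

def gaussian(t, amplitude, t_peak, width):
    return amplitude * np.exp(-((t - t_peak) ** 2) / (2 * width ** 2))

def fit_curve(t_minutes, glucose_delta):
    """Fit one baseline-subtracted postprandial response to a Gaussian."""
    p0 = [glucose_delta.max(), t_minutes[np.argmax(glucose_delta)], 30.0]
    params, _ = curve_fit(gaussian, t_minutes, glucose_delta, p0=p0, maxfev=5000)
    return params  # amplitude, time-to-peak, width

# fit_curve() is applied per meal upstream to produce the target columns below.
meals = pd.read_csv("meals_with_gaussian_params.csv")        # placeholder file
features = ["carbs_g", "protein_g", "fat_g", "age", "bmi", "carbs_x_bmi"]
grid = {"learning_rate": [0.03, 0.1], "max_depth": [3, 5], "reg_lambda": [1, 10]}

models = {}
for target in ["amplitude", "time_to_peak", "width"]:
    search = GridSearchCV(XGBRegressor(n_estimators=300), grid, cv=5, scoring="r2")
    search.fit(meals[features], meals[target])
    models[target] = search.best_estimator_
    print(target, f"CV R^2 = {search.best_score_:.2f}")
```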

While the model achieved moderate success predicting amplitude (R² = 0.46), it completely failed at predicting timing - time-to-peak prediction was essentially random (R² = -0.76), and curve width prediction was barely better (R² = 0.10). Even the amplitude prediction, while statistically significant, falls well short of an R² > 0.7. Studies that have achieved better predictive performance typically used much larger datasets (>1000 participants). For my original goal of predicting my own glycemic responses, this suggests that either individual-specific models trained on extensive personal data, or much more sophisticated approaches incorporating larger training datasets, would be necessary.

The complete code, Jupyter notebooks, processed datasets, and supplementary results are available in my GitHub repository.

(10/06/2025) Update: Today I came across Marcel Salathé’s LinkedIn post on a publication out of EPFL: Personalized glucose prediction using in situ data only.

With data from over 1,000 participants of the Food & You digital cohort, we show that a machine learning model using only food data from myFoodRepo and a glucose monitor can closely track real blood sugar responses to any meal (correlation of 0.71).

As expected, Singh et al. achieve substantially better predictive performance (R = 0.71 vs. my R² = 0.46). Besides probably higher methodological rigor and scientific quality, the most critical difference is sample size: their 1'000+ participants versus my 19 (from the Hall dataset) represent a fundamental difference in statistical power and generalizability. They addressed one of the shortcomings I faced by leveraging a large digital nutritional cohort from the "Food & You" study (including high-resolution data on nutritional intake of more than 46 million kcal collected from 315'126 dishes over 23'335 participant days, 1'470'030 blood glucose measurements, 49'110 survey responses, and 1'024 samples for gut microbiota analysis).

Apart from that, I am excited to observe, at first glance, the following similarities: (1) Both aim to predict postprandial glycemic responses using machine learning, with a focus on personalized nutrition applications. (2) Both employ XGBoost regression as their primary predictive algorithm and use similar performance metrics (R², RMSE, MAE, Pearson correlation). (3) Both extract comprehensive feature sets including meal composition (macronutrients), temporal features, and individual characteristics. (4) Both use mathematical approaches to characterize glucose responses: I used Gaussian curve fitting, while Singh et al. use the incremental area under the curve (iAUC). (5) Both employ cross-validation techniques for model evaluation and hyperparameter tuning. (6) Both use SHAP for model interpretability and feature importance analysis.

Gambling vs. Investing

Kalshi, a prediction market startup, is using its federal financial license to offer sports betting nationwide, even in states where it’s not legal. The move has earned them cease-and-desist letters from state gaming regulators, but CEO Tarek Mansour isn’t backing down:

We can go one by one for every financial market and it would fall under the definition of gambling. So what’s the difference?

It’s a question that cuts to the heart of modern finance. The founders argue that Wall Street blurred the line between investing and gambling long ago, and casting Kalshi as the latter is inconsistent at best. They have a point—if you can bet on oil futures, Nvidia’s stock price, or interest rate movements, why is wagering on NFL touchdowns more objectionable?

Benefiting from the Trump administration’s hands-off regulatory approach, with the CFTC dropping its legal challenge to their election contracts, the odds might be in their favor. Even better, a Kalshi board member is awaiting confirmation to lead the very agency that was previously their biggest antagonist.

The technical distinction matters: Kalshi operates as an exchange between traders rather than a house taking bets against customers. But functionally, with 79% of their recent trading volume being sports-related, they’re forcing us to confront an uncomfortable reality about risk, speculation, and what we choose to call “investing.”

Whether you call it innovation or regulatory arbitrage, Kalshi is exposing the arbitrary nature of the lines we’ve drawn around acceptable financial speculation.

(17/06/2025) Update: Matt Levine - one of the finance columnists I enjoy reading most - just published a long piece “It’s Not Gambling, It’s Predicting” in his newsletter on exactly this issue:

Kalshi offers a prediction market where you can bet on sports. No! Sorry! Wrong! It offers a prediction market where you can predict which team will win a sports game, and if you predict correctly you make money, and if you predict incorrectly you lose money. Not “bet on sports.” “Predict sports outcomes for money.” Completely different.