Projects

Against All Odds: The Mathematics of 'Provably Fair' Casino Games


Gambling can be harmful and lead to significant losses. Participation is subject to local laws and age restrictions. Always gamble responsibly. Need help? Visit BeGambleAware.org


Crash games represent a category of online gambling where players place bets on an increasing multiplier that can ‘crash’ at any moment. The fundamental mechanic requires players to cash out before the crash occurs; successful cash-outs yield the bet amount multiplied by the current multiplier, while failure results in total loss of the wager.

Crash game showing an airplane flying with increasing multiplier until it crashes

The specific game I came across is a variant that employs an aircraft flight metaphor. Let’s call it Plane Game. What intrigued me wasn’t the game itself but that it said “provably fair” on the startup screen, which I assumed to be a typo at first. I stand corrected:

A provably fair gambling system uses cryptography to let players verify that each outcome was generated from fixed inputs, rather than chosen or altered by the operator after a bet is placed. The casino commits to a hidden “server seed” via a public hash, combines it with a player-controlled “client seed” and a per-bet nonce, and later reveals the server seed so anyone can recompute and confirm the result.
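The commit-reveal scheme can be sketched in a few lines. This is a generic illustration of the mechanism described above, not Plane Game's actual derivation (operators differ in how they map the digest to a multiplier); the HMAC-based mapping below is an assumption chosen so that it matches the survival function discussed later.

```python
import hashlib
import hmac

def commit(server_seed: str) -> str:
    """Operator publishes this hash before any bets are taken."""
    return hashlib.sha256(server_seed.encode()).hexdigest()

def crash_multiplier(server_seed: str, client_seed: str,
                     nonce: int, rtp: float = 0.97) -> float:
    """Derive a round outcome deterministically from the seeds.
    Illustrative only: the exact derivation varies by operator."""
    digest = hmac.new(server_seed.encode(),
                      f"{client_seed}:{nonce}".encode(),
                      hashlib.sha256).hexdigest()
    # Map the first 13 hex chars (52 bits) to a uniform number in (0, 1)
    u = max(int(digest[:13], 16) / float(16 ** 13), 1e-12)
    return max(1.0, rtp / u)  # gives P(M >= m) = rtp / m

def verify(commitment: str, revealed_seed: str) -> bool:
    """After the round, anyone can recheck the revealed seed."""
    return commit(revealed_seed) == commitment

c = commit("secret-seed")
assert verify(c, "secret-seed")
assert not verify(c, "tampered-seed")
```

Because the commitment is published before the bet and the client seed enters the HMAC, neither side can steer the outcome after the fact without the hash check failing.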

The stated Return-to-Player (RTP) of that specific game is 97%, implying a 3% house edge. After watching a few rounds, the observed crash frequencies felt off. And if there’s something that gets my attention, it’s the combination of games and statistics. So I did what any reasonable person would do: I watched another 20,000 rounds over six days (112 hours total) and wrote a paper about it.

Script recording 20,000 rounds over six days (112 hours total)

The distribution below shows the classic heavy tail: most rounds crash quickly at low multipliers, while rare events produce 100x or even 1000x payouts. The maximum I observed was 10,000x. This extreme variance creates the illusion of big wins just around the corner while the house edge operates relentlessly over time.

Heavy-tailed distribution of crash multipliers on log-log scale showing most rounds end at low multipliers while rare events exceed 100x or 1000x, with maximum observed at 10,000x

For a crash game with RTP = r (where 0 < r < 1), the crash multiplier M follows a specific probability distribution. The survival function is particularly relevant:

$$P(M \geq m) = \frac{r}{m}$$

This means the probability of reaching at least multiplier m before crashing equals r/m. For any cash-out target, the expected value of a unit bet works out to:

$$E[\text{Profit}] = P(M \geq m) \times m - 1 = \frac{r}{m} \times m - 1 = r - 1 = -0.03$$

This mathematical property makes crash games theoretically “strategy-proof” in expectation. No cash-out timing strategy should yield better long-term results than another.

Survival probability curve on log-log scale showing probability of reaching target multiplier: 2x succeeds 48.5% of the time, 5x at 19.6%, 10x at 9.7%, 50x at 2.0%, and 100x at just 1.1%

The empirical data matches theory almost perfectly. A 2x target succeeds about 48.5% of the time. Aiming for 10x? That works only 9.7% of rounds. The close fit between my observations and the theoretical line confirms the stated 97% RTP.
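The strategy-proofness claim is easy to check numerically. A minimal sketch, assuming outcomes are sampled directly from the survival function P(M ≥ m) = r/m rather than from the operator's actual generator: every cash-out target lands near the same expectation.

```python
import random

def simulate_strategy(target: float, rtp: float = 0.97,
                      rounds: int = 200_000, seed: int = 42) -> float:
    """Average profit per unit bet when always cashing out at `target`."""
    rng = random.Random(seed)
    profit = 0.0
    for _ in range(rounds):
        # Inverse-transform sampling: P(M >= x) = rtp / x
        m = rtp / max(rng.random(), 1e-12)
        profit += (target - 1) if m >= target else -1
    return profit / rounds

for target in (1.5, 2.0, 5.0):
    print(f"target {target}x: average profit {simulate_strategy(target):+.3f}")
```

All three targets come out close to -0.03 per unit bet, matching the r - 1 expectation; only the variance changes with the target.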

So is the game fair? My analysis says yes. Using three different statistical methods (log-log regression, maximum likelihood, and the Hill estimator), I estimated the probability density function exponent at α ≈ 1.98, within 2.2% of the theoretical value of 2.0. This contrasts with Wang and Pleimling’s 2019 research, which found exponents of 1.4 to 1.9 for player cash-out distributions. The key distinction: their deviations reflect player behavioral biases (probability weighting), not game manipulation. The random number generator produces fair outcomes.

Q-Q plot comparing empirical vs theoretical quantiles with perfect fit line and 10% confidence band, showing close alignment confirming fair random number generation

I then ran Monte Carlo simulations of 10,000 betting sessions under four different strategies: conservative 1.5x cashouts, moderate 2.0x, aggressive 3.0x, and high-risk 5.0x targets.

Strategy comparison boxplot showing session returns for 100 rounds: 1.5x Conservative averages -2.9%, 2.0x Moderate -2.4%, 3.0x Aggressive -3.3%, and 5.0x High Risk -3.5%, all negative

Every single strategy produces negative expected returns. The conservative approach has lower variance but still loses. The aggressive strategies lose faster with higher variance.

Simulated player sessions using 1.5x strategy over 200 rounds showing multiple trajectories trending toward expected loss line of -3% per round

The consumer protection angle is what concerns me most. My data revealed 179 rounds per hour with 16-second median intervals. At that pace, with a 3% house edge per round, a player betting a fixed stake every round faces expected losses exceeding 500% of that stake per hour of play. The manual cash-out mechanic creates an illusion of control, masking the deterministic nature of losses.
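For reference, the Hill estimator used in the analysis above fits the tail exponent from the top-k order statistics. A minimal version (simplified; the paper's implementation may differ in threshold selection):

```python
import math
import random

def hill_estimator(samples, k: int) -> float:
    """Hill estimate of the tail exponent alpha from the top-k order statistics."""
    xs = sorted(samples, reverse=True)
    threshold = xs[k]  # the (k+1)-th largest value
    logs = [math.log(x / threshold) for x in xs[:k]]
    return k / sum(logs)

# Sanity check on synthetic Pareto data with known alpha = 2
rng = random.Random(0)
data = [(1 - rng.random()) ** -0.5 for _ in range(100_000)]  # P(X > x) = x^-2
print(hill_estimator(data, k=1000))  # close to 2.0
```

The choice of k trades bias against variance; in practice one plots the estimate over a range of k and looks for a stable plateau.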

The game is provably fair in the cryptographic sense. The mathematics check out. But mathematical fairness doesn’t ensure consumer safety. The house always wins, and it wins fast.

The only winning strategy is not to play.

The full paper preprint with methodology and statistical details is available on SSRN. Code and data are on GitHub.

Social Media Success Prediction: BERT Models for Post Titles

Last week I published a Hacker News title sentiment analysis based on the Attention Dynamics in Online Communities paper I have been working on. The discussion on Hacker News raised the obvious question: can you actually predict what will do well here?

Hacker News Frontpage

The honest answer is: partially. Timing matters. News cycles matter. Who submits matters. Weekend versus Monday morning matters. Most of these factors aren’t in the title. But titles aren’t nothing either. “Show HN” signals something. So does phrasing, length, and topic selection. The question becomes: how much signal can you extract from 80 characters?

Hacker News (HN) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator.

This isn’t new territory. Max Woolf built a Reddit submission predictor back in 2017, and ontology2 trained an HN classifier using logistic regression on title words. Both hit similar ceilings: around 0.76 AUC with classical approaches. I wanted to see what modern transformers could add.

The baseline was DistilBERT, fine-tuned on 90,000 HN posts. ROC AUC of 0.654, trained in about 20 minutes on a T4 GPU. Not bad for something that only sees titles. Then RoBERTa with label smoothing pushed it to 0.692. Progress felt easy.

ROC curve comparing model versions

What if sentence embeddings captured something classification heads missed? I built an ensemble: SBERT for semantic features, RoBERTa for discrimination, weighted average at the end. The validation AUC jumped to 0.714.

The problem was hiding in the train/test split. I’d used random sampling. HN has strong temporal correlations: topics cluster, writing styles evolve, news cycles create duplicates. A random split let the model see the future. SBERT’s semantic embeddings matched near-duplicate posts across the split perfectly.

When I switched to a strict temporal split, training on 2022-early 2024 and testing on late 2024 onward, the ensemble dropped to 0.693. More revealing: the optimal SBERT weight went from 0.35 to 0.10. SBERT was contributing almost nothing. The model had memorized temporal patterns, not learned to predict.

Calibration plot showing predicted vs actual probabilities

I kept RoBERTa and added more regularization: dropout raised from 0.1 to 0.2, weight decay from 0.01 to 0.05, and the lower six transformer layers frozen. The model got worse at fitting training data. Train AUC dropped from 0.803 to 0.727.
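The fix itself is mundane: split on time, not at random. A minimal sketch, with hypothetical post records carrying a `created_at` field:

```python
from datetime import datetime

def temporal_split(posts, cutoff: datetime):
    """Train on everything before `cutoff`, test on everything after.
    Prevents the model from 'seeing the future' via near-duplicate posts."""
    posts = sorted(posts, key=lambda p: p["created_at"])
    train = [p for p in posts if p["created_at"] < cutoff]
    test = [p for p in posts if p["created_at"] >= cutoff]
    return train, test

posts = [
    {"title": "Show HN: ...", "created_at": datetime(2023, 5, 1)},
    {"title": "Ask HN: ...", "created_at": datetime(2024, 11, 2)},
]
train, test = temporal_split(posts, cutoff=datetime(2024, 7, 1))
assert len(train) == 1 and len(test) == 1
```

The same cutoff must also govern any preprocessing fit on the data (vocabulary, normalization statistics), or the leak just moves upstream.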

But the train-test gap collapsed from 0.109 to 0.042. That’s a 61% reduction in overfitting. Test AUC of 0.685 versus the ensemble’s 0.693, a difference that vanishes once you account for confidence intervals. And now inference runs on a single model, half the latency, no SBERT dependency, 500MB instead of 900MB.

Model version comparison showing evolution from V1 to V7

Prediction scores by content category

The other lesson was calibration. A model that says 0.7 probability should mean “70% of posts I give this score actually hit 100 points.” Neural networks trained on cross-entropy don’t do this naturally. They’re overconfident. I used isotonic regression on the validation set to fix the mapping. Expected calibration error (ECE) measures this gap:

$$ECE = \sum_{b=1}^{B} \frac{n_b}{N} \left| \text{acc}(b) - \text{conf}(b) \right|$$

where you bin predictions by confidence, then measure how far off the actual accuracy is from the predicted confidence in each bin. ECE went from 0.089 to 0.043. Now when the model says 0.4, it’s telling the truth.
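The ECE formula above translates directly into code. A minimal sketch (not the exact evaluation script; bin count and binning scheme are assumptions):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins: int = 10) -> float:
    """ECE: weighted average gap between mean confidence and accuracy per bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probs == 1.0 are counted
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            conf = probs[mask].mean()   # average predicted probability
            acc = labels[mask].mean()   # observed hit rate in this bin
            ece += mask.mean() * abs(acc - conf)
    return ece

# Two occupied bins, each off by 0.15 -> ECE = 0.15
probs = [0.15, 0.15, 0.85, 0.85]
labels = [0, 0, 1, 1]
print(expected_calibration_error(probs, labels))  # 0.15
```

Isotonic regression then learns a monotone mapping from raw scores to calibrated probabilities on held-out data, shrinking exactly this quantity.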

In practice, the model provides meaningful lift. If you only look at the top 10% of predictions by score, 62% of them are actual hits, roughly 1.9x better than random selection.

Lift analysis showing precision at different thresholds

Calibration error distribution

About training speed: I used an NVIDIA H100 GPU, which runs around 18x more expensive than the T4 per hour on hosted (Google Colab) runtimes. A sensible middle ground would be an A100 (40 or 80GB VRAM) or L4, training 3-5x faster than a T4, maybe 5-7 minutes instead of 20-30. But watching epochs fly by at ~130 iterations per second after coming from the T4’s ~3 iterations per second was a different experience.

Colab notebook showing H100 training at 130 it/s

The model learned some intuitive patterns. “Show HN” titles score higher. Deep technical dives do well. Generic news aggregation doesn’t. Titles between 40-80 characters perform better than very short or very long ones. Some of this probably reflects real engagement patterns. Some of it is noise the model hasn’t been sufficiently regularized to ignore.

Model performance by title length
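The precision-at-top-decile lift is simple to compute; a sketch with a hypothetical `precision_at_top` helper on synthetic data (numbers here are illustrative, not the study's):

```python
import numpy as np

def precision_at_top(scores, labels, frac: float = 0.10):
    """Precision among the top `frac` of predictions, and lift over base rate."""
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=float)
    k = max(1, int(len(scores) * frac))
    top = np.argsort(scores)[::-1][:k]   # indices of the highest scores
    precision = labels[top].mean()
    return precision, precision / labels.mean()

# Synthetic example where hits cluster at high scores
rng = np.random.default_rng(0)
labels = rng.random(10_000) < 0.33           # ~33% base rate
scores = labels * 0.3 + rng.random(10_000)   # scores correlate with hits
prec, lift = precision_at_top(scores, labels.astype(float), 0.10)
```

A calibrated model makes this threshold choice interpretable: picking the top decile becomes "only read posts the model gives better-than-x odds."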

Running a few titles through the model shows what it picks up on:

Title workshop showing model predictions for different phrasings

Vague claims score low. Specificity helps. First-person “I built” framing does well, which matches what actually gets upvoted. The model isn’t learning to game HN; it’s learning what HN already rewards.

The model now runs in an RSS reader pipeline I built, scoring incoming articles. Does it help? Mostly. I still click on things marked low probability. But the high-confidence predictions are usually right. It’s a filter, not an oracle.

RSS reader dashboard showing HN prediction scores

Model on HuggingFace — Download the weights and run inference locally
RSS Reader Pipeline — Full scoring pipeline with feed aggregation
Training Notebook — Colab-ready notebook with the complete training code

On a side note: The patterns here aren’t specific to Hacker News or online communities. Temporal leakage shows up whenever you’re predicting something that evolves over time: credit defaults, client churn, market regimes. The fix is the same: validate on future data, not random holdouts. Calibration matters anywhere probabilities drive decisions. A loan approval model that says “70% chance of repayment” needs that number to mean something. Overfitting to training data is how banks end up with models that look great in backtests and fail in production.

I’ve built similar systems for other domains: sentiment-based trading signals, glycemic response prediction, portfolio optimization. The ML fundamentals transfer. What changes is the domain knowledge needed to avoid the obvious mistakes, like training on data that wouldn’t have been available at prediction time, or trusting metrics that don’t reflect real-world performance.

65% of Hacker News Posts Have Negative Sentiment, and They Outperform

Posts with negative sentiment average 35.6 points on Hacker News. The overall average is 28 points. That’s a 27% performance premium for negativity.

Distribution of sentiment scores across 32,000 Hacker News posts

This finding comes from an empirical study I’ve been running on HN attention dynamics, covering decay curves, preferential attachment, survival probability, and early-engagement prediction. The preprint is available on SSRN. I already had a gut feeling. Across 32,000 posts and 340,000 comments, nearly 65% register as negative. This might be a feature of my classifier being miscalibrated toward negativity; yet the pattern holds across six different models.

Sentiment distribution comparison across DistilBERT, BERT Multi, RoBERTa, Llama 3.1 8B, Mistral 3.1 24B, and Gemma 3 12B

I tested three transformer-based classifiers (DistilBERT, BERT Multi, RoBERTa) and three LLMs (Llama 3.1 8B, Mistral 3.1 24B, Gemma 3 12B). The distributions vary, but the negative skew persists across all of them (inverted scale for models 2-6). The results I use in my dashboard are from DistilBERT because it runs efficiently in my Cloudflare-based pipeline.

What counts as “negative” here? Criticism of technology, skepticism toward announcements, complaints about industry practices, frustration with APIs. The usual. It’s worth noting that technical critique reads differently than personal attacks; most HN negativity is substantive rather than toxic. But does negativity cause engagement, or does controversial content attract both negative framing and attention? Probably some of both.


Related to this, I also came across the Show HN “22GB of Hacker News in SQLite, served via WASM shards”. I downloaded the HackerBook export and ran a subset of my paper’s analytics on it.

Caveat: HackerBook is a single static snapshot with no time-series data, so lifecycle analysis, early-velocity prediction, and decay fitting were off the table. What can be computed: distributional statistics, inequality metrics, and circadian patterns.

Summary statistics

Score distribution (CCDF + power-law fit)
Score distribution (CCDF) with power-law fit on HackerBook shard sample

Attention inequality (Lorenz curve + Gini)
Lorenz curve of story scores (attention inequality) with sample Gini

Circadian patterns (volume vs mean score, UTC)
Circadian patterns on Hacker News (UTC): posting volume vs mean score

Score vs direct comments (proxy)
Score vs direct comments (proxy from reply edges), log-log scatter

Direct comments distribution (CCDF, proxy)
Direct comments distribution (proxy) shown as CCDF

Mean score vs direct comments (binned, proxy)
Mean score vs direct comments (proxy), binned in log-spaced buckets

RSS Swipr: Find Blogs Like You Find Your Dates

GIF with interactive demo of the RSS Tinder App

Algorithmic timelines are everywhere now. But I still prefer the control of RSS. Readers are good at aggregating content but bad at filtering it. What I wanted was something borrowed from dating apps: instead of an infinite list, give me cards. Swipe right to like, left to dislike. Then train a model to surface what I actually want to read. So I built RSS Swipr.

The frontend is vanilla JavaScript—no React, no build steps, just DOM manipulation and CSS transitions. You drag a card, it follows your finger, and snaps away with a satisfying animation. Behind the scenes, the app tracks everything: votes (like/neutral/dislike), time spent viewing each card, and whether you actually opened the link. If I swipe right but don’t click through, that’s a signal. If I spend 0.3 seconds on a card before swiping left, that’s a signal too.

Feed management interface showing 1084 imported RSS feeds with 9327 total entries

Feed management happens through a simple CSV import. Paste a list of name,url pairs, click refresh, and the fetcher pulls articles with proper HTTP caching (ETag/Last-Modified) to avoid hammering servers. You can use your own feed list or load a predefined one. Thanks to Manuel Moreale, who created blogroll, I was able to get an OPML export and load all curated RSS feeds directly. Something similar works with minifeed or Kagi’s smallweb. Or you use one of the Hacker News RSS feeds. If that feels too adventurous, I created curated feeds for the most popular HN bloggers.

Building the model, I started with XGBoost and some hand-engineered features (title length, word count, time of day, feed source). Decent—around 66% ROC-AUC. It learned that I dislike short, clickbaity titles. But it didn’t understand context.

The upgrade was MPNet (all-mpnet-base-v2 from sentence-transformers) to generate 768-dimensional embeddings for every article’s title and description. Combined with engineered features—feed preferences, temporal patterns, text statistics—this gets fed into a Hybrid Random Forest.

import numpy as np

def predict_preference(article):
    # Generate semantic embeddings (768 dims) from title + description
    embeddings = mpnet.encode(f"{article.title} {article.description}")

    # Extract behavioral + text features (feed, timing, text statistics)
    features = feature_pipeline.transform(article)

    # Concatenate into one feature row and predict with the hybrid RF
    X = np.hstack([embeddings, features]).reshape(1, -1)
    return model.predict_proba(X)

Training happens on Google Colab (free T4 GPU or even faster with H100 or A100 on a subscription). Upload your training CSV, run the notebook, download a .pkl file. Google Colab notebook showing model training setup with GPU configuration Google Colab notebook showing model training setup with GPU configuration The notebook handles everything: installing sentence-transformers, downloading the feature engineering pipeline, checking GPU availability, and running 5-fold cross-validation. Training results showing ROC-AUC of 0.7537 across 5-fold cross-validation Training results showing ROC-AUC of 0.7537 across 5-fold cross-validation With ~1400 training samples, the model achieves 75.4% ROC-AUC (± 0.019 std). Not state-of-the-art, but enough to noticeably improve my reading experience. The model now understands that I like systems programming and ML papers, but skip most crypto and generic startup advice.

The problem with transformer models is latency. Generating MPNet embeddings takes ~1 second per article. In a swipe interface, that lag is unbearable. The next best thing is a preload queue. While you’re reading the current card, the backend is scoring and fetching the next 3-5 articles in the background. By the time you swipe, the next card is already waiting.

async loadNextBatch() {
    const excludeIds = this.cardQueue.map(c => c.id).join(',');
    const response = await fetch(`/api/posts/batch?count=3&exclude=${excludeIds}`);
    const data = await response.json();
    this.cardQueue.push(...data.posts);
}

Article selection uses Thompson Sampling: 80% of the time it shows what the model thinks you’ll like (exploit), 20% it throws in something unexpected (explore). This prevents the filter bubble problem and lets the model discover if your tastes have changed.
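The 80/20 selection rule can be sketched as below. Strictly speaking, a fixed exploration rate like this is an ε-greedy policy; full Thompson sampling would sample from a posterior over preferences. The helper and toy model here are illustrative, not the app's actual code.

```python
import random

def pick_next(candidates, model, epsilon: float = 0.2, rng=random):
    """80% exploit (highest predicted preference), 20% explore (random pick)."""
    if rng.random() < epsilon:
        return rng.choice(candidates)   # explore: surface something unexpected
    return max(candidates, key=model)   # exploit: best article per the model

# Toy preference model: prefer longer titles
articles = ["short", "a much longer and more detailed title", "medium title"]
print(pick_next(articles, model=len, rng=random.Random(1)))
```

The exploration share is what keeps the training data from collapsing onto the model's existing preferences.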

The whole system is designed as a closed loop:

  1. Swipe → votes get stored in SQLite
  2. Export → download training CSV with votes + engagement data
  3. Train → run Colab notebook, get new model
  4. Upload → drag-drop the .pkl file back into the app

Export interface showing 1421 votes with breakdown: 583 likes, 193 neutral, 645 dislikes

The export includes everything the model needs: article text, feed metadata, your votes, link opens, and time spent. You can also import a previous training CSV to restore your voting history on a fresh install—useful if you want to clone the repo on a new machine without losing your data.

Model management interface showing active hybrid_rf model with ROC-AUC 0.7537

Uploaded models show their ROC-AUC score so you can compare performance across training runs. Activate whichever one works best.

Backend: Python, Flask, SQLite
Frontend: Vanilla JS, CSS variables
ML: scikit-learn, XGBoost, sentence-transformers (MPNet)
Training: Google Colab (free GPU tier)

Total infrastructure cost: zero. Everything runs locally. No accounts, no cloud dependencies, no tracking.

git clone https://github.com/philippdubach/rss-swipr.git
cd rss-swipr
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python app.py

The full source and Colab notebook are available on GitHub.

Building a No-Tracking Newsletter from Markdown to Distribution

Screenshot of rendered newsletter showing article preview cards with images and descriptions

Friends have been asking how they can stay up to date with what I’m working on and keep track of the things I read, write, and share. RSS feeds don’t seem to be en vogue anymore, apparently. So I built a mailing list. What else would you do over the Christmas break?

From a previous marketing job I knew Mailchimp. Also, every newsletter I unsubscribe from is Mailchimp. I no longer wish to receive these emails.

Unsubscribe confirmation from Mailchimp newsletters

Or obviously Substack. I read Simon Willison’s newsletter sometimes. And obviously Michael Burry’s $379 Substack. Those are solid options, but I had a clear picture in mind of what I wanted: only HTML, no tracking (also why I use GoatCounter on my site and not Google Analytics), and full control of the creation and distribution chain from end to end. So I sat down and sketched in my notebook, as I always do when I have an idea after a long walk or a hot shower.

Hand-drawn notebook sketch of newsletter architecture showing markdown to HTML to distribution flow

I then went over to Illustrator (actually Affinity Designer, which I have been happily using since my Creative Cloud subscription ran out, sorry Adobe) and built a quick mockup of my drawing. I fed the mockup to Claude to generate pure HTML. After a few iterations it more or less looked the way I wanted.

The architecture: write the newsletter in Markdown (as I do for all of my blog). Render it as HTML. Fetch OpenGraph images from my Cloudflare CDN at the lowest feasible resolution and pull descriptions automatically. Format links with preview cards. Keep some space for freetext at the top and bottom.

Flowchart showing newsletter pipeline: Write Markdown, Render HTML, Host on R2, Fetch KV for subscribers, Send via Resend API

I built a Python engine that renders my .md files to email-safe HTML. The script handles several things automatically: (1) It fetches OpenGraph metadata for every link using Beautiful Soup, caching results to avoid repeated requests. (2) It optimizes images using Cloudflare’s image transformation service; for email, I use 240px width (2x the display size of 120px for retina displays). (3) It generates LinkedIn-style preview cards with images on the left and text on the right. The output is table-based HTML because email clients from 2003 still exist and they’re apparently immortal.

Screenshot of rendered newsletter showing article preview cards with images and descriptions

Originally I intended to manually copy-paste the HTML into an email and send it out, since I did not expect many subscribers at first (or at all). But I had another challenge at hand: how do people sign up?
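The OpenGraph extraction step can be sketched with the standard library alone (my script uses Beautiful Soup; the stdlib `html.parser` version below is an equivalent, dependency-free illustration of the same idea):

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect og:* meta tags (title, description, image) from an HTML page."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        if prop.startswith("og:") and "content" in attrs:
            self.og[prop[3:]] = attrs["content"]  # "og:title" -> "title"

html = """<html><head>
<meta property="og:title" content="My Post">
<meta property="og:image" content="https://cdn.example.com/img.png">
<meta property="og:description" content="A short summary.">
</head><body></body></html>"""

parser = OpenGraphParser()
parser.feed(html)
print(parser.og["title"])  # My Post
```

In the real pipeline the fetched metadata is cached per URL, so re-rendering a newsletter never re-requests pages it has already seen.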

Since I had already been using Cloudflare Workers KV to build an API with historic values of my temperature and humidity sensor at home, I resorted to that. The API is simple. POST to /api/subscribe with an email address, and it gets stored in KV with a timestamp and some metadata.

After some Copilot iterations (I’m not a security guy, so I’m not sure how I feel about handing all the security and testing to an agent; please reach out if you can help), the Worker includes rate limiting, honeypot fields for spam protection, proper CORS headers, and RFC-compliant email validation.

I then wanted to get a confirmation email every time someone signed up. Since SMTP sending over my domain did not work reliably at first, I had to look for other options. Even though I wanted everything self-hosted, I ended up using the Resend API. The API is straightforward:

async function sendWelcomeEmail(subscriberEmail: string, env: Env) {
    const response = await fetch('https://api.resend.com/emails', {
        method: 'POST',
        headers: {
            'Authorization': `Bearer ${env.RESEND_API_KEY}`,
            'Content-Type': 'application/json',
        },
        body: JSON.stringify({
            from: 'Philipp Dubach <[email protected]>',
            to: [subscriberEmail],
            subject: 'Welcome to the Newsletter',
            html: `<p>Thanks for subscribing!</p>`,
        }),
    });
    return response.ok;
}

After implementing this, I figured: why not send a confirmation to the subscriber and a copy to me? Why not use Resend for the whole distribution? (This is not a paid advertisement.) The HTML newsletter I generate goes straight into the email body. No images hosted elsewhere (except for the optimized preview thumbnails). No tracking pixels. No click tracking. The email is just HTML.

I also looked at Mailgun and SendGrid before settling on Resend. Mailgun has better deliverability monitoring but a more complex API. SendGrid has more features but felt overengineered for what I needed. Resend’s free tier and simple API won. If you have strong opinions on email APIs, I’m curious to hear them.

The total cost of running this: zero. Cloudflare Workers has a generous free tier. Cloudflare R2 (where the HTML newsletters are hosted) has 10GB free storage. Resend gives 3,000 emails per month. The Python script runs locally or on my Azure instance.

You can find my first newsletter here. The full code for both the newsletter generator and the subscriber API is on GitHub. Needless to say, I would be delighted if we keep in touch through my mailing list:

Deploying to Production with AI Agents: Testing Cursor on Azure

I’ve been curious about Cursor’s capabilities for a while, but never had a good reason to try it. This weekend I decided to host my own URL shortener and deployed YOURLS, a free and open-source link shortener, on a fresh Azure VM. It seemed like a solid test case since it involves SSH access, server configuration, database setup, and SSL certificates. If an AI assistant could handle that end-to-end, it would be genuinely useful.

I was honestly surprised. Cursor didn’t just write commands; it connected via SSH, navigated the server, installed dependencies, configured Apache virtual hosts, set up MySQL, and handled the SSL certificate setup. It made sensible decisions about file permissions, security settings, and configuration details. When I asked for a custom YOURLS plugin to add date prefixes to short URLs, it built it on the first try. The whole build and deployment took about 15 minutes, something that previously took me at least an hour of manual work and troubleshooting.

The URL shortener is now live and working. You can find this article at pdub.click/2511308. I made the full scrubbed transcript available if you want to see exactly how Cursor handled each step. If you want to do this installation yourself, I wrote a step-by-step tutorial covering the entire process, or you might as well let Cursor do it.

Right after finishing, I closed my laptop and went to clean my bathroom. This reminded me of Joanna Maciejewska’s quote:

I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do laundry and dishes.

Visualizing Gradients with PyTorch

Gradients are one of the most important concepts in calculus and machine learning, but they’re often poorly understood. Trying to understand them better myself, I wanted to build a visualization tool that helps me develop the correct mental picture of what the gradient of a function is. I came across GistNoesis/VisualizeGradient, so I went on from there to write my own iteration. This mental model generalizes beautifully to higher dimensions and is the foundation for understanding optimization algorithms like gradient descent.

2D Gradient Plot: The colored surface shows function values. Black arrows show gradient vectors in the input plane (x-y space), pointing toward the direction of steepest ascent.
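The core computation behind such a plot reduces to a few lines of autograd. A minimal sketch (not the repo's exact code) using an assumed example surface f(x, y) = x² + y², whose gradient is (2x, 2y):

```python
import torch

def gradient_at(f, x: float, y: float):
    """Evaluate the gradient of f at the point (x, y) via autograd."""
    xt = torch.tensor(x, requires_grad=True)
    yt = torch.tensor(y, requires_grad=True)
    f(xt, yt).backward()          # populates .grad on the leaf tensors
    return xt.grad.item(), yt.grad.item()

f = lambda x, y: x**2 + y**2
gx, gy = gradient_at(f, 1.0, 2.0)
print(gx, gy)  # 2.0 4.0
```

Evaluating this over a grid of (x, y) points and drawing each (gx, gy) as an arrow in the input plane yields exactly the quiver plot described in the figure.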

If you are interested in having a closer look or replicating my approach, the full project can be found on my GitHub. I’m also looking forward to doing something similar on the Central Limit Theorem, as well as a short tutorial on plotting options volatility surfaces with Python, a project I have been waiting to finish for some time now.

Counting Cards with Computer Vision

After installing Claude Code

the agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster through natural language commands

I was looking for a task to test its abilities. Fairly quickly we wrote fewer than 200 lines of Python code predicting blackjack odds using Monte Carlo simulation. When I went on to test this little tool on the Washington Post's online blackjack (I didn't know that existed either!) I quickly noticed how impractical it was to manually input all the card values on the table. What if the tool could also handle blackjack card detection automatically and calculate the odds from it? I had never done anything with computer vision, so this seemed like a good challenge.

To get to any reasonable result we have to start with classification, where we "teach" the model to categorize data by showing it lots of examples with correct labels. But where do the labels come from? I manually annotated 409 playing cards across 117 images using Roboflow Annotate (at first I only did half as much; why this wasn't a good idea we'll see in a minute). Once enough screenshots of cards were annotated, we can train the model to recognize the cards and predict card values on tables it has never seen before. I was able to use an NVIDIA T4 GPU inside Google Colab, which offers some GPU time for free when capacity is available.

During training, the algorithm learns patterns from this example data, adjusting its internal parameters millions of times until it gets really good at recognizing the differences between categories (in this case, different cards). Once trained, the model can make predictions on new, unseen data by applying the patterns it learned. With the annotated dataset ready, it was time to implement the actual computer vision model. I chose to run inference on Ultralytics' YOLOv11 pre-trained model, a leading object detection algorithm. I set up the environment in Google Colab following the "How to Train YOLO11 Object Detection on a Custom Dataset" notebook.
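Before moving on to training, the odds side of the tool deserves a sketch. The Monte Carlo idea is simple: simulate many random draws and count outcomes. Here is a hedged illustration using an infinite-deck approximation (each rank equally likely on every draw); it is not the actual ~200-line script:

```python
import random

CARD_VALUES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]  # J/Q/K = 10, A = 11

def bust_probability(hand_total, trials=100_000, rng=None):
    """Estimate P(bust) when hitting once on a hard total,
    using an infinite-deck approximation (each rank equally likely)."""
    rng = rng or random.Random(0)
    busts = 0
    for _ in range(trials):
        card = rng.choice(CARD_VALUES)
        if card == 11 and hand_total + 11 > 21:
            card = 1  # an ace counts as 1 when 11 would bust
        if hand_total + card > 21:
            busts += 1
    return busts / trials

print(f"P(bust | hit on hard 16) ≈ {bust_probability(16):.3f}")  # ~0.615 (8/13)
```

In the real game the deck composition shifts as cards are dealt, which is exactly why feeding the simulation the cards already on the table (via computer vision) matters.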
After extracting the annotated dataset from Roboflow, I began training the model using the pre-trained YOLOv11s weights as a starting point. This approach, called transfer learning, allows the model to reuse patterns already learned from millions of general images and adapt them to this specific task. I initially set it up to run for 350 epochs, though the model's built-in early stopping mechanism kicked in after 242 epochs when no improvement had been observed for 100 consecutive epochs. The best results were achieved at epoch 142, taking around 13 minutes to complete on the Tesla T4 GPU.

The initial results were quite promising, with an overall mean Average Precision (mAP) of 80.5% at IoU threshold 0.5. Most individual card classes achieved good precision and recall scores, with only a few cards like the 6 and Queen showing slightly lower precision values. Training results showing confusion matrix and loss curves

However, the confusion matrix and loss curves revealed some interesting patterns. While the model was learning effectively (as shown by the steadily decreasing loss), there were still some misclassifications between similar cards, particularly among the numbered cards. This highlighted exactly why annotating only half the amount of data initially "wasn't a good idea": more training examples would likely improve these edge cases and reduce confusion between similar-looking cards. My first attempt at solving the remaining accuracy issues was to add another layer to the workflow by sending the detected cards to Anthropic's Claude API for additional OCR processing.
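The patience-based early stopping described above (stop once the validation score has not improved for a fixed number of epochs, keep the best epoch's weights) can be sketched in a few lines. This is a generic illustration of the mechanism, not Ultralytics' actual implementation:

```python
def train_with_early_stopping(val_scores, patience=100):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch
    validation scores, stopping after `patience` epochs without improvement."""
    best_score, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch     # stopped early
    return best_epoch, len(val_scores)   # ran to completion

# Toy run: score improves until epoch 5, then plateaus; patience of 3
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(train_with_early_stopping(scores, patience=3))  # (5, 8)
```

With patience=100 this is the same pattern as the run above: best weights at epoch 142, training halted at epoch 242.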
Roboflow workflow with Claude API integration

This hybrid approach was very effective: combining YOLO's object detection, which dynamically crops the blackjack table down to individual cards, with Claude's advanced vision capabilities yielded 99.9% accuracy on the predicted cards. However, this solution came with a significant drawback: the additional API layer and the large model's processing overhead consumed valuable time, making it impractical for real-time gameplay. Seeking a faster solution, I implemented the same workflow locally using EasyOCR instead. EasyOCR seems to be really good at extracting black text on a white background but can struggle with everything else. While it correctly identified the card numbers when it detected them, it failed to recognize around half of the cards in the first place, even when fed pre-cropped card images directly from the YOLO model. This inconsistency made it unreliable for the application.

Rather than continue with band-aid solutions, I decided to go back and improve my dataset. I doubled the training data by adding another 60 screenshots with the same train/test split as before. More importantly, I went through all the previous annotations and fixed many of the bounding polygons. I noticed that several misidentifications were caused by the model detecting face-down dealer cards as valid cards, which happened because some annotations for face-up cards inadvertently included parts of the card backs next to them. The improved dataset and cleaned annotations delivered what I was hoping for: the confusion matrix now shows a much cleaner diagonal pattern, indicating that the model correctly identifies most cards without the cross-contamination issues we saw earlier.
Final training results with improved dataset

Both the training and validation losses converge smoothly without signs of overfitting, while the precision and recall metrics climb steadily to plateau near perfect scores. The mAP@50 reaches an impressive 99.5%. Most significantly, the confusion matrix shows that the model has virtually eliminated false positives with background elements. The "background" column (rightmost) in the confusion matrix is now much cleaner, with only minimal misclassifications of actual cards as background noise. Real-time blackjack card detection and odds calculation

With the model trained and performing well, it was time to deploy it and play some blackjack. Initially, I tested the system using Roboflow's hosted API, which took around 4 seconds per inference: far too slow for practical gameplay. However, running the model locally on my laptop dramatically improved performance, achieving inference times of less than 0.1 seconds per image (1.3 ms preprocess, 45.5 ms inference, 0.4 ms postprocess per image). I then integrated the model with MSS to capture a real-time feed of my browser window. The system automatically overlays the detected cards with their predicted values and confidence scores. The final implementation successfully combines the pieces: the computer vision model detects and identifies cards in real time, feeds this information to the Monte Carlo simulation, and displays both the card recognition results and the calculated odds directly on screen. Do not try this at your local (online) casino!

Modeling Glycemic Response with XGBoost

Earlier this year I wrote about how I built a CGM data reader after wearing a continuous glucose monitor myself. Since I was already logging my macronutrients and learning more about molecular biology in an MIT MOOC, I became curious whether a meal's macronutrients (carbs, protein, fat) and some basic individual characteristics (age, BMI) could serve as features in a machine learning regressor to predict the parameters of the postprandial glucose curve (how my blood sugar levels change after eating). I came across a paper on Personalized Nutrition by Prediction of Glycemic Responses which used machine learning to predict individual glycemic responses from meal data, exactly what I had in mind. Unfortunately, neither the data nor the code were publicly available. And I wanted to predict my own glycemic response curve. So I decided to build my own model. In the process I wrote this working paper. Overview of Working Paper Pages

The paper represents an exercise in applying machine learning techniques to medical applications. The methodologies employed were largely inspired by Zeevi et al.'s approach. I quickly realized that training a model on my own data alone was not very promising, if not impossible. To tackle this, I used the publicly available Hall dataset containing continuous glucose monitoring data from 57 adults, which I narrowed down to 112 standardized meals from 19 non-diabetic subjects with their respective post-meal glucose curves (full methodology in the paper). Overview of the CGM pipeline workflow

Rather than trying to predict the entire glucose curve, I simplified the problem by fitting each postprandial response to a normalized Gaussian function. This gave me three key parameters to predict: amplitude (how high glucose rises), time-to-peak (when it peaks), and curve width (how long the response lasts).
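The curve-fitting step can be sketched with SciPy's `curve_fit`. The data below is synthetic (a made-up response peaking 45 minutes after a meal), not actual CGM measurements, and the parameterization is only one plausible form of a normalized Gaussian:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(t, amplitude, t_peak, width):
    """The three parameters the model later predicts:
    amplitude, time-to-peak, and curve width."""
    return amplitude * np.exp(-((t - t_peak) ** 2) / (2 * width ** 2))

# Synthetic postprandial response: rises ~40 mg/dL, peaks at 45 min
t = np.linspace(0, 150, 31)  # minutes after the meal
noisy = gaussian(t, 40.0, 45.0, 25.0) + np.random.default_rng(0).normal(0, 1.5, t.size)

params, _ = curve_fit(gaussian, t, noisy, p0=[30, 60, 20])
amplitude, t_peak, width = params
print(f"amplitude={amplitude:.1f}, t_peak={t_peak:.1f}, width={width:.1f}")
```

Fitting every post-meal window this way collapses each glucose curve into three numbers, which turns curve prediction into three scalar regression problems.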
Overview of single fitted curve of CGM measurements

The Gaussian approximation worked surprisingly well for characterizing most glucose responses. While some curves fit better than others, the majority of postprandial responses were well captured, though there is clear variation between individuals and meals. Some responses were high-amplitude and narrow, while others were more gradual and prolonged. Overview of selected fitted curves

I then trained an XGBoost regressor with 27 engineered features including meal composition, participant characteristics, and interaction terms. XGBoost was chosen for its ability to handle mixed data types, built-in feature importance, and strong performance on tabular data. The pipeline included hyperparameter tuning with 5-fold cross-validation to optimize learning rate, tree depth, and regularization parameters. Rather than relying solely on basic meal macronutrients, I engineered features across multiple categories and implemented CGM statistical features calculated over different time windows (24-hour and 4-hour periods), including time-in-range and glucose variability metrics. Architecture-wise, I trained three separate XGBoost regressors, one for each Gaussian parameter.
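To illustrate what such engineered features and interaction terms look like, here is a hypothetical sketch; the feature names are illustrative, not the paper's exact set of 27:

```python
def engineer_features(meal, person):
    """Expand raw meal macros and participant data into model features,
    including the kind of interaction terms fed to the regressors."""
    carbs, protein, fat = meal["carbs"], meal["protein"], meal["fat"]
    total = carbs + protein + fat or 1  # guard against an empty meal
    return {
        **meal, **person,
        "total_macros": total,
        "carb_fraction": carbs / total,         # composition ratio
        "protein_to_carb": protein / max(carbs, 1),
        "carbs_x_bmi": carbs * person["bmi"],   # interaction terms
        "carbs_x_age": carbs * person["age"],
    }

features = engineer_features({"carbs": 60, "protein": 20, "fat": 10},
                             {"age": 30, "bmi": 22.5})
print(round(features["carb_fraction"], 3))  # 0.667
```

Interaction terms like `carbs_x_bmi` let a tree-based model pick up on the idea that the same meal can produce different responses in different bodies.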

While the model achieved moderate success predicting amplitude (R² = 0.46), it completely failed at predicting timing - time-to-peak prediction was essentially random (R² = -0.76), and curve width prediction was barely better (R² = 0.10). Even the amplitude prediction, while statistically significant, falls well short of an R² > 0.7. Studies that have achieved better predictive performance typically used much larger datasets (>1000 participants). For my original goal of predicting my own glycemic responses, this suggests that either individual-specific models trained on extensive personal data, or much more sophisticated approaches incorporating larger training datasets, would be necessary.
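For readers puzzled by a negative R²: the coefficient of determination compares the model's residuals against a mean-only baseline, so a model that predicts worse than the mean scores below zero. A minimal stdlib implementation makes this concrete:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Negative values mean the model is worse than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

y = [10, 20, 30, 40]
print(r_squared(y, y))                 # 1.0  (perfect predictions)
print(r_squared(y, [25, 25, 25, 25]))  # 0.0  (no better than the mean)
print(r_squared(y, [40, 30, 20, 10]))  # -3.0 (worse than the mean)
```

An R² of -0.76 for time-to-peak therefore means the model's timing predictions were noticeably worse than just guessing the average peak time for every meal.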

The complete code, Jupyter notebooks, processed datasets, and supplementary results are available in my GitHub repository.

(10/06/2025) Update: Today I came across Marcel Salathé’s LinkedIn post on a publication out of EPFL: Personalized glucose prediction using in situ data only.

With data from over 1,000 participants of the Food & You digital cohort, we show that a machine learning model using only food data from myFoodRepo and a glucose monitor can closely track real blood sugar responses to any meal (correlation of 0.71).

As expected, Singh et al. achieve substantially better predictive performance (their reported correlation of R = 0.71 versus my R² = 0.46, though the two metrics are not directly comparable). Besides likely higher methodological rigor and scientific quality, the most critical difference is sample size: their 1'000+ participants versus my 19 participants (from the Hall dataset) represent a fundamental difference in statistical power and generalizability. They addressed one of the shortcomings I faced by leveraging a large digital nutritional cohort from the "Food & You" study (including high-resolution data on nutritional intake of more than 46 million kcal collected from 315'126 dishes over 23'335 participant days, 1'470'030 blood glucose measurements, 49'110 survey responses, and 1'024 samples for gut microbiota analysis).

Apart from that, I was excited to observe, at first glance, the following similarities: (1) Both aim to predict postprandial glycemic responses using machine learning, with a focus on personalized nutrition applications. (2) Both employ XGBoost regression as the primary predictive algorithm and use similar performance metrics (R², RMSE, MAE, Pearson correlation). (3) Both extract comprehensive feature sets including meal composition (macronutrients), temporal features, and individual characteristics. (4) Both use mathematical approaches to characterize glucose responses: I used Gaussian curve fitting, while Singh et al. use incremental area under the curve (iAUC). (5) Both employ cross-validation techniques for model evaluation and hyperparameter tuning. (6) Both use SHAP for model interpretability and feature importance analysis.

Trading on Market Sentiment

This post is based in part on a 2022 presentation I gave for the ICBS Student Investment Fund and my seminar work at Imperial College London.

As we were looking for new investment strategies for our Macro Sentiment Trading team, OpenAI had just published their GPT-3.5 model. After first experiments with the model, we asked ourselves: how would large language models like GPT-3.5 perform at predicting sentiment in financial markets, where the signal-to-noise ratio is notoriously low? And could they potentially even outperform industry benchmarks at interpreting market sentiment from news headlines?

The idea wasn't entirely new. Studies [2] [3] have shown that investor sentiment, extracted from news and social media, can forecast market movements. But most approaches rely on traditional NLP models or proprietary systems like RavenPack. With the recent advances in large language models, I wanted to test whether these more sophisticated models could provide a competitive edge in sentiment-based trading.

Before looking at model selection, it's worth understanding what makes trading on sentiment so challenging. News headlines present two fundamental problems that any robust system must address. Relative frequency of monthly Google News search terms over 5 years. Numbers represent search interest relative to the highest point; a value of 100 is the peak popularity for the term.

First, headlines are inherently non-stationary. Unlike other data sources, news reflects the constantly shifting landscape of global events, political climates, economic trends, and so on. A model trained on COVID-19 vaccine headlines from 2020 might struggle with geopolitical tensions in 2023. This temporal drift means algorithms must be adaptive to maintain relevance.
Impact of headlines measured by subsequent index move (Data Source: Bloomberg)

Second, the relationship between headlines and market impact is far from obvious. Consider these actual headlines from November 2020: "Pfizer Vaccine Prevents 90% of COVID Infections" drove the S&P 500 up 1.85%, while "Pfizer Says Safety Milestone Achieved" barely moved the market at -0.05%. The same company, similar positive news, dramatically different market reactions.

When developing a sentiment-based trading system, you essentially have two conceptual approaches: forward-looking and backward-looking. Forward-looking models try to predict which news themes will drive markets, often working qualitatively by creating logical frameworks that capture market expectations. This approach is highly adaptable but requires deep domain knowledge and is time-consuming to maintain. Backward-looking models analyze historical data to understand which headlines have moved markets in the past, then look for similarities in current news. This approach can leverage large datasets and scale efficiently, but suffers from low signal-to-noise ratios and the challenge that past relationships may not hold in the future. For this project, I chose the backward-looking approach, primarily for its scalability and ability to work with existing datasets.

Rather than rely on traditional approaches like FinBERT (which only provides discrete positive/neutral/negative classifications), I decided to test OpenAI's GPT-3.5 Turbo model. The key advantage was its ability to provide continuous sentiment scores from -1 to 1, giving much more nuanced signals for trading decisions. I used news headlines from the Dow Jones Newswire covering the 30 DJI companies from 2018-2022, filtering for quality sources like the Wall Street Journal and Bloomberg. After removing duplicates, this yielded 2,072 headlines. I then prompted GPT-3.5 to score sentiment with the instruction: Rate the sentiment of the following news headlines from -1 (very bad) to 1 (very good), with two decimal precision.

To validate the approach, I compared GPT-3.5 scores against RavenPack, the industry's leading commercial sentiment provider. Sample entries of the combined data set. The correlation was 0.59, indicating the models generally agreed on sentiment direction while providing different granularities of scoring. More interesting was comparing the distributions of the sentiment ratings between the two models; these could have been matched more closely through some fine-tuning of the (minimal) prompt used earlier. Comparing the distribution of the sentiment scores generated using the GPT-3.5 model with the benchmark scores from RavenPack.

I implemented a simple strategy: go long when sentiment hits the top 5% of scores, close positions at 25% profit (to reduce transaction costs), and maintain a fully invested portfolio with 1% commission per trade. The results were mixed but promising. Over the full 2018-2022 period, the GPT-3.5 strategy generated 41.02% returns compared to RavenPack's 40.99%, essentially matching the industry benchmark.
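A toy, single-asset version of these trading rules might look like the following. The real backtest used cross-sectional top-5% scores across a fully invested portfolio, so this is only an illustration of the entry/exit mechanics, with a fixed sentiment threshold standing in for the percentile cutoff:

```python
def backtest(prices, scores, threshold=0.9, take_profit=1.25, fee=0.01):
    """Toy version of the rule set: buy when sentiment exceeds the
    threshold, sell once the position is up 25%, pay 1% per trade."""
    cash, entry = 1.0, None
    for price, score in zip(prices, scores):
        if entry is None and score >= threshold:
            entry = price
            cash *= (1 - fee)                  # commission on entry
        elif entry is not None and price >= entry * take_profit:
            cash *= price / entry * (1 - fee)  # realize gain, pay exit fee
            entry = None
    if entry is not None:                      # mark any open position to market
        cash *= prices[-1] / entry
    return cash - 1.0                          # total return

prices = [100, 102, 105, 120, 130, 128]
scores = [0.2, 0.95, 0.1, 0.3, 0.4, 0.1]
print(f"{backtest(prices, scores):+.2%}")
```

Even this toy version shows why commissions matter: a 25% gross gain shrinks noticeably after 1% is paid on both legs of the trade.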
However, both underperformed a simple buy-and-hold approach (58.13%) during this generally bullish period. Relying on market sentiment when news flow is low can be a tricky strategy. As the example of the Salesforce stock performance shows, the strategy remained uninvested over long stretches due to a (sometimes long-lasting) negative sentiment signal. Stock performance of Salesforce (CRM) for 5 years from 2018 with sentiment indicators overlaid.

When I tested different timeframes, the sentiment strategy showed its strength during volatile periods. From 2020-2022, it outperformed buy-and-hold (22.83% vs 21.00%). As expected, sentiment-based approaches work better when markets are less directional and more driven by news flow. To evaluate whether the scores generated by our GPT prompt were more accurate than those from the RavenPack benchmark, I calculated returns for different holding windows. The scores generated by our GPT prompt perform significantly better in the short term (1 and 10 days) for positive sentiment and in the long term (90 days) for negative sentiment. Average 1, 10, 30, and 90-day holding period returns for both models. (Note: for lower sentiment, negative returns are desirable since the stock would be shorted.)

While the model performed well technically, this project highlighted several practical challenges. First, data accessibility remains a major hurdle: getting real-time, high-quality news feeds is expensive and often restricted. Second, the strategy worked better in a more volatile environment, which prompted many individual trades, creating substantial transaction costs that significantly impact returns. Perhaps most importantly, any real-world implementation would need to compete with high-frequency traders who can act on news within milliseconds; the few seconds GPT-3.5 needs to process headlines and generate sentiment scores are far from competitive.

Despite these challenges, the project demonstrated that LLMs can match industry benchmarks for sentiment analysis, and this was with a general-purpose model, not one specifically fine-tuned for financial applications. OpenAI (and others) today offer more powerful models at very low cost, as well as fine-tuning capabilities that could further improve performance. The bigger opportunity might lie in combining sentiment signals with other factors, using sentiment as one input in a more sophisticated trading system rather than the sole decision criterion. There is also potential in expanding beyond simple long-only strategies to include short positions on negative sentiment, or in developing "sentiment indices" that smooth out individual headline noise. Market sentiment strategies may not be optimal for long-term investing, but they show clear promise for shorter-term trading in volatile environments. As LLMs continue to improve and become more accessible, it might be worth revisiting this project.