> the agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster through natural language commands
I was looking for a task to test its abilities. Fairly quickly we wrote less than 200 lines of Python code predicting blackjack odds using a Monte Carlo simulation. When I went on to test this little tool on Washington Post's online blackjack (I also didn't know that existed!), I quickly noticed how impractical it was to manually input all the card values on the table. What if the tool could also automatically recognize the cards on the table and calculate the odds from them? I had never done anything with computer vision, so this seemed like a good challenge.
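For illustration, here is a minimal sketch of what such a Monte Carlo estimate can look like. The simplified rules (infinite deck, dealer hits below 17, no splits or doubles) and the function names are assumptions for this post, not the actual ~200-line tool.

```python
import random

CARDS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]  # 11 = ace

def hand_value(cards):
    total, aces = sum(cards), cards.count(11)
    while total > 21 and aces:      # downgrade aces from 11 to 1 as needed
        total, aces = total - 10, aces - 1
    return total

def win_prob_standing(player_cards, dealer_upcard, n_sims=100_000):
    """Estimate win/tie probability if the player stands on the current hand."""
    wins = ties = 0
    player = hand_value(player_cards)
    for _ in range(n_sims):
        dealer = [dealer_upcard, random.choice(CARDS)]
        while hand_value(dealer) < 17:          # dealer hits below 17
            dealer.append(random.choice(CARDS))
        d = hand_value(dealer)
        if d > 21 or player > d:
            wins += 1
        elif player == d:
            ties += 1
    return wins / n_sims, ties / n_sims

print(win_prob_standing([10, 8], 6))  # e.g. player 18 vs. dealer upcard 6
```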
To get to any reasonable result, we have to start with classification, where we "teach" the model to categorize data by showing it lots of examples with correct labels. But where do the labels come from? I manually annotated 409 playing cards across 117 images using Roboflow Annotate (at first I only did half as much - why this wasn't a good idea we'll see in a minute). Once enough screenshots of cards were annotated, the model could be trained to recognize the cards and predict card values on tables it has never seen before. I was able to use an NVIDIA T4 GPU inside Google Colab, which offers some GPU time for free when capacity is available.
During training, the algorithm learns patterns from this example data, adjusting its internal parameters millions of times until it gets really good at recognizing the differences between categories (in this case, different cards). Once trained, the model can then make predictions on new, unseen data by applying the patterns it learned. With the annotated dataset ready, it was time to implement the actual computer vision model. I chose to build on Ultralytics' pre-trained YOLOv11 model, a state-of-the-art object detection architecture. I set up the environment in Google Colab following the "How to Train YOLO11 Object Detection on a Custom Dataset" notebook. After extracting the annotated dataset from Roboflow, I began training the model using the pre-trained YOLOv11s weights as a starting point. This approach, called transfer learning, allows the model to leverage patterns already learned from millions of general images and adapt them to this specific task.
I initially set it up to run for 350 epochs, though the model’s built-in early stopping mechanism kicked in after 242 epochs when no improvement was observed for 100 consecutive epochs. The best results were achieved at epoch 142, taking around 13 minutes to complete on the Tesla T4 GPU.
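With the Ultralytics API, that training run boils down to a few lines. The sketch below uses the epoch and patience settings described above; the dataset path is a placeholder for the Roboflow export, and the remaining arguments are assumptions.

```python
from ultralytics import YOLO

# Start from the pre-trained YOLOv11s weights (transfer learning)
# and fine-tune on the annotated card dataset.
model = YOLO("yolo11s.pt")
results = model.train(
    data="playing-cards/data.yaml",  # Roboflow export (placeholder path)
    epochs=350,                      # upper bound; early stopping applies
    patience=100,                    # stop after 100 epochs without improvement
    imgsz=640,
    device=0,                        # the Tesla T4 in Colab
)
metrics = model.val()                # mAP@50, per-class precision and recall
```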
The initial results were quite promising, with an overall mean Average Precision (mAP) of 80.5% at IoU threshold 0.5. Most individual card classes achieved good precision and recall scores, with only a few cards like the 6 and Queen showing slightly lower precision values.
However, looking at the confusion matrix and loss curves revealed some interesting patterns. While the model was learning effectively (as shown by the steadily decreasing loss), there were still some misclassifications between similar cards, particularly among the numbered cards. This highlighted exactly why I mentioned earlier that annotating only half the amount of data initially “wasn’t a good idea” - more training examples would likely improve these edge cases and reduce confusion between similar-looking cards. My first attempt at solving the remaining accuracy issues was to add another layer to the workflow by sending the detected cards to Anthropic’s Claude API for additional OCR processing.
This hybrid approach was very effective: combining YOLO's object detection, which dynamically crops the blackjack table down to individual cards, with Claude's advanced vision capabilities yielded 99.9% accuracy on the predicted cards. However, this solution came with a significant drawback: the additional API layer and the large model's processing overhead consumed valuable time, making it impractical for real-time gameplay.
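Conceptually, the hybrid step looked roughly like the sketch below: each YOLO crop is sent to the Anthropic Messages API and the model is asked for the card rank. The model name and the prompt are illustrative, not the exact ones I used.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def card_rank_via_claude(crop_png_bytes, model="claude-3-5-sonnet-latest"):
    """Send one YOLO-cropped card image to Claude and ask for its rank.
    Model name and prompt are illustrative placeholders."""
    response = client.messages.create(
        model=model,
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(crop_png_bytes).decode()}},
                {"type": "text",
                 "text": "Which playing card rank is shown? Answer with one of: A, 2-10, J, Q, K."},
            ],
        }],
    )
    return response.content[0].text.strip()
```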
Seeking a faster solution, I implemented the same workflow locally using EasyOCR instead. EasyOCR seems to be really good at extracting black text on a white background but can struggle with everything else. While it correctly identified card numbers when it detected them, it failed to recognize around half of the cards in the first place - even when fed pre-cropped card images directly from the YOLO model. This inconsistency made it unreliable for the application.
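The local attempt was essentially a drop-in replacement of the OCR step with EasyOCR. A sketch follows, with the caveat that the allowlist and parameters are my assumptions here:

```python
import easyocr

# Downloads the recognition model on first run; English covers card ranks.
reader = easyocr.Reader(["en"], gpu=False)

def read_card_rank(crop_path):
    """Try to read the rank from a pre-cropped card image.
    The allowlist restricts output to characters that can appear on a card."""
    results = reader.readtext(crop_path, allowlist="0123456789AJQK")
    # results is a list of (bbox, text, confidence); often empty on low contrast
    return results[0][1] if results else None
```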
Rather than continuing with band-aid solutions, I decided to go back and improve my dataset. I doubled the training data by adding another 60 screenshots with the same train/test split as before. More importantly, I went through all the previous annotations and fixed many of the bounding polygons. I noticed that several misidentifications were caused by the model detecting face-down dealer cards as valid cards, which happened because some annotations for face-up cards inadvertently included parts of the card backs next to them. The improved dataset and cleaned annotations delivered what I was hoping for: the confusion matrix now shows a much cleaner diagonal pattern, indicating that the model correctly identifies most cards without the cross-contamination issues we saw earlier.
Both the training and validation losses converge smoothly without signs of overfitting, while the precision and recall metrics climb steadily to plateau near perfect scores. The mAP@50 reaches an impressive 99.5%. Most significantly, the confusion matrix now shows that the model has virtually eliminated false positives with background elements. The “background” column (rightmost) in the confusion matrix is now much cleaner, with only minimal misclassifications of actual cards as background noise.
With the model trained and performing well, it was time to deploy it and play some blackjack. Initially, I tested the system using Roboflow's hosted API, which took around 4 seconds per inference - far too slow for practical gameplay. However, running the model locally on my laptop dramatically improved performance, achieving inference times of less than 0.1 seconds per image (1.3 ms preprocess, 45.5 ms inference, 0.4 ms postprocess). I then integrated the model with MSS to capture a real-time feed of my browser window. The system automatically overlays the detected cards with their predicted values and confidence scores.
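The capture loop is conceptually simple. Below is a hedged sketch of grabbing the browser region with MSS and overlaying YOLO detections; the weights path and screen region are placeholders to adjust.

```python
import mss
import numpy as np
import cv2
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # fine-tuned weights (placeholder path)
region = {"top": 100, "left": 100, "width": 1280, "height": 800}  # browser window (adjust)

with mss.mss() as sct:
    while True:
        raw = np.array(sct.grab(region))                 # BGRA screenshot
        frame = cv2.cvtColor(raw, cv2.COLOR_BGRA2BGR)
        results = model.predict(frame, verbose=False)
        annotated = results[0].plot()                    # boxes + labels + confidence
        cv2.imshow("blackjack", annotated)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cv2.destroyAllWindows()
```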
The final implementation successfully combines the pieces: the computer vision model detects and identifies cards in real-time, feeds this information to the Monte Carlo simulation, and displays both the card recognition results and the calculated odds directly on screen - do not try this at your local (online) casino!
Earlier this year I wrote about how I built a CGM data reader after wearing a continuous glucose monitor myself. Since I was already logging my macronutrients and learning more about molecular biology in an MIT MOOC, I became curious whether a meal's macronutrients (carbs, protein, fat) and some basic individual characteristics (age, BMI) could serve as features in a machine learning regressor to predict the parameters of the postprandial glucose curve (how my blood sugar levels change after eating). I came across a paper on Personalized Nutrition by Prediction of Glycemic Responses which did exactly that. Unfortunately, neither the data nor the code were publicly available. And I wanted to predict my own glycemic response curve. So I decided to build my own model. In the process I wrote this working paper.
The paper represents an exercise in applying machine learning techniques to medical applications. The methodologies employed were largely inspired by Zeevi et al.'s approach. I quickly realized that training a model on my own data alone was not very promising, if not impossible. To tackle this, I used the publicly available Hall dataset containing continuous glucose monitoring data from 57 adults, which I narrowed down to 112 standardized meals from 19 non-diabetic subjects with their respective post-meal glucose curves (full methodology in the paper).
Rather than trying to predict the entire glucose curve, I simplified the problem by fitting each postprandial response to a normalized Gaussian function. This gave me three key parameters to predict: amplitude (how high glucose rises), time-to-peak (when it peaks), and curve width (how long the response lasts).
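A sketch of that fitting step with SciPy is below; the baseline handling and the initial guesses are simplifications of what is described in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(t, amplitude, t_peak, width):
    """Gaussian bump above baseline: the three parameters to predict."""
    return amplitude * np.exp(-((t - t_peak) ** 2) / (2 * width ** 2))

def fit_response(t_minutes, glucose, baseline=None):
    """Fit one postprandial window (e.g. 0-180 min after the meal).
    Baseline subtraction and the initial guess are simplifications."""
    baseline = glucose[0] if baseline is None else baseline
    delta = glucose - baseline
    p0 = [delta.max(), t_minutes[delta.argmax()], 30.0]   # rough initial guess
    params, _ = curve_fit(gaussian, t_minutes, delta, p0=p0, maxfev=5000)
    return dict(zip(["amplitude", "time_to_peak", "width"], params))

# Example with synthetic data standing in for a real CGM trace
t = np.arange(0, 180, 5, dtype=float)
y = 90 + 40 * np.exp(-((t - 45) ** 2) / (2 * 25 ** 2)) + np.random.normal(0, 2, t.size)
print(fit_response(t, y))
```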
The Gaussian approximation worked surprisingly well for characterizing most glucose responses. While some curves fit better than others, the majority of postprandial responses were well captured, though there is clear variation between individuals and meals. Some responses were high-amplitude and narrow, while others were more gradual and prolonged.
I then trained an XGBoost regressor with 27 engineered features including meal composition, participant characteristics, and interaction terms. XGBoost was chosen for its ability to handle mixed data types, built-in feature importance, and strong performance on tabular data. The pipeline included hyperparameter tuning with 5-fold cross-validation to optimize learning rate, tree depth, and regularization parameters. Rather than relying solely on basic meal macronutrients, I engineered features across multiple categories and implemented CGM statistical features calculated over different time windows (24-hour and 4-hour periods), including time-in-range and glucose variability metrics. Architecture wise, I trained three separate XGBoost regressors - one for each Gaussian parameter.
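A condensed sketch of that training setup follows. The synthetic placeholder data and the grid values are illustrative, not the exact search space from the working paper.

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic placeholders standing in for the 112 meals x 27 engineered features.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(112, 27)),
                 columns=[f"feat_{i}" for i in range(27)])
y = pd.DataFrame({
    "amplitude": rng.normal(40, 10, 112),
    "time_to_peak": rng.normal(45, 10, 112),
    "width": rng.normal(30, 8, 112),
})

# Illustrative grid over learning rate, tree depth, and regularization.
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 6],
    "reg_lambda": [1.0, 5.0, 10.0],
}

models = {}
for target in y.columns:          # one regressor per Gaussian parameter
    search = GridSearchCV(
        XGBRegressor(objective="reg:squarederror", n_estimators=300),
        param_grid, cv=5, scoring="r2", n_jobs=-1,
    )
    search.fit(X, y[target])
    models[target] = search.best_estimator_
    print(target, search.best_params_)
```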
While the model achieved moderate success predicting amplitude (R² = 0.46), it completely failed at predicting timing - time-to-peak prediction was essentially random (R² = -0.76), and curve width prediction was barely better (R² = 0.10). Even the amplitude prediction, while statistically significant, falls well short of the R² > 0.7 that would indicate strong predictive power. Studies that have achieved better predictive performance typically used much larger datasets (>1,000 participants). For my original goal of predicting my own glycemic responses, this suggests that either individual-specific models trained on extensive personal data or much more sophisticated approaches incorporating larger training datasets would be necessary.
The complete code, Jupyter notebooks, processed datasets, and supplementary results are available in my GitHub repository.
> With data from over 1,000 participants of the Food & You digital cohort, we show that a machine learning model using only food data from myFoodRepo and a glucose monitor can closely track real blood sugar responses to any meal (correlation of 0.71).
As expected, Singh et al. achieve substantially better predictive performance (a correlation of 0.71 vs. my R² of 0.46, though the two metrics are not directly comparable). Besides presumably higher methodological rigor and scientific quality, the most critical difference is sample size: their 1,000+ participants versus my 19 participants (from the Hall dataset) represents a fundamental difference in statistical power and generalizability. They addressed one of the shortcomings I faced by leveraging a large digital nutritional cohort from the "Food & You" study (including high-resolution data on nutritional intake of more than 46 million kcal collected from 315,126 dishes over 23,335 participant days, 1,470,030 blood glucose measurements, 49,110 survey responses, and 1,024 samples for gut microbiota analysis).
Apart from that, I am excited to observe, at first glance, the following similarities:
(1) Both aim to predict postprandial glycemic responses using machine learning, with a focus on personalized nutrition applications.
(2) Both employ XGBoost regression as their primary predictive algorithm and use similar performance metrics (R², RMSE, MAE, Pearson correlation).
(3) Both extract comprehensive feature sets including meal composition (macronutrients), temporal features, and individual characteristics.
(4) Both use mathematical approaches to characterize glucose responses - I used Gaussian curve fitting, while Singh et al. use incremental area under the curve (iAUC).
(5) Both employ cross-validation techniques for model evaluation and hyperparameter tuning.
(6) Both use SHAP for model interpretability and feature importance analysis.
This post is based in part on a 2022 presentation I gave for the ICBS Student Investment Fund and my seminar work at Imperial College London.
As we were looking for new investment strategies for our Macro Sentiment Trading team, OpenAI had just published their GPT-3.5 Model. After first experiments with the model, we asked ourselves: How would large language models like GPT-3.5 perform in predicting sentiment in financial markets, where the signal-to-noise ratio is notoriously low? And could they potentially even outperform industry benchmarks at interpreting market sentiment from news headlines? The idea wasn’t entirely new. Studies[2][3] have shown that investor sentiment, extracted from news and social media, can forecast market movements. But most approaches rely on traditional NLP models or proprietary systems like RavenPack. With the recent advances in large language models, I wanted to test whether these more sophisticated models could provide a competitive edge in sentiment-based trading. Before looking at model selection, it’s worth understanding what makes trading on sentiment so challenging. News headlines present two fundamental problems that any robust system must address.
First, headlines are inherently non-stationary. Unlike other data sources, news reflects the constantly shifting landscape of global events, political climates, economic trends, etc. A model trained on COVID-19 vaccine headlines from 2020 might struggle with geopolitical tensions in 2023. This temporal drift means algorithms must be adaptive to maintain relevance.
Second, the relationship between headlines and market impact is far from obvious. Consider these actual headlines from November 2020: “Pfizer Vaccine Prevents 90% of COVID Infections” drove the S&P 500 up 1.85%, while “Pfizer Says Safety Milestone Achieved” barely moved the market at -0.05%. The same company, similar positive news, dramatically different market reactions.
When developing a sentiment-based trading system, you essentially have two conceptual approaches: forward-looking and backward-looking.
Forward-looking models try to predict which news themes will drive markets, often working qualitatively by creating logical frameworks that capture market expectations. This approach is highly adaptable but requires deep domain knowledge and is time-consuming to maintain.
Backward-looking models analyze historical data to understand which headlines have moved markets in the past, then look for similarities in current news. This approach can leverage large datasets and scale efficiently, but suffers from low signal-to-noise ratios and the challenge that past relationships may not hold in the future.
For this project, I chose the backward-looking approach, primarily for its scalability and ability to work with existing datasets.
Rather than rely on traditional approaches like FinBERT (which only provides discrete positive/neutral/negative classifications), I decided to test OpenAI's GPT-3.5 Turbo model. The key advantage was its ability to provide continuous sentiment scores from -1 to 1, giving much more nuanced signals for trading decisions. I used news headlines from the Dow Jones Newswire covering the 30 DJI companies from 2018-2022, filtering for quality sources like the Wall Street Journal and Bloomberg. After removing duplicates, this yielded 2,072 headlines. I then prompted GPT-3.5 to score sentiment with the instruction: "Rate the sentiment of the following news headlines from -1 (very bad) to 1 (very good), with two decimal precision." To validate the approach, I compared GPT-3.5 scores against RavenPack—the industry's leading commercial sentiment provider.
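For reference, scoring a headline with today's OpenAI Python SDK looks roughly like this; the temperature setting and the "reply with the number only" addition are assumptions on top of the prompt quoted above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Rate the sentiment of the following news headline from -1 (very bad) "
          "to 1 (very good), with two decimal precision. Reply with the number only.")

def score_headline(headline, model="gpt-3.5-turbo"):
    """Return a continuous sentiment score in [-1, 1] for one headline."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": headline},
        ],
    )
    return float(response.choices[0].message.content.strip())

print(score_headline("Pfizer Vaccine Prevents 90% of COVID Infections"))
```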
The correlation was 0.59, indicating the models generally agreed on sentiment direction while providing different granularities of scoring. More interesting was comparing the distributions of the sentiment ratings between the two models. These could likely have been brought closer together with some fine-tuning of the (minimal) prompt used earlier.
I implemented a simple strategy: go long when sentiment hits the top 5% of scores, close positions at 25% profit (to reduce transaction costs), and maintain a fully invested portfolio with 1% commission per trade.
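A stylized version of these rules on a single stock might look like the following sketch; the DataFrame layout (a 'close' and a 'sentiment' column indexed by date) is an assumption, not the actual backtest code.

```python
import pandas as pd

def backtest(df, entry_quantile=0.95, take_profit=0.25, commission=0.01):
    """Stylized long-only backtest: enter on top-5% sentiment, exit at +25%.
    df needs 'close' and 'sentiment' columns indexed by date (assumed layout)."""
    threshold = df["sentiment"].quantile(entry_quantile)
    cash, shares, entry_price = 1.0, 0.0, None
    for _, row in df.iterrows():
        if shares == 0 and row["sentiment"] >= threshold:
            entry_price = row["close"]
            shares = cash * (1 - commission) / entry_price      # enter long
            cash = 0.0
        elif shares > 0 and row["close"] >= entry_price * (1 + take_profit):
            cash = shares * row["close"] * (1 - commission)     # take profit
            shares = 0.0
    final = cash if shares == 0 else shares * df["close"].iloc[-1]
    return final - 1.0      # total return on 1.0 of initial capital
```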
The results were mixed but promising. Over the full 2018-2022 period, the GPT-3.5 strategy generated 41.02% returns compared to RavenPack's 40.99%—essentially matching the industry benchmark. However, both underperformed a simple buy-and-hold approach (58.13%) during this generally bullish period. Relying on market sentiment when news flow is low can be a tricky strategy. As can be seen from the example of the Salesforce stock performance, the strategy remained uninvested over long stretches of time due to a (sometimes long-lasting) negative sentiment signal.
When I tested different timeframes, the sentiment strategy showed its strength during volatile periods. From 2020-2022, it outperformed buy-and-hold (22.83% vs 21.00%). As expected, sentiment-based approaches work better when markets are less directional and more driven by news flow. To evaluate whether the scores generated by our GPT prompt were more accurate than those from the RavenPack benchmark, I calculated returns for different holding windows. The scores generated by our GPT prompt perform significantly better in the short term (1 and 10 days) for positive sentiment and in the long term (90 days) for negative sentiment.
(Note: For lower sentiment, negative returns are desirable since the stock would be shorted)
While the model performed well technically, this project highlighted several practical challenges. First, data accessibility remains a major hurdle: getting real-time, high-quality news feeds is expensive and often restricted. Second, the strategy worked better in a more volatile environment, which prompted many individual trades, creating substantial transaction costs that significantly impacted returns. Perhaps most importantly, any real-world implementation would need to compete with high-frequency traders who can act on news within milliseconds; the few seconds GPT-3.5 needs to process headlines and generate sentiment scores are far from competitive.

Despite these challenges, the project demonstrated that LLMs can match industry benchmarks for sentiment analysis—and this was using a general-purpose model, not one specifically fine-tuned for financial applications. OpenAI (and others) today offer more powerful models at very low cost, as well as fine-tuning capabilities that could further improve performance. The bigger opportunity might be in combining sentiment signals with other factors, using sentiment as one input in a more sophisticated trading system rather than the sole decision criterion. There's also potential in expanding beyond simple long-only strategies to include short positions on negative sentiment, or in developing "sentiment indices" that smooth out individual headline noise.
Market sentiment strategies may not be optimal for long-term investing, but they show clear promise for shorter-term trading in volatile environments. As LLMs continue to improve and become more accessible, this might offer an opportunity to revisit this project.
Last year I put a Continuous Glucose Monitor (CGM) sensor, specifically the Abbott Freestyle Libre 3, on my left arm. Why? I wanted to optimize my nutrition for endurance cycling competitions. Where I live, the sensor is easy to get—without any medical prescription—and even easier to use. Unfortunately, Abbott’s FreeStyle LibreLink app is less than optimal (3,250 other people with an average rating of 2.9/5.0 seem to agree). In their defense, the web app LibreView does offer some nice reports which can be generated as PDFs—not very dynamic, but still something! What I had in mind was more in the fashion of the Ultrahuman M1 dashboard. Unfortunately, I wasn’t allowed to use my Libre sensor (EU firmware) with their app (yes, I spoke to customer service).
At that point, I wasn't left with much enthusiasm, only a coin-sized sensor in my arm. The LibreView website fortunately lets you download most of your (own) data as a CSV report (there is also a reverse-engineered API), which is nice. So that's what I did: download the data, pd.read_csv() it into my notebook, calculate summary statistics, and plot the values.
After some interpolation, I now had the same view the LibreLink app (which I had rejected earlier) provided. Yet this setup allowed me to do further analysis and visualizations by adding other datapoints (workouts, sleep, nutrition) I was also collecting at that time:
Blood sugar from LibreView: Measurement timestamps + glucose values
Nutrition from MacroFactor: Meal timestamps + macronutrients (carbs, protein, and fat)
Sleep data from Sleep Cycle: Sleep start timestamp + time in bed + time asleep (+ sleep quality, which is a proprietary measure calculated by the app)
Cardio workouts from Garmin: Workout start timestamp + workout duration
Strength workouts from Hevy: Workout start timestamp + workout duration
After structuring those datapoints in a dataframe and normalizing timestamps, I was able to quickly highlight sleep (blue boxes with callouts for time in bed, time asleep, and sleep quality) and workouts (red traces on glucose measurements for strength workouts, green traces for cardio workouts) by plotting highlighted traces on top of the historic glucose trail for a set period. Furthermore, I was able to add annotations for nutrition events with the respective macronutrients.
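A condensed sketch of that pipeline is below. The LibreView column names vary by locale and firmware, so treat them as placeholders, and the workout overlay uses a toy example instead of the real Garmin/Hevy exports.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Column names in the LibreView CSV export differ by locale/firmware;
# the ones below are assumptions to be adjusted to your own export.
glucose = pd.read_csv("libreview_export.csv", skiprows=1,
                      parse_dates=["Device Timestamp"], dayfirst=True)
glucose = (glucose.rename(columns={"Device Timestamp": "ts",
                                   "Historic Glucose mmol/L": "mmol"})
                  .set_index("ts")["mmol"]
                  .sort_index()
                  .resample("5min").mean()
                  .interpolate(limit=6))        # fill short sensor gaps only

workouts = pd.DataFrame({          # toy stand-in for the merged lifestyle data
    "start": pd.to_datetime(["2024-03-01 17:30"]),
    "end":   pd.to_datetime(["2024-03-01 18:45"]),
    "kind":  ["cardio"],
})

ax = glucose.plot(figsize=(12, 4), color="grey", label="glucose (mmol/L)")
for _, w in workouts.iterrows():   # shade workout windows on the glucose trail
    ax.axvspan(w["start"], w["end"],
               color="green" if w["kind"] == "cardio" else "red", alpha=0.2)
ax.legend()
plt.show()
```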
I asked Claude to create some sample data and streamline the functions to reduce dependencies on the specific data sources I used. The resulting notebook is a comprehensive CGM data analysis tool that loads and processes glucose readings alongside lifestyle data (nutrition, workouts, and sleep), then creates an integrated dashboard for visualization. The code handles data preprocessing including interpolation of missing glucose values, timeline synchronization across different data sources, and statistical analysis with key metrics like time-in-range and coefficient of variation. The main output is a day-by-day dashboard that overlays workout periods, nutrition events, and sleep phases onto continuous glucose monitoring data, enabling users to identify patterns and correlations between lifestyle factors and blood sugar responses.
In late 2021, Lars Kaiser's paper on seasonality in cryptocurrencies inspired me to use my Kraken API key to try and make some money. A quick summary of the paper:
(1) Kaiser analyzes seasonality patterns across 10 cryptocurrencies (Bitcoin, Ethereum, etc.), examining returns, volatility, trading volume, and spreads.
(2) He finds no consistent calendar effects in cryptocurrency returns, supporting weak-form market efficiency.
(3) He observes robust patterns in trading activity: lower volume, volatility, and spreads in January, on weekends, and in the summer months.
(4) He documents the significant impact of the January 2018 market sell-off on seasonality patterns.
(5) He reports a "reverse Monday effect" for Bitcoin (positive Monday returns) and a "reverse January effect" (negative January returns).
(6) Trading activity patterns suggest crypto markets are dominated by retail rather than institutional investors.
The paper’s main finding: crypto markets appear efficient in terms of returns but show behavioral patterns in trading.
The efficient-market hypothesis (EMH) is a hypothesis in financial economics that states that asset prices reflect all available information. A direct implication is that it is impossible to “beat the market” consistently on a risk-adjusted basis since market prices should only react to new information.
The EMH has interesting implications for cryptocurrency markets. While major cryptocurrencies like Bitcoin and Ethereum have gained significant institutional adoption and liquidity, they may still be less efficient than traditional markets due to their relative youth and large audience of retail traders (who might not act as rationally as larger, institutional traders). This inefficiency becomes even more pronounced with smaller altcoins, which often have: (1) Lower trading volumes and liquidity (2) Less institutional participation (3) Higher information asymmetries (and/or greater susceptibility to manipulation). These factors create opportunities for exploiting market inefficiencies, particularly in the short term when prices may overreact to news or technical signals before eventually correcting.
Unlike Kaiser’s seasonality research, I didn’t focus on calendar-based anomalies over longer time horizons. After reviewing further research on cryptocurrency market inefficiencies [1][2][3][4], I was intrigued by predictable patterns in returns following large price movements. This led me to develop a classic mean reversion strategy instead (mean reversion suggests that asset prices tend to revert to their long-term average after extreme movements due to market overreactions and subsequent corrections).
First, I had to find “change points.” The PELT algorithm efficiently identifies points in ETH/EUR where the statistical properties of the time series change significantly. These changes could indicate market events, trend reversals, or volatility shifts in the cryptocurrency price.
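A common implementation of PELT is the ruptures library; whether that is exactly what ran here matters less than the idea, which the sketch below shows on a synthetic log-price series (the penalty value is an arbitrary choice).

```python
import numpy as np
import ruptures as rpt

# Synthetic stand-in for a series of ETH/EUR closes.
rng = np.random.default_rng(1)
prices = 2000 * np.exp(np.cumsum(rng.normal(0, 0.01, 2000)))

signal = np.log(prices).reshape(-1, 1)
algo = rpt.Pelt(model="rbf").fit(signal)     # PELT with an RBF cost function
breakpoints = algo.predict(pen=10)           # penalty controls sensitivity
print(breakpoints)                           # indices where properties shift
```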
I then implemented an automated mean reversion trading strategy following this logical flow: continuous monitoring → signal detection → buy execution → hold period → sell execution. The script continuously monitored prices for certain cryptocurrencies on the Kraken exchange. It executed buy orders when the price dropped by more than four standard deviations over a 2-hour period, then automatically sold after exactly 2 hours regardless of price movement. The strategy used fixed position sizes and limit orders to minimize fees. It assumed that large price drops represent temporary market overreactions that will reverse within the holding period.
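Stripped of the Kraken order placement, the signal logic reduces to something like this pandas sketch; the lookback used for the rolling statistics is an assumption, not the live script's exact parameter.

```python
import pandas as pd

def mean_reversion_signals(close: pd.Series, lookback=120, n_sigma=4):
    """close: minutely prices with a DatetimeIndex.
    Flags bars where the trailing 2-hour (120-minute) return is more than
    n_sigma standard deviations below the recent mean of such returns.
    A stylized reconstruction, not the exact live logic."""
    ret_2h = close.pct_change(periods=lookback)          # 2-hour return
    mu = ret_2h.rolling(lookback * 6).mean()             # ~12h of context (assumed)
    sigma = ret_2h.rolling(lookback * 6).std()
    return ret_2h < mu - n_sigma * sigma                 # buy signal on large drops

# After a buy, the script simply sold again exactly 2 hours later,
# regardless of price, using limit orders to keep fees low.
```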
This little script earned some good change—but then again, it was 2021.
My introduction to quantitative portfolio optimization happened during my undergraduate years, inspired by Attilio Meucci's Risk and Asset Allocation and the convex optimization teachings of Diamond and Boyd at Stanford. With enthusiasm and perhaps more confidence than expertise, I created my first "optimal" portfolio. What struck me most was the disconnect between theory and accessibility. Modern Portfolio Theory had been established for decades (Markowitz received the Nobel Prize for it in 1990), yet the optimization tools remained largely locked behind proprietary software.
> Nevertheless, only a few comprehensive software models are available publicly to use, study, or modify. We tackle this issue by engineering practical tools for asset allocation and implementing them in the Python programming language. The focus is to keep the tools simple enough for interested practitioners to understand the underlying theory yet provide adequate numerical solutions.
Today, the landscape has evolved dramatically. Projects like PyPortfolioOpt and Riskfolio-Lib have established themselves as sophisticated open-source alternatives, far surpassing my early efforts in both scope and sophistication. Despite its limitations, the project yielded several meaningful insights:
First, I set out to visualize Modern Portfolio Theory’s fundamental principle—the risk-return tradeoff that drives optimization decisions. This scatter plot showing the efficient frontier demonstrates this core concept.
The results of my first optimization: maintaining a 9.386% return while reducing volatility from 14.445% to 5.574%, more than doubling the Sharpe ratio from 0.650 to 1.684.
By varying the risk aversion parameter (gamma), the framework successfully adapted to different investor profiles, showcasing the flexibility of the optimization approach. This efficient frontier plot with different gamma values illustrates how the optimization framework adapts to different investor risk preferences.
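In the spirit of Diamond and Boyd's CVXPY examples, the core of such an optimization fits in a few lines. The expected returns and covariance below are synthetic placeholders, and gamma is the risk-aversion parameter mentioned above.

```python
import numpy as np
import cvxpy as cp

# Synthetic expected returns and covariance as placeholders for real estimates.
rng = np.random.default_rng(0)
n = 8
mu = rng.normal(0.08, 0.04, n)                 # expected annual returns
A = rng.normal(size=(n, n))
Sigma = A @ A.T / n * 0.04                     # positive semidefinite covariance

w = cp.Variable(n)
gamma = cp.Parameter(nonneg=True)              # risk-aversion parameter
ret = mu @ w
risk = cp.quad_form(w, Sigma)
problem = cp.Problem(cp.Maximize(ret - gamma * risk),
                     [cp.sum(w) == 1, w >= 0])  # long-only, fully invested

for g in [0.1, 1, 10]:                          # trace part of the efficient frontier
    gamma.value = g
    problem.solve()
    print(f"gamma={g:>4}: return={ret.value:.3f}, vol={np.sqrt(risk.value):.3f}")
```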
Perhaps most importantly, out-of-sample testing across diverse market conditions—including the 2018 bear market and 2019 bull market—demonstrated consistent CVaR reduction and improved risk-adjusted returns.
> We demonstrate how even in an environment with high correlation, achieving a competitive return with a lower expected shortfall and lower excess risk than the given benchmark over multiple periods is possible.
Looking back, the project feels embarrassingly naive—and surprisingly foundational. While it earned some recognition at the time, it now serves as a valuable reminder: sometimes the best foundation is built before you know enough to doubt yourself.
Similar to how Simon Willison describes his difficulties managing images for his approach to running a link blog, I found it hard to remain true to pure Markdown syntax while still embedding images in a responsive way on this site.
My current pipeline is as follows: I host all my images in an R2 bucket and serve them from static.philippdubach.com. I use Cloudflare's image resizing CDN so I never have to worry about serving images at the appropriate size or in the right format. I basically just upload them at the highest possible quality and Cloudflare takes care of the rest.
Since the site runs on Hugo, I needed a solution that would work within this static site generation workflow. Pure Markdown syntax like `![alt text](image.jpg)` is clean and portable, but it doesn't give me the responsive image capabilities I was looking for.
The solution I settled on was creating a Hugo shortcode that leverages Cloudflare’s image transformations while maintaining a simple, markdown-like syntax.
The shortcode generates a `<picture>` element with multiple `<source>` tags, each targeting different screen sizes and serving WebP format. Here's how it works: instead of writing standard Markdown image syntax, I use `{{< img src="image.jpg" alt="Description" >}}` in my posts. Behind the scenes, the shortcode constructs URLs for different breakpoints. This means I upload one high-quality image, but users receive perfectly sized versions: a 320px-wide WebP for mobile users, a 1600px version for desktop, and everything in between. The shortcode defaults to displaying images at 80% width and centered, but I can override this with a width parameter when needed. It's a nice compromise between the simplicity of Markdown and the power of modern responsive image techniques. The syntax remains clean and the performance benefits are substantial - especially important since images are often the heaviest assets on any webpage.
(May 2025) Update:
Completed migration to a fully custom Hugo build. Originally started with a fork of hugo-blog-awesome, but I’ve since rebuilt it from scratch.
(June 2025) Update:
Added LaTeX math rendering using MathJax 3. Created a minimal partial template that only loads the MathJax library on pages that actually contain LaTeX expressions.
(June 2025) Update II:
Enhanced SEO capabilities with comprehensive metadata support. Built custom Open Graph integration that automatically generates social media previews, plus added per-post keyword management.