Everything Is a DCF Model

A brilliant piece of writing from Michael Mauboussin and Dan Callahan at Morgan Stanley that was formative in shaping what I personally believe about valuation.

[…] we want to suggest the mantra “everything is a DCF model.” The point is that whenever investors value a stake in a cash-generating asset, they should recognize that they are using a discounted cash flow (DCF) model. […] The value of those businesses is the present value of the cash they can distribute to their owners. This suggests a mindset that is very different from that of a speculator, who buys a stock in anticipation that it will go up without reference to its value. Investors and speculators have always coexisted in markets, and the behavior of many market participants is a blend of the two.

Original paper linked in this post’s title.
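
To make the mantra concrete, here is a minimal sketch of the present-value calculation underlying any DCF; the cash flows, terminal value, and discount rate are made up purely for illustration:

```python
# A minimal sketch: the value of a cash-generating asset is the present value
# of the cash it can distribute to its owners. All numbers are made up.
def present_value(cash_flows, discount_rate):
    """Discount a list of future cash flows (one per year) back to today."""
    return sum(cf / (1 + discount_rate) ** t for t, cf in enumerate(cash_flows, start=1))

# Five years of hypothetical distributable cash flow, plus a terminal value in year 5
cash_flows = [100, 110, 120, 130, 140 + 2_000]
print(f"Estimated value: {present_value(cash_flows, discount_rate=0.08):,.0f}")
```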

LLM Helped Discover a New Cancer Therapy Pathway

Google gets a lot of scrutiny for some of their work in other domains; nevertheless, it’s fair to appreciate that they continue to put major resources behind using AI to accelerate therapeutic discovery. The model and resources are open access and available to the research community.

How C2S-Scale 27B works: A major challenge in cancer immunotherapy is that many tumors are “cold” — invisible to the body’s immune system. A key strategy to make them “hot” is to force them to display immune-triggering signals through a process called antigen presentation. We gave our new C2S-Scale 27B model a task: Find a drug that acts as a conditional amplifier, one that would boost the immune signal only in a specific “immune-context-positive” environment where low levels of interferon (a key immune-signaling protein) were already present, but inadequate to induce antigen presentation on their own.

From their press release:

C2S-Scale generated a novel hypothesis about cancer cellular behavior and we have since confirmed its prediction with experimental validation in living cells. This discovery reveals a promising new pathway for developing therapies to fight cancer.

For a 27B model, that’s really really neat! And on a more general note, scaling seems to deliver:

This work raised a critical question: Does a larger model just get better at existing tasks, or can it acquire entirely new capabilities? The true promise of scaling lies in the creation of new ideas, and the discovery of the unknown.

On a more critical note, it would be interesting to see whether this model can perform any better than existing simple linear models for predicting gene expression interactions.

Original bioRxiv paper linked in this post’s title.

The State of AI Report 2025

This year’s rendition of The State of AI Report is making the rounds on LinkedIn (yes, LinkedIn, the place where the great E = mc² + AI equation was “discovered”).

Worth keeping in mind that this is made by Nathan Benaich, the founder of Air Street Capital, a venture capital firm investing in “AI-first companies”, so it obviously comes with a lot of bias. It’s also a relatively small, open survey, with 1'200 “AI practitioners” polled. An example of the bias:

shows that 95% of professionals now use AI at work or home

It’s obvious that 95% of professionals don’t use AI at work or home, and these results are heavily skewed. Nevertheless, the slide deck has a nice comprehensive review of research headlines over the year:

  • OpenAI, Google, Anthropic, and DeepSeek all releasing reasoning models capable of planning, verification, and self-correction.
  • China’s AI systems closed the gap to establish the country as a credible #2 in global AI capability.
  • 44% of U.S. businesses now paying for AI tools (up from just 5% in 2023), average contracts reaching $530'000, and 95% of surveyed professionals using AI regularly—while the capability-to-price ratio doubles every 6-8 months.
  • Multi-gigawatt data centers backed by sovereign funds compete globally, with power supply and land becoming as critical as GPUs.

    (13/10/2025) Update: I was just reminded that a sample size of 1'200 yields a small margin of error, even for a national-level poll. The main concern, however, remains potential selection bias: participation is driven by people who want to take the survey. It’s unclear how much this bias affects the results, but purely in terms of sample size, 1'200 is more than sufficient.
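
As a quick back-of-the-envelope check on the sample-size point (my own arithmetic, not from the report):

```python
# Margin of error for n = 1'200 at 95% confidence, worst case p = 0.5
n, p, z = 1200, 0.5, 1.96
moe = z * (p * (1 - p) / n) ** 0.5
print(f"+/-{moe:.1%}")  # roughly +/-2.8 percentage points
```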

Popular Science Nobel Prize

Mary E. Brunkow just won the Nobel Prize in Physiology or Medicine 2025 (jointly awarded) for discoveries concerning peripheral immune tolerance.

Brunkow, meanwhile, got the news of her prize from an AP photographer who came to her Seattle home in the early hours of the morning. She said she had ignored the earlier call from the Nobel Committee. “My phone rang and I saw a number from Sweden and thought: ‘That’s just, that’s spam of some sort.’”

The reason this is worth sharing (besides the fantastic work itself) is that the Nobel Prize always comes with a “Popular Science” publication offering accessible layman descriptions of what was discovered and why it matters. It’s an in-depth look at the discoveries, pitched at roughly a university level I would say. Worth a read!

Agent-based Systems for Modeling Wealth Distribution

A question Gary Stevenson, the self-proclaimed best trader in the world, has been asking for some time is whether a wealth tax can fix Britain’s economy.

[…] he believed the continued parlous state of the economy would halt any interest rate hikes. The reason? Because when ordinary people receive money, they spend it, stimulating the economy, while the wealthy tend to save it. But our economic model promotes the concentration of wealth among a select few at the expense of everybody else’s living standards.

Owen Jones on Gary Stevenson for The Guardian

Something I generally find very useful and appealing is visualizing systems, models, and complexity. Wealth distribution definitely qualifies as such a complex system. The Affine Wealth Model

a stochastic, agent-based, binary-transaction Asset-Exchange Model (AEM) for wealth distribution that allows for agents with negative wealth

elegantly demonstrates how random transactions inevitably lead to Pareto distributions without intervention. In this Jupyter Notebook, Fabio Manganiello provides great visualizations of the wealth model. He shows how wealth distributes in an open market where a set of agents trades without any mechanism in place to prevent extreme inequality.

[Animation: two side-by-side histograms, Wealth Distribution with 0% wealth tax (left) vs. 5% wealth tax (right)]

The left graph shows that with no tax, wealth quickly accumulates in the pockets of a very small group of agents, while most of the other agents end up piling into the lowest bucket. As we introduce a wealth tax of 25%, then 5% (right graph) and 1%, the distribution becomes more even, and therefore more desirable from the perspective of wealth equality, as well as very stable over time, with the agents in the highest buckets quickly holding at most 3-4x of their initial amount.
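
For intuition, here is a minimal sketch of such a binary-transaction asset-exchange simulation. It follows the spirit of the notebook but is not its code: the tax here is a flat levy on each transferred amount that gets redistributed equally, and the parameter names and values (e.g. max_exchanged_share) are illustrative.

```python
import numpy as np

def gini(w):
    """Gini coefficient of a wealth array (0 = perfect equality, 1 = maximal inequality)."""
    w = np.sort(w)
    n = len(w)
    return (2 * np.arange(1, n + 1) - n - 1) @ w / (n * w.sum())

def simulate(n_agents=1000, steps=200_000, max_exchanged_share=0.25, tax_rate=0.0, seed=0):
    """Binary-transaction asset-exchange model with a flat tax on each transfer."""
    rng = np.random.default_rng(seed)
    wealth = np.ones(n_agents)  # everyone starts with one unit of wealth
    for _ in range(steps):
        i, j = rng.choice(n_agents, size=2, replace=False)
        # Stake a share of the poorer agent's wealth; a fair coin decides who pays.
        stake = max_exchanged_share * min(wealth[i], wealth[j])
        payer, receiver = (i, j) if rng.random() < 0.5 else (j, i)
        tax = tax_rate * stake            # levied on the sent amount
        wealth[payer] -= stake
        wealth[receiver] += stake - tax
        wealth += tax / n_agents          # redistribute the collected tax equally
    return wealth

for tax in (0.0, 0.05):
    w = simulate(tax_rate=tax)
    top_share = np.sort(w)[-10:].sum() / w.sum()  # wealth held by the richest 1% of agents
    print(f"tax {tax:.0%}: Gini {gini(w):.2f}, top-1% share {top_share:.1%}")
```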

As with any model, the paper as well as the simulation have their limitations, but again my interest is more in the way a few lines of code can visualize economic relationships elegantly. It would be interesting to further investigate: (1) How sensitive are these equilibrium distributions to the transaction constraint (max_exchanged_share)? Does allowing larger transfers accelerate concentration or fundamentally alter the steady-state Gini coefficient? (2) The above wealth tax implementation taxes the sender, but what happens if we model progressive taxation on received amounts above median wealth instead? Does the locus of taxation matter for distributional outcomes?

Visualizing Gradients with PyTorch

The gradient is one of the most important concepts in calculus and machine learning, but it’s often poorly understood. Trying to understand it better myself, I wanted to build a visualization tool that helps me develop the correct mental picture of what the gradient of a function is. I came across GistNoesis/VisualizeGradient and went on from there to write my own iteration. This mental model generalizes beautifully to higher dimensions and is crucial for understanding optimization algorithms.

[2D Gradient Plot: the colored surface shows function values; black arrows show gradient vectors in the input plane (x-y space), pointing in the direction of steepest ascent.]
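
If you want to replicate the idea, here is a minimal sketch of how such a plot can be produced with torch.autograd and matplotlib; the scalar field and grid are my own illustrative choices, not necessarily what the repo or my tool uses:

```python
import torch
import matplotlib.pyplot as plt

# Example scalar field f(x, y); purely illustrative.
def f(x, y):
    return torch.sin(x) * torch.cos(y)

# Grid of input points at which to evaluate the gradient
x = torch.linspace(-2.0, 2.0, 20)
y = torch.linspace(-2.0, 2.0, 20)
X, Y = torch.meshgrid(x, y, indexing="ij")
X = X.clone().requires_grad_(True)
Y = Y.clone().requires_grad_(True)

Z = f(X, Y)
# Each Z[i, j] depends only on X[i, j] and Y[i, j], so summing before backward()
# leaves the per-point partial derivatives in X.grad and Y.grad.
Z.sum().backward()

fig, ax = plt.subplots(figsize=(6, 5))
c = ax.contourf(X.detach(), Y.detach(), Z.detach(), levels=30, cmap="viridis")
ax.quiver(X.detach(), Y.detach(), X.grad, Y.grad, color="black")
fig.colorbar(c, ax=ax, label="f(x, y)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Gradient field: arrows point toward steepest ascent")
plt.show()
```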

If you are interested in having a closer look or replicating my approach, the full project can be found on my GitHub. I’m also looking forward to doing something similar on the Central Limit Theorem, as well as a short tutorial on plotting options volatility surfaces with Python, a project I have been meaning to finish for some time now.

Sentiment Trading Revisited

Interesting new paper that builds on many of the ideas I explored in this project. The research, by Ayaan Qayyum, an Undergraduate Research Scholar at Rutgers, shows that the core concept of using advanced language models for sentiment trading is not only viable but highly effective. The study takes a similar but more advanced approach. Instead of using a model like GPT-3.5 to generate a simple sentiment score, it uses OpenAI’s embedding models to convert news headlines into rich, high-dimensional vectors. By training a battery of neural networks including

Gated Recurrent Units (GRU), Hidden Markov Model (HMM), Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and a Feed-Forward Neural Network (FFNN). All were implemented using PyTorch.

on these embeddings alongside economic data, the study found it could reduce prediction errors by up to 40% compared to models without the news data.
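
For intuition, here is a minimal sketch of the general recipe (embed headlines, then fit a small regressor on the vectors); the headlines, return targets, and network size are placeholders, and this is not the paper’s actual architecture:

```python
import torch
import torch.nn as nn
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

headlines = [
    "Fed signals pause in rate hikes",           # placeholder headlines
    "Tech giant misses earnings expectations",
]
resp = client.embeddings.create(model="text-embedding-3-small", input=headlines)
X = torch.tensor([d.embedding for d in resp.data])   # shape (n_headlines, 1536)
y = torch.tensor([[0.004], [-0.012]])                # hypothetical next-day returns

# Small feed-forward regressor (the paper also trains GRU/LSTM/TCN/HMM variants)
model = nn.Sequential(nn.Linear(X.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

print(f"final training loss: {loss.item():.6f}")
```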

The most surprising insight to me, and one that directly addresses the challenge of temporal drift I discussed, was that Qayyum’s time-independent models performed just as well as, if not better than, the time-dependent ones. By shuffling the data, the models were forced to learn the pure semantic impact of a headline, independent of its specific place in time. This suggests that the market reacts to the substance of news in consistent ways, even if the narratives themselves change.

Counting Cards with Computer Vision

After installing Claude Code

the agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster through natural language commands

I was looking for a task to test its abilities. Fairly quickly we wrote less than 200 lines of Python code predicting blackjack odds using Monte Carlo simulation. When I went on to test this little tool on the Washington Post’s online blackjack game (I also didn’t know that existed!), I quickly noticed how impractical it was to manually input all the card values on the table. What if the tool could also automatically recognize the cards on the table and calculate the odds from them? I had never done anything with computer vision, so this seemed like a good challenge.

To get to any reasonable result we have to start with classification, where we “teach” the model to categorize data by showing it lots of examples with correct labels. But where do the labels come from? I manually annotated 409 playing cards across 117 images using Roboflow Annotate (at first I only did half as much; why this wasn’t a good idea we’ll see in a minute). Once enough screenshots of cards were annotated, we can train the model to recognize the cards and predict card values on tables it has never seen before. I was able to use an NVIDIA T4 GPU inside Google Colab, which offers some GPU time for free when capacity is available.

During training, the algorithm learns patterns from this example data, adjusting its internal parameters millions of times until it gets really good at recognizing the differences between categories (in this case, different cards). Once trained, the model can then make predictions on new, unseen data by applying the patterns it learned.

With the annotated dataset ready, it was time to implement the actual computer vision model. I chose Ultralytics’ YOLOv11, a state-of-the-art object detection model. I set up the environment in Google Colab following the “How to Train YOLO11 Object Detection on a Custom Dataset” notebook. After extracting the annotated dataset from Roboflow, I began training using the pre-trained YOLOv11s weights as a starting point. This approach, called transfer learning, allows the model to leverage patterns already learned from millions of general images and adapt them to this specific task. I initially set it up to run for 350 epochs, though the model’s built-in early stopping mechanism kicked in after 242 epochs when no improvement had been observed for 100 consecutive epochs. The best results were achieved at epoch 142, taking around 13 minutes to complete on the Tesla T4 GPU.

The initial results were quite promising, with an overall mean Average Precision (mAP) of 80.5% at an IoU threshold of 0.5. Most individual card classes achieved good precision and recall scores, with only a few cards like the 6 and the Queen showing slightly lower precision values.

[Training results showing confusion matrix and loss curves]

However, looking at the confusion matrix and loss curves revealed some interesting patterns. While the model was learning effectively (as shown by the steadily decreasing loss), there were still some misclassifications between similar cards, particularly among the numbered cards. This highlighted exactly why I mentioned earlier that annotating only half the amount of data initially “wasn’t a good idea”: more training examples would likely improve these edge cases and reduce confusion between similar-looking cards.
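
For reference, the training step described above boils down to a few lines with the Ultralytics API; the dataset path is an illustrative placeholder for the Roboflow export:

```python
from ultralytics import YOLO

# Start from the small pre-trained checkpoint (transfer learning)
model = YOLO("yolo11s.pt")

# Train on the exported dataset; patience=100 mirrors the early-stopping
# behaviour described above. Paths and image size are illustrative.
results = model.train(
    data="playing-cards/data.yaml",  # hypothetical path to the Roboflow export
    epochs=350,
    patience=100,
    imgsz=640,
    device=0,                        # the Colab T4 GPU
)

# Evaluate on the validation split (reports mAP@50, precision, recall)
metrics = model.val()
```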
My first attempt at solving the remaining accuracy issues was to add another layer to the workflow by sending the detected cards to Anthropic’s Claude API for additional OCR processing.

[Roboflow workflow with Claude API integration]

This hybrid approach was very effective: combining YOLO’s object detection, which dynamically crops the blackjack table down to individual cards, with Claude’s advanced vision capabilities yielded 99.9% accuracy on the predicted cards. However, this solution came with a significant drawback: the additional API round-trip and the large model’s processing overhead consumed valuable time, making it impractical for real-time gameplay. Seeking a faster solution, I implemented the same workflow locally using EasyOCR instead. EasyOCR seems to be really good at extracting black text on a white background but can struggle with everything else. While it was able to correctly identify the card numbers when it detected them, it struggled to recognize around half of the cards in the first place, even when fed pre-cropped card images directly from the YOLO model. This inconsistency made it unreliable for the application.

Rather than continue with band-aid solutions, I decided to go back and improve my dataset. I doubled the training data by adding another 60 screenshots with the same train/test split as before. More importantly, I went through all the previous annotations and fixed many of the bounding polygons. I noticed that several misidentifications were caused by the model detecting face-down dealer cards as valid cards, which happened because some annotations for face-up cards inadvertently included parts of the card backs next to them.

The improved dataset and cleaned annotations delivered what I was hoping for: the confusion matrix now shows a much cleaner diagonal pattern, indicating that the model correctly identifies most cards without the cross-contamination issues we saw earlier.

[Final training results with improved dataset]

Both the training and validation losses converge smoothly without signs of overfitting, while the precision and recall metrics climb steadily to plateau near perfect scores. The mAP@50 reaches an impressive 99.5%. Most significantly, the model has virtually eliminated false positives with background elements: the “background” column (rightmost) in the confusion matrix is now much cleaner, with only minimal misclassifications of actual cards as background noise.

[Real-time blackjack card detection and odds calculation]

With the model trained and performing, it was time to deploy it and play some blackjack. Initially, I tested the system using Roboflow’s hosted API, which took around 4 seconds per inference - far too slow for practical gameplay. However, running the model locally on my laptop dramatically improved performance, achieving inference times of less than 0.1 seconds per image (1.3 ms preprocess, 45.5 ms inference, 0.4 ms postprocess per image). I then integrated the model with MSS to capture a real-time feed of my browser window. The system automatically overlays the detected cards with their predicted values and confidence scores. The final implementation successfully combines the pieces: the computer vision model detects and identifies cards in real time, feeds this information to the Monte Carlo simulation, and displays both the card recognition results and the calculated odds directly on screen - do not try this at your local (online) casino!
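
The real-time loop is conceptually simple. Here is a minimal sketch of the screen-capture-plus-local-inference part, with an illustrative capture region and weights path (the overlay and Monte Carlo pieces are omitted):

```python
import numpy as np
import mss
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # illustrative path to the trained weights

# Region of the browser window to capture (illustrative coordinates)
region = {"top": 100, "left": 100, "width": 1280, "height": 720}

with mss.mss() as sct:
    while True:  # Ctrl+C to stop
        frame = np.array(sct.grab(region))[:, :, :3]   # BGRA screenshot -> BGR array
        results = model.predict(frame, conf=0.5, verbose=False)
        detected = [model.names[int(box.cls)] for box in results[0].boxes]
        # `detected` (e.g. ["10", "A", "6"]) would be handed to the Monte Carlo
        # odds calculation and drawn back onto the screen overlay.
        print(detected)
```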

NVIDIA Likes Small Language Models

A Small Language Model (SLM) is a LM that can fit onto a common consumer electronic device and perform inference with latency sufficiently low to be practical when serving the agentic requests of one user. […] We note that as of 2025, we would be comfortable with considering most models below 10bn parameters in size to be SLMs.

The (NVIDIA) researchers argue that most agentic applications perform repetitive, specialized tasks that don’t require the full generalist capabilities of LLMs. They propose heterogeneous agentic systems where SLMs handle most tasks while LLMs are used selectively for complex reasoning. They present three main arguments: (1) SLMs are sufficiently powerful for agentic tasks, as demonstrated by recent models like Microsoft’s Phi series, NVIDIA’s Nemotron-H family, and Hugging Face’s SmolLM2 series, which achieve comparable performance to much larger models while being 10-30x more efficient. (2) SLMs are inherently more operationally suitable for agentic systems due to their faster inference, lower latency, and ability to run on edge devices. (3) SLMs are necessarily more economical, offering significant cost savings in inference, fine-tuning, and deployment.

The paper addresses counterarguments about LLMs’ superior language understanding and centralization benefits with studies (see Appendix B: LLM-to-SLM Replacement Case Studies) showing that 40-70% of LLM queries in popular open-source agents (MetaGPT, Open Operator, Cradle) could be replaced by specialized SLMs. One comment I read raised important concerns about the paper’s analysis, particularly regarding context windows, which are arguably the highest technical barrier to SLM adoption in agentic systems. Modern agentic applications require substantial context: Claude 4 Sonnet’s system prompt alone reportedly uses around 25k tokens, and a typical coding agent needs system instructions, tool definitions, file context, and project documentation, totaling 5-10k tokens before any actual work begins. Most SLMs that can run on consumer hardware are capped at 32k or 128k contexts architecturally, but achieving reasonable inference speeds at these limits requires gaming hardware (8 GB of VRAM for a 7B model at 128k context).
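
To put the context-budget concern in numbers, a rough sketch using the figures quoted above (all values illustrative):

```python
# Rough context-budget arithmetic; numbers are the rough figures quoted above.
window = 32_000              # typical architectural cap for a consumer-runnable SLM
agent_setup = 8_000          # system instructions, tool defs, file context, docs (5-10k)
big_system_prompt = 25_000   # a Claude-4-Sonnet-sized system prompt alone

print(f"Left for actual work with a lean agent prompt: {window - agent_setup:,} tokens")
print(f"Left if the system prompt alone is ~25k:       {window - big_system_prompt:,} tokens")
```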

The paper concludes that the shift to SLMs is inevitable due to economic and operational advantages, despite current barriers including infrastructure investment in LLM serving, generalist benchmark focus, and limited awareness of SLM capabilities. But the economic efficiency claims also face scrutiny under system-level analysis. In Section 3.2 they present simplistic FLOP comparisons while ignoring critical inefficiencies: the reliance on multi-shot prompting where SLMs might require 3-4 attempts for tasks that LLMs complete with a 90% success rate, task decomposition overhead that multiplies context setup costs and error rates, and infrastructure efficiency differences between optimized datacenters (PUE ratios near 1.1, >90% GPU utilization) and consumer hardware (5-10% GPU utilization, residential HVAC, 80-85% power conversion efficiency). When accounting for failed attempts, orchestration overhead, and infrastructure efficiency, many “economical” SLM deployments might actually consume more total energy than centralized LLM inference.
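
A stylized expected-cost comparison illustrates that point; every number below is an assumption drawn from the ranges mentioned above, not a measurement:

```python
# Back-of-the-envelope energy per *completed* task. All inputs are illustrative
# assumptions taken from the ranges mentioned above.
def energy_per_task(flops_per_attempt, success_rate, power_overhead, gpu_utilization):
    attempts = 1.0 / success_rate  # simple geometric retry model
    return flops_per_attempt * attempts * power_overhead / gpu_utilization

# Hypothetical 7B SLM on consumer hardware vs. a much larger datacenter LLM
slm = energy_per_task(1.0,  success_rate=0.30, power_overhead=1.4, gpu_utilization=0.08)
llm = energy_per_task(25.0, success_rate=0.90, power_overhead=1.1, gpu_utilization=0.90)
print(f"SLM: {slm:.0f}, LLM: {llm:.0f} (arbitrary units)")
# Under these assumptions the SLM's ~25x per-attempt advantage disappears once
# retries, facility overhead, and GPU utilization are accounted for.
```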

(05/07/2025) Update: On the topic of speed, I just came across Ultra-Fast Language Models Based on Diffusion. You can also test it yourself using the free playground link, and it is in fact extremely fast. Try the “Diffusion Effect” in the top right corner, which toggles an interesting visualization. I’m not sure how realistic it is: it shows text appearing as random noise before gradually resolving into clear words, though the actual process likely involves tokens evolving from imprecise vectors in a multidimensional space toward more precise representations until they crystallize into specific words.

(06/07/2025) Update II: Apparently there is also a Google DeepMind Gemini Diffusion Model.

Novo Nordisk's Post-Patent Strategy

Novo Nordisk, a long-time member of my “regrets” stock list, has become reasonably affordable lately (-48% yoy). Part of the reason is that they currently sit atop a ~$20 billion Ozempic/Wegovy franchise that faces patent expiration in 2031. That’s roughly seven years to replace their blockbuster drug. We revisit them today since, per newly published Lancet data, Novo’s lead replacement candidate, amycretin, just posted some genuinely impressive Phase 1 results. The injectable version delivered 24.3% average weight loss versus 1.1% for placebo, beating both current market leaders (Wegovy at 15% and Lilly’s Zepbound at 22.5%). Even the oral version hit 13.1% weight loss in just 12 weeks, with patients still losing weight when the trial ended.

Amycretin is very elegantly designed: it combines semaglutide (the active ingredient in Ozempic/Wegovy) with amylin, creating what’s essentially a dual-pathway satiety signal. Semaglutide activates GLP-1 receptors to slow gastric emptying and reduce appetite centrally, while amylin works through complementary mechanisms to enhance fullness signals. This way both your stomach and your brain’s “appetite control center” get the “stop eating” message simultaneously. One concern raised by Elaine Chen at STAT is that the results of a Phase 1/2 study include unusual findings around dosage. The full-text article is unfortunately behind a paywall, so I did not have access. However, looking at the actual data from the study, I am assuming she is referring to Parts C, D, and E, which tested maintenance doses of 20 mg, 5 mg, and 1.25 mg respectively. The weight loss results were:

  • Part C (20 mg): -22.0% weight loss at 36 weeks
  • Part D (5 mg): -16.2% weight loss at 28 weeks
  • Part E (1.25 mg): -9.7% weight loss at 20 weeks

While there is a dose-response relationship, what’s notable is that the curves in Figure 3 show relatively similar trajectories during the overlapping time periods. Typically in drug development, researchers would expect clear separation between dose groups (with higher doses producing proportionally greater effects). When weight-loss curves overlap significantly (which they do in this case), it suggests the doses may be producing similar effects despite different drug concentrations. If lower doses produce similar weight loss with potentially fewer side effects, this could favor using the lower, better-tolerated dose. Further, it might indicate that amycretin reaches its maximum effect at relatively low doses. This should probably influence how future Phase 3 trials are designed, potentially focusing on the optimal dose rather than the maximum tolerated dose. Given that gastrointestinal side effects were dose-dependent but efficacy curves overlapped, this supports using the lowest effective dose. How that might be a bad thing I have yet to find out.

From a financial perspective, Novo Nordisk’s pipeline is very interesting: Amycretin’s injectable version is currently in Phase 2, suggesting Phase 3 trials around 2026-2027, with potential approval by 2031; basically right as the Ozempic patents expire. But Novo isn’t betting everything on amycretin. They’re running what appears to be a diversified pipeline strategy with multiple shots on goal: NNC-0519 (another next-gen GLP-1), NNC-0662 (details kept confidential), and cagrilintide combinations. This makes sense: you want multiple candidates because the failure rate in drug development makes even the most promising compounds statistically likely to fail. Eli Lilly’s tirzepatide (Mounjaro/Zepbound) works through a different mechanism—GLP-1 plus GIP receptor activation—and appears to be gaining market share. Lilly’s orforglipron, an oral GLP-1 that hit 14.7% weight loss in Phase 2, represents another competitive threat. Judging by LLY’s price development, investors currently seem to think that Lilly is doing a better job at architecting a portfolio than Novo (or at least providing more disclosure about their pipeline). Yet, the overall competitive landscape might actually benefit both companies. The “war” between Novo and Lilly is expanding the overall market for obesity treatments, potentially growing the pie faster than either company is losing share. Also, to analyze the financial impact of the expiring Ozempic patents, we have to look further than just Novo’s research pipeline. Manufacturing these GLP-1 compounds and their delivery devices is “pretty tough.” Complex peptides requiring specialized manufacturing capabilities, plus the injection devices themselves are patent-protected. This creates what we would call a capacity constraint moat in corporate strategy. Novo’s manufacturing capabilities/partnerships and injectable device patents are a key competitive advantage. Even when semaglutide goes generic in 2031, the entire generic pharmaceutical industry would essentially need to coordinate to build sufficient manufacturing capacity to meaningfully dent Novo’s market share. Meanwhile, Novo could potentially defend by lowering prices while maintaining manufacturing advantages in a monopoly-to-oligopoly transition.

The other day I came across Martin Shkreli’s NOVO model. Conservatively, it puts Novo’s fair value around 705 DKK (21% upside from ~585 DKK), while a failure scenario drops valuation to 385 DKK. The range reflects what you’d expect for a large-cap pharmaceutical company; the market has already incorporated most knowable information about pipeline risks and patent timelines. This also underscores the point that manufacturing capabilities and continuous innovation pipelines can potentially maintain quasi-monopolistic positions longer than traditional patent protection would suggest. Shkreli’s analysis suggests Novo Nordisk is reasonably valued with modest upside potential, contingent on successful pipeline execution. Novo Nordisk is at a critical juncture, with substantial franchise value dependent on successful pipeline execution over the next 7-8 years. While the current valuation appears reasonable, the binary nature of drug development success creates both upside potential and significant downside risk.

This article is for informational purposes only; you should not consider any information or other material on this site as investment, financial, or other advice. There are risks associated with investing.