
Technology/May 29, 2026/14 min read/youtu.be

Inference, Diffusion, World Models, and More | YC Paper Club | 29 May 2026 | Y Combinator


"If you have a method, an algorithm, a system where its performance scales with the amount of thinking it does, then fundamentally the speed at which you can do inference, the tokens per second, is exactly the peak intelligence that you can deliver." - Tanishk [00:06:03]

"With DMPC, what we did is to use diffusion models to learn both multi-step action proposals and multi-step dynamics models... it can simplify the planning algorithm." - Stannis [00:20:38]


Disclaimer: Original content is owned by or sourced from third parties. It does not represent the views of the 'Nuggets' platform or its team. AI is used extensively across this platform, including for summaries. Accuracy is not guaranteed; there can be mistakes. No information or content on this platform is financial, legal, or investment advice. Do your own research. For complete disclosures, refer to: Terms of Use · Full Disclaimer


"The question is still out on what the best way to do it is, but it's about 50 times faster than any of the competition across the board because it's doing all this work in the latent space." - Isaac Ward [00:41:39]

"Deep learning so-called mysteries are actually consistent and partially explained by existing theories such as soft inductive biases and PAC-Bayes." - Ashe [00:49:32]

"What this suggests is that as time passes on, the amount of compute that we're willing to spend per data point is going to continue to increase by roughly 4x year-over-year." - Ku [00:53:01]


Speakers & Credentials

  • Host: Y Combinator alumnus (Winter 16 batch) and ecosystem organizer hosting the inaugural YC Paper Club at the historic Pioneer location.
  • Tanishk: Graduate student at Stanford University; core systems researcher specializing in machine learning inference optimization alongside Tri Dao and Aravind Srinivas.
  • Stannis: Senior Research Scientist at Google DeepMind; co-leading foundational projects on world modeling, diffusion-based agents, and general-purpose robotics control.
  • Isaac Ward: Autonomous systems and world models researcher; exploring architectural regularizers for Joint Embedding Predictive Architectures (JEPA), following Yann LeCun's framework.
  • Ashe: President and Co-Founder of Q Labs; collaborator alongside leading academic groups analyzing neural network generalization limits and optimization constraints.
  • Ku (Con Woo): Machine learning researcher at Stanford, collaborating with Christopher Ré's and Percy Liang's groups; focusing on empirical scaling laws in data-starved regimes.

1. Executive Summary

  • The inaugural YC Paper Club launches a community hub meant to connect Silicon Valley startup founders with machine learning researchers.
  • Inference Efficiency: Speculative Speculative Decoding (SSD) restructures large language model inference by running the drafting and verification loops in parallel on separate hardware.
  • Robotic Planning: Diffusion Model Predictive Control (DMPC) factorizes the action proposer from the environment dynamics model, reducing compounding errors and allowing rapid adaptation to physical changes without heavy runtime planning overhead.
  • Latent World Models: Regularizers such as SIGG prevent representational collapse in Joint Embedding Predictive Architectures, and planning entirely in latent space runs about 50x faster than pixel-generating alternatives.
  • Generalization Demystified: Classical frameworks, specifically PAC-Bayes bounds, show that deep learning behaviors such as overparameterization and benign overfitting follow predictable statistical rules when measured through compressibility.
  • Data Limits: In data-constrained pre-training regimes, standard recipes overfit; aggressive weight decay, multi-model ensembling, and self-distillation keep data efficiency high when compute is effectively unlimited.

2. Chronological Table of Contents

  • 00:00:07 - Welcome & Institutional Introduction to YC Paper Club
  • 00:03:45 - Talk 1: Speculative Speculative Decoding (Tanishk)
  • 00:17:32 - Talk 2: Diffusion Model Predictive Control (Stannis)
  • 00:29:50 - Talk 3: Latent World Models & Yann LeCun's JEPA (Isaac Ward)
  • 00:43:21 - Talk 4: Deep Learning Generalization & PAC-Bayes (Ashe)
  • 00:50:19 - Talk 5: Scaling Pre-Training Under Infinite Compute & Data Constraints (Ku)

3. Detailed Thematic Summary

Welcome & Institutional Context [00:00:07]

  • More than 1,000 people applied to attend the first event, so organizers limited admission to roughly 100 selected founders and researchers [00:00:21].
  • The mission is to revitalize the historic YC Pioneer office, home of the Winter 16 batch, in which 140 companies produced 10 to 15 unicorns, including Webflow, Astranis, and Deepgram [00:01:55].
  • During OpenAI's early days, core figures such as Andrej Karpathy, Wojciech Zaremba, and Greg Brockman routinely brainstormed in this same facility, asking non-AI founders about software friction points to guide their research agenda [00:02:16].
  • Organizers note that roughly half of the top-tier AI talent in the Bay Area lives outside San Francisco proper, creating a clear need for a Peninsula hub that brings together researchers from Google DeepMind, Tesla, xAI, and Stanford [00:02:54].

Talk 1: Speculative Speculative Decoding (SSD) [00:03:45]

  • Tanishk reframes inference speed as a foundational capability rather than an operational cost lever, arguing that when performance scales with test-time compute, tokens per second bound the peak intelligence a system can deliver [00:05:45].
  • Vanilla speculative decoding uses a small draft model to generate a run of tokens autoregressively, which the large target model then verifies in parallel in a single forward pass [00:08:01].
  • If a draft token is rejected during verification, the target model's already-computed log probabilities are used to sample a single "bonus token" at no extra forward-pass cost [00:10:15].
  • SSD removes the sequential bottleneck of vanilla speculative decoding, where drafting must wait for verification, by running the drafting and verification pipelines in parallel on separate hardware [00:12:03].
  • The draft side uses its probability tables to predict the target model's verification outcomes ahead of time, guessing the fallback bonus token with 80% to 90% accuracy [00:13:51].
  • This pipelined design reaches 300 tokens per second for Llama-3-70B on 4 H100 GPUs [00:17:12].
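
The vanilla draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of the greedy-matching variant only, not the SSD system from the talk: `target_model` and `draft_model` are hypothetical toy functions standing in for real networks, and the verification loop only mimics what a real target model would produce in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # hypothetical greedy next-token interface


def target_model(ctx: List[Token]) -> Token:
    """Toy stand-in for the large model: a fixed deterministic rule."""
    return (ctx[-1] * 3 + len(ctx)) % 7


def draft_model(ctx: List[Token]) -> Token:
    """Toy stand-in for the small model: wrong whenever len(ctx) is a multiple of 4."""
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 7


def speculative_step(ctx: List[Token], draft: Model, target: Model, k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens to append to ctx."""
    # 1) Draft k tokens autoregressively with the cheap model.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft(ctx + proposal))
    # 2) Verify each drafted token against the target's own greedy choice.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # replacement token supplied by the target
            return accepted
        accepted.append(tok)
    # 3) Every draft was accepted, so the target contributes one extra token.
    accepted.append(target(ctx + accepted))
    return accepted


def generate(prompt: List[Token], n_new: int, draft: Model, target: Model,
             k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also report how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out += speculative_step(out, draft, target, k)
        rounds += 1
    return out[: len(prompt) + n_new], rounds


def greedy(prompt: List[Token], n_new: int, target: Model) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out
```

Calling `generate([1, 2], 20, draft_model, target_model)` returns the same tokens as `greedy([1, 2], 20, target_model)` while needing fewer verify rounds than tokens, which is the property that makes the scheme lossless. Sampling-based variants replace the equality check with a probabilistic accept/reject rule.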

Talk 2: Diffusion Model Predictive Control (DMPC) [00:17:32]

  • Stannis presents Model Predictive Control (MPC) as an approach that pairs a dynamics (world) model with an action-selection procedure, so an agent can optimize reward functions it first encounters at test time [00:19:15].
  • In practice, traditional MPC is limited by compounding prediction errors over long horizons and by the heavy compute cost of planning at runtime [00:20:25].
  • DMPC uses diffusion models to learn both multi-step action proposals and multi-step dynamics (state trajectory) models from diverse offline datasets [00:20:47].
  • This factorization separates action proposals from transition dynamics. If the agent's body changes, for example a walker whose left ankle breaks, the dynamics model can be quickly re-trained on raw play data while the action proposer is left untouched [00:28:16].
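
The factorization above can be illustrated with a toy planner. This is a sketch under stated assumptions, not DMPC itself: `propose_plans` is a hypothetical random sampler standing in for a learned diffusion action proposer, `healthy_dynamics` and `damaged_dynamics` are hand-written stand-ins for learned multi-step dynamics models, and the planner is a plain sample-and-score loop on a one-dimensional state.

```python
import random
from typing import Callable, List, Sequence

Plan = List[float]
Dynamics = Callable[[float, Sequence[float]], List[float]]


def propose_plans(rng: random.Random, n: int, horizon: int) -> List[Plan]:
    """Hypothetical stand-in for a learned multi-step action proposer."""
    return [[rng.choice([-1.0, 0.0, 1.0]) for _ in range(horizon)] for _ in range(n)]


def healthy_dynamics(state: float, plan: Sequence[float]) -> List[float]:
    """Toy multi-step dynamics: each action shifts the state by its value."""
    traj = []
    for a in plan:
        state = state + a
        traj.append(state)
    return traj


def damaged_dynamics(state: float, plan: Sequence[float]) -> List[float]:
    """Same interface after a simulated fault: positive actions are half as effective."""
    traj = []
    for a in plan:
        state = state + (0.5 * a if a > 0 else a)
        traj.append(state)
    return traj


def plan_once(state: float, goal: float, dynamics: Dynamics,
              rng: random.Random, n: int = 64, horizon: int = 5) -> Plan:
    """Sample candidate plans, predict their outcomes, keep the best-scoring one."""
    def score(plan: Plan) -> float:
        # The reward is supplied at planning time: negative distance to the goal.
        return -abs(dynamics(state, plan)[-1] - goal)

    candidates = propose_plans(rng, n, horizon)
    return max(candidates, key=score)
```

Swapping `healthy_dynamics` for `damaged_dynamics` in the call to `plan_once` changes which plan is selected without touching `propose_plans`, which mirrors the broken-ankle adaptation described in the talk.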

Talk 3: Latent World Models & Yann LeCun's JEPA [00:29:50]

  • Isaac Ward explains the motivation behind Yann LeCun's company raising $1.03 billion: building explicit internal world representations instead of relying on model-free, end-to-end architectures [00:30:55].
  • Jointly learning representations of a high-dimensional observation space alongside action trajectories frequently causes "representational collapse," where the network maps every state to the same point because that trivially minimizes the loss [00:36:11].
  • "Latent World Model" applies a SIGG regularizer (Sketching, Isotropic, Gaussian distributed) that pushes high-dimensional embeddings toward an un-collapsed isotropic Gaussian distribution by checking cheap one-dimensional slices [00:39:04].
  • Running all planning entirely in this latent space makes the architecture up to 50 times faster than pixel-generating alternatives [00:41:39].
  • The system fits in a compact 15-million-parameter footprint and needs less than 24 gigabytes of VRAM, so it can run on a single consumer-grade GPU [00:41:52].
  • Explicit world modeling also enables surprise quantification: unexpected environment changes cause immediate, measurable spikes in the model's prediction error [00:41:57].
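
The idea of checking one-dimensional slices can be sketched as follows. This is a simplified moment-matching stand-in, not the regularizer from the talk: the function name `sliced_gaussian_penalty`, the choice of 16 slices, and the use of mean and variance as the per-slice statistic are all assumptions made for illustration.

```python
import math
import random
from typing import List

Vector = List[float]


def random_unit_vector(rng: random.Random, dim: int) -> Vector:
    """Draw a random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sliced_gaussian_penalty(embeddings: List[Vector], n_slices: int = 16,
                            seed: int = 0) -> float:
    """Average of mean^2 + (variance - 1)^2 over random 1-D projections.

    The penalty is zero only when every sampled slice has mean 0 and
    variance 1, as slices of an isotropic standard Gaussian would.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(n_slices):
        u = random_unit_vector(rng, dim)
        proj = [sum(a * b for a, b in zip(z, u)) for z in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / n_slices
```

A batch of embeddings that has collapsed to a single point has zero variance along every slice and is penalized heavily, while a batch drawn from a standard Gaussian scores close to zero. Each slice costs only a dot product per embedding, which is why this style of check is cheap in high dimensions.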

Talk 4: Deep Learning Generalization & PAC-Bayes [00:43:21]

  • Ashe reviews Andrew Gordon Wilson's research to explain why massively overparameterized systems generalize well instead of overfitting their training data [00:43:54].
  • The PAC-Bayes framework establishes that, with high probability, test loss is bounded by training loss plus a compression penalty term [00:45:00].
  • Increasing the parameter count grows the volume of flat minima in the loss landscape exponentially relative to brittle, sharp ones [00:47:07]. Because flat minima take far fewer bits to encode, overparameterization naturally leads to highly compressible solutions [00:47:28].
  • Neural networks are highly expressive models with a soft inductive bias: they can fit random training noise when they must, but prefer simpler, more generalizable patterns when the data is structured [00:47:56].
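
For reference, one textbook form of such a bound is the countable-hypothesis (Occam) version below. It is offered as an illustration and is not necessarily the exact statement shown in the talk. The assumptions are: a loss bounded in an interval of width \Delta, n i.i.d. training samples, and a prior P over hypotheses fixed before seeing the data. With probability at least 1 - \delta, the bound holds simultaneously for every hypothesis h:

```latex
R(h) \;\le\; \hat{R}(h) \;+\; \Delta \sqrt{\frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{2n}}
```

Here R(h) is the expected (test) loss and \hat{R}(h) is the training loss. If the prior is defined through a prefix-free code with length K(h) bits, so that P(h) = 2^{-K(h)}, then \ln(1/P(h)) = K(h)\ln 2, and the penalty shrinks as the solution becomes more compressible.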

Talk 5: Data-Constrained Pre-Training Scaling Laws [00:50:19]

  • Ku models a data-scarce future in which human-written internet text grows at only 3% annually while compute spending rises 4x to 5x per year, so compute per data point increases roughly 4x year-over-year [00:52:42].
  • To study this regime, the researchers took a 200-million-token subset of the DCLM web corpus to see how models scale when data is fixed but compute is effectively unlimited [00:55:05].
  • Standard multi-epoch training quickly leads to severe overfitting [00:56:04]. Applying aggressive weight decay, tuned about 30 times higher than in normal compute-optimal setups, reveals a clean scaling law with a loss asymptote of 3.43 [00:56:28].
  • Ensembling smaller models yields large data-efficiency gains over scaling a single dense network [00:57:43]. Combining ensembling with regularization raises the effective data multiplier to as much as 5x [01:00:11].
  • Most of these training gains can be retained at inference time by distilling an 8-model ensemble into a single dense 300-million-parameter model, which preserves 83% of the loss improvement [01:02:44].
  • When applied to domain-specific continued pre-training (CPT) on a mathematical corpus, these data-efficiency methods matched the performance of training on the full 73-billion-token dataset while using only 4 billion tokens, a 17x sample-efficiency win [01:04:45].
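
One reason ensembling helps can be shown with a small numeric sketch. It assumes the ensemble averages the members' predicted probabilities, which is a common choice but not confirmed as the method used in the talk, and the three distributions below are made-up values over a three-token vocabulary.

```python
import math
from typing import List, Sequence


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the correct token."""
    return -math.log(probs[label])


def ensemble_average(member_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the members' probability vectors position by position."""
    k = len(member_probs)
    return [sum(col) / k for col in zip(*member_probs)]


# Hypothetical next-token distributions from three small models.
members = [
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],
    [0.55, 0.15, 0.30],
]
label = 0  # index of the correct token in this toy example

mean_member_loss = sum(cross_entropy(p, label) for p in members) / len(members)
ensemble_loss = cross_entropy(ensemble_average(members), label)
```

Because the negative logarithm is convex, Jensen's inequality guarantees that `ensemble_loss` is never larger than `mean_member_loss`. The gap grows when the members disagree, which is why diverse small models can beat a single model of similar average quality.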

The Reference Vault

4. Data & Figures

| Data Point | Value | Context | Timestamp |
| --- | --- | --- | --- |
| Paper Club Applicants | 1,000+ | Total digital applications received for the first cohort meetup. | [00:00:21] |
| Cohort Target Volume | ~100 | Target attendance ceiling enforced to protect collaborative density. | [00:00:31] |
| Historic W16 Yield | 140 Cos / 10-15 Unicorns | The conversion performance metrics tracking the historic batch. | [00:01:55] |
| Bay Area Geographic Split | 50% / 50% | Distribution split of AI researchers between SF and Peninsula hubs. | [00:02:54] |
| SSD Cache Hit Estimate | 80% - 90% | Success rate of the draft layer predicting target model verification outcomes. | [00:14:17] |

5. Core Frameworks & Mental Models

  • Inference-Bounded Peak Intelligence: A framework stating that when performance scales with runtime compute, system intelligence is bounded by output tokens per second [00:06:03].
  • Model Predictive Control (MPC) Factorization: Splitting a robotic controller into a multi-step action proposer and an independent world dynamics model, so the dynamics model can be re-trained after a physical change without touching the action proposer [00:21:57].
  • Representational Collapse: An optimization failure in which a world model maps high-dimensional inputs to the same point in order to minimize loss, which breaks downstream control [00:36:11].
  • SIGG Regularization: A joint embedding architecture method that pushes high-dimensional latent vectors toward an un-collapsed isotropic Gaussian distribution by checking cheap one-dimensional slices along many directions [00:39:04].
  • Flat vs. Sharp Minima Compressibility: A framework showing that increasing the parameter count exponentially expands flat regions of the loss landscape relative to sharp ones. Because flat minima take far fewer bits to encode, overparameterized systems naturally find highly compressible solutions that generalize [00:47:07].
  • Infinite-Compute Loss Asymptotes: An analytical tool that fits a power law to a training recipe's scaling curve and reads off its horizontal asymptote, which quantifies the best loss the recipe could reach if compute were infinite [00:54:12].
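
The asymptote idea in the last item can be made concrete with a small sketch. It assumes the loss follows an exact power law L(N) = E + A * N^(-alpha) and that three measurements are taken at geometrically spaced sizes N, r*N, and r*r*N. The function names are hypothetical, and real fits would use many noisy points and a least-squares fit rather than this closed form.

```python
def power_law(n: float, e: float, a: float, alpha: float) -> float:
    """Loss at size n for the assumed form L(n) = e + a * n**(-alpha)."""
    return e + a * n ** (-alpha)


def asymptote_from_three(l0: float, l1: float, l2: float) -> float:
    """Recover the asymptote E from losses at sizes N, r*N, and r*r*N."""
    q = (l1 - l2) / (l0 - l1)  # equals r**(-alpha) for an exact power law
    c = (l0 - l1) / (1.0 - q)  # equals A * N**(-alpha)
    return l0 - c              # L(N) minus the reducible part leaves E


# Synthetic data: A and alpha are invented, and E is set to 3.43 only to
# echo the figure quoted above. These are not measurements from the talk.
sizes = [1e6, 4e6, 1.6e7]
losses = [power_law(n, e=3.43, a=50.0, alpha=0.3) for n in sizes]
estimated_e = asymptote_from_three(*losses)
```

On exact power-law data the closed form recovers the asymptote up to floating-point error. The value of the asymptote is that it compares recipes by where they end up, not by how they perform at any single compute budget.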

6. Anecdotes

  • The Woodside Breakfast: The host describes a breakfast meeting with Harj Bhujoua in Woodside that sparked the decision to repurpose YC's underutilized Pioneer office as a dedicated community hub for Peninsula AI talent [00:01:42].
  • The Winter 16 OpenAI Interactions: The host recalls sitting alongside early OpenAI engineers during his W16 batch. The team would routinely ask non-AI founders about daily software challenges because they were still searching for practical, high-value real-world vectors to guide their initial research focus [00:02:16].
  • The "Scooped" Diffusion Policy Discovery: The host recounts spending a month developing a robotic video control loop based on the Diffusion Policy paper, only to realize midway through that Google DeepMind had quietly executed and published the exact same insight six months prior [00:17:39].
  • The Broken Left Ankle Walker: Stannis illustrates DMPC's structural agility through a simulated walking agent. When the walker's left ankle is suddenly broken, the factorized system updates its internal dynamics layer with fresh play data, restoring full movement without changing the action proposer [00:28:16].
  • The T-Block Teleportation Test: Isaac Ward shows how Latent World Models handle visual anomalies. When an object such as a T-shaped block is instantly teleported across the frame, the system's internal error metric spikes, showing that it can flag environmental surprises [00:41:57].
  • The "Holding Out Tests" Advisory: Ku notes that their downstream evaluations were completely hidden from view until the absolute end of the project, strictly following advice from senior advisors to protect the validation data from accidental human optimization bias [01:04:14].

7. References & Recommendations

Academic Papers & Literature

  • Speculative Speculative Decoding (SSD): Presented by Tanishk; introduces parallelized drafting and verification mechanics to accelerate large-scale inference environments [00:03:45].
  • Diffusion Policy Paper: Cited as the foundational structural blueprint for multi-step horizon control paths within modern robotic actuation [00:17:39].
  • Richard Sutton's NeurIPS 1990 Paper: Cited by Isaac Ward to prove that black-box environment prediction systems are foundational reinforcement learning concepts rather than new modern paradigms [00:32:24].
  • Latent World Model (JEPA): Formulated under Yann LeCun's group to perform fast action searches entirely inside latent configurations using a SIGG regularizer [00:29:50].
  • Deep Learning is Not So Mysterious or Different: Authored by Andrew Gordon Wilson; uses PAC-Bayes frameworks and soft inductive biases to unpack overparameterization mysteries [00:43:54].
  • Lotfi et al.: Academic reference reporting a negative correlation between the number of bits needed to encode the training set and total parameter count [00:46:39].
  • Chinchilla Scaling Laws: Classical reference detailing compute-optimal resource ratios across total model parameters and token counts [00:52:36].

Companies & Corporate Entities

  • OpenAI: Identified as an early institutional peer collaborator within the Winter 16 YC ecosystem framework [00:02:16].
  • Google DeepMind: Flagged as a proximate geographic anchor and primary employer for core technical robotic staff [00:03:15].
  • Tesla / xAI / Thinking Machines: Noted as prominent engineering organizations located outside San Francisco proper [00:03:15].
  • Q Labs: Introduced as the corporate vehicle driving commercial generalization and PAC-Bayes application pipelines [00:43:54].

People

  • Chris Manning: Cited as a benchmark for citation volume in language modeling research [00:01:18].
  • Sam Altman: Noted as leading Y Combinator during the Winter 16 batch [00:02:16].
  • Tri Dao & Aravind Srinivas: Acknowledged as research collaborators on Tanishk's speculative inference engine project [00:04:01].
  • Yann LeCun: Referenced as the figure behind the funding push to build explicit world models rather than end-to-end architectures [00:30:55].
  • Andrew Gordon Wilson: Cited for the statistical arguments explaining why scaled-up neural networks still generalize [00:43:54].
  • Percy Liang & Christopher Ré: Acknowledged as the university lab directors guiding the data-constrained pre-training research [00:51:24].

8. The Bottomline (by AI)

As machine learning transitions from raw pre-training expansion to localized architectural refinement, token throughput is quickly emerging as the primary bottleneck defining peak system intelligence. Navigating upcoming human data caps requires shifting industrial compute away from brute end-to-end training and toward factorized diffusion controllers, parallelized inference engines, and latent-space world models. For developers and technical leaders, incorporating robust, un-collapsing joint embedding frameworks and aggressive ensembling methods offers a clear path to unlocking massive sample efficiency gains as public web data hits its natural limits.

Data & Figures (continued)

| Data Point | Value | Context | Timestamp |
| --- | --- | --- | --- |
| SSD Cache Hit Estimate | 80% - 90% | Success rate of the draft layer predicting target model verification outcomes. | [00:14:17] |
| Parallel Inference Baseline | 300 tokens/sec | Output speed achieved for Llama-3-70B using 4 H100 GPUs. | [00:17:12] |
| LeCun Group Funding | $1.03 Billion | Total capital raised in March to train foundational world models. | [00:30:55] |
| Latent Model Profile | 15M params / <24GB VRAM | Maximum compute bounds required to host the Latent World Model framework. | [00:41:52] |
| Latent Operational Win | 50x | Performance gain over pixel-rendering model competitors. | [00:41:39] |
| Public Internet Expansion | 3% per year | Annual growth rate of human-generated text on the public internet. | [00:52:48] |
| Pre-training Compute Inflow | 4x - 5x per year | Annual growth rate of industrial compute allocations. | [00:52:54] |
| Per-Token Compute Escalation | ~4x year-over-year | Rate of change defining compute spend per individual token. | [00:53:01] |
| Simulated Data Cap | 200 Million tokens | Sub-sample of DCLM data utilized to construct data-constrained scaling laws. | [00:55:05] |
| Aggressive Weight Decay | 30x normal | Hyperparameter multiplier over standard compute-optimal configurations. | [00:56:53] |
| Regularized Asymptote | 3.43 | The calculated best possible loss performance under infinite compute limits. | [00:57:17] |
| Multi-Model Joint Win | 5x | Total effective data multiplier gained by combining ensembling and regularization. | [01:01:31] |
| Distillation Retention | 83% | Amount of ensemble loss reduction kept when compressed to a dense 300M model. | [01:03:07] |
| CPT Math Data Profile | 4B vs 73B tokens | Restricted subset size matched against total corpus volume via smart epoching. | [01:04:51] |
| CPT Sample Efficiency Win | 17x | Sample efficiency gain achieved during mathematical model adaptation. | [01:05:11] |