Technology / April 16, 2026 / 17 min read / youtu.be

Advancing to AI's Next Frontier: Insights From Jeff Dean and Bill Dally | NVIDIA Developer


"I feel like the models have gotten way way better at sort of problems with verifiable rewards so things like math and coding... you saw like our Gemini model entered the IMO contest and got a gold medal." - Jeff Dean [00:01:00]

"we basically can achieve um you know two millimeters per nanosecond basically the time of flight over over wires on the chip and get from one corner of the chip to another corner of the chip in in 30 nanoseconds." - Bill Dally [00:05:00]


"inference is the job now it's easily you know 90% of of the you know power in data centers is going into inference" - Bill Dally [00:16:46]

"it's a thousand times more energy to read one NVFP4 number from external memory than it is to do a multiply-add... the key thing you can do to reduce energy is to just don't move data." - Bill Dally [00:33:22]

"I think the right sort of way to give you the illusion of attending to say a trillion tokens instead of a million is to have staged forms of much lighter weight retrieval mechanisms..." - Jeff Dean [00:22:30]

"educational outcomes I think are one to two standard deviations higher when you have a personalized tutor than when you have a more group set everybody can have a personalized AI" - Jeff Dean [00:50:12]


Speakers & Credentials

  • Jeff Dean: Chief Scientist at Google (Google DeepMind and Google Research). Pioneer in distributed systems, machine learning infrastructure, and neural network architectures.
  • Bill Dally: Chief Scientist and Senior Vice President of Research at NVIDIA. Former Professor at Stanford University, author of foundational textbooks on interconnection networks, and leading authority on supercomputing and hardware architecture.

1. Executive Summary

  • The transition from models solving simple middle-school math to achieving gold medals in the International Mathematical Olympiad (IMO) and ICPC marks a paradigm shift in AI's capacity to master domains with verifiable reward structures.
  • The hardware bottleneck has fully transitioned from training to inference, with inference now consuming approximately 90% of total data center power, necessitating radical architectural shifts optimized exclusively for decoding stages and low-latency interaction.
  • To enable autonomous AI agents operating at speeds 50x faster than humans, hardware design is moving towards "speed-of-light" latency, achieving sub-30 nanosecond on-chip routing and abandoning heavy digital signal processing on off-chip PHYs to hit 10,000–20,000 tokens per second per user.
  • The physical limit of energy efficiency is dictated by data movement; executing an NVFP4 multiply-add operation costs a mere 10 femtojoules, whereas fetching that data from HBM4 costs 1,000 times more energy, fundamentally forcing a transition toward stacked 3D DRAM architectures.
  • Software and tooling paradigms face an impending Amdahl's Law crisis: AI agents execute logic instantly, but legacy human-speed tools (like C-compiler startup times or spreadsheet macros) will soon become the primary latency bottlenecks in continuous learning and agentic loops.
  • AI is revolutionizing its own underlying silicon layer, with deep reinforcement learning and custom LLMs replacing hundreds of human months of manual placement, routing, and verification, hinting at a future recursive loop where AI agents negotiate the next generation of chip interfaces autonomously.

2. Chronological Table of Contents

  • [00:00:31] - The Evolution of ML Capabilities & Verifiable Rewards
  • [00:03:28] - Latency Dynamics & "Speed of Light" Chip Routing
  • [00:08:08] - Natural Language Neural Architecture Search (NL-NAS)
  • [00:11:44] - Chinchilla Scaling, Synthetic Data, & Data Exhaustion
  • [00:16:00] - Inference Dominance & Workload Bifurcation
  • [00:19:34] - Evolving Attention Models & Trillion-Token Context
  • [00:23:21] - AI-Driven Hardware Design & Agent Swarms
  • [00:32:18] - The Physics of Energy, Sparsity, & Data Movement
  • [00:37:16] - Silicon Code-Design & Dynamic Graph Operations
  • [00:45:14] - Supercomputer Topology: Torus vs. Switch Fabrics
  • [00:48:20] - Societal Impact: Healthcare, Education, & Scale

3. Detailed Thematic Summary

The Evolution of ML Capabilities & Verifiable Rewards [00:00:31]

  • Rapid Maturation in Verifiable Logic: Jeff Dean highlights a massive leap in AI capabilities regarding problems with explicitly verifiable rewards, such as deep mathematics and competitive coding [00:01:00].
  • The Baseline Shift: Just 3-4 years ago, the ML community celebrated models successfully solving simple 8th-grade word problems (e.g., "Fred has four rabbits") with merely a 40% to 50% success rate [00:01:07].
  • Current State of the Art: Google’s Gemini model has radically accelerated past basic logic, recently entering the IMO (International Mathematical Olympiad) and securing a gold medal [00:01:33], alongside winning gold in the ICPC competitive programming contest [00:01:40].
  • Agent-Based Autonomy: The workflow is shifting from synchronous single-prompt tasks to long-running asynchronous agents that operate independently over hours or days [00:02:16], correcting their own errors and executing extensive multi-step workflows without human intervention.
  • AlphaGo-Style Pre-training: Dean anticipates a shift away from models passively pre-training on static internet data. The future points toward a paradigm similar to AlphaGo [00:14:12], where models actively take actions in simulated environments and converse with each other to improve.

Latency Dynamics & "Speed of Light" Chip Routing [00:03:28]

  • The Inference Trade-off Curve: Bill Dally maps out the latency vs. throughput curve. High throughput maximizes tokens per second per dollar/watt, but optimizing for interactivity demands smaller batch sizes, exposing communication latency as the primary bottleneck [00:03:44].
  • Pushing Physical Limits: NVIDIA is pursuing what they term "speed of light" architecture. Typical LLMs utilize 50 to a couple hundred layers [00:04:28], requiring constant on-chip communication between feed-forward and attention stages.
  • On-Chip Routing Breakthroughs: By utilizing static scheduling to eliminate routing overhead, queues, and arbitration, hardware can now transmit data at 2 millimeters per nanosecond [00:05:00], traversing from corner-to-corner of a chip in merely 30 nanoseconds [00:05:07], a dramatic reduction from historical delays of several hundred nanoseconds.
  • Off-Chip PHY Trade-offs: Historically, external communication PHYs were optimized strictly for maximum bandwidth, relying on heavy digital signal processing and Forward Error Correction (FEC) to recover noisy signals [00:05:17]. Dally notes that by intentionally backing off bandwidth, from 400 gigabits per second per pair down to 200 Gbps [00:05:45], engineers can eliminate the DSP entirely. The voltage is simply detected on the wire, reducing latency to a serialization overhead of a few clock cycles.
  • Historical Reference Point: This reduction mirrors Dally's work at Cray in 2006 on the Black Widow supercomputer [00:06:04], where the router latency was kept below 50 nanoseconds.
  • Target Metrics: These architectural evolutions aim to support massive LLM inferences at rates between 10,000 to 20,000 tokens per second per user [00:06:10].
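
The latency figures above can be checked with simple arithmetic. The sketch below is a back-of-envelope calculation, not NVIDIA's model: the die size (roughly the 26 mm x 33 mm reticle limit), the Manhattan wire path, and the choice of 100 layers are assumptions for illustration; only the 2 mm/ns wire speed and the token rates come from the talk.

```python
# Sanity-check the on-chip time-of-flight and per-token latency figures.

WIRE_SPEED_MM_PER_NS = 2.0  # from the talk: ~2 mm per nanosecond on-chip


def flight_time_ns(distance_mm: float,
                   speed_mm_per_ns: float = WIRE_SPEED_MM_PER_NS) -> float:
    """Pure wire delay: no queues, arbitration, or routing decisions."""
    return distance_mm / speed_mm_per_ns


def per_layer_budget_ns(tokens_per_second: float, num_layers: int) -> float:
    """Time available per layer if each token passes through every layer in sequence."""
    ns_per_token = 1e9 / tokens_per_second
    return ns_per_token / num_layers


# Assumption: a die near the 26 mm x 33 mm reticle limit, with wires routed
# horizontally and vertically (Manhattan distance between opposite corners).
DIE_WIDTH_MM, DIE_HEIGHT_MM = 26.0, 33.0
corner_to_corner_ns = flight_time_ns(DIE_WIDTH_MM + DIE_HEIGHT_MM)

# Assumption: 100 layers, inside the 50-to-200 range quoted in the talk.
budget_ns = per_layer_budget_ns(tokens_per_second=20_000, num_layers=100)

print(f"corner to corner: {corner_to_corner_ns:.1f} ns")
print(f"per-layer budget at 20k tokens/s: {budget_ns:.0f} ns")
```

Under these assumptions the corner-to-corner flight time lands at about 30 ns, consistent with the quoted figure, and the per-layer budget at 20,000 tokens per second is about 500 ns. A crossing that costs several hundred nanoseconds would consume most of that budget, which is why the reduction matters.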

AI-Driven Chip Design & Recursive Metalearning [00:23:21]

  • Natural Language Neural Architecture Search (NAS): Building on the 2017 Google Brain meta-learning and NAS work [00:07:39], Jeff Dean highlights the shift from writing code to define search spaces to simply prompting agents in natural language to discover new distillation algorithms or architectures [00:08:08].
  • Google's AlphaChip: RL approaches are actively deployed in production TPU design. The Nature-published AlphaChip framework [00:23:43] has optimized placement and routing for multiple generations of Google's Tensor Processing Units.
  • NVIDIA NVCell Automation: Transitioning a standard cell library (2,500 to 3,000 cells) to a new semiconductor process node traditionally required a team of 8 human engineers working for 10 months (80 person-months) [00:24:26]. NVIDIA's RL tool, NVCell (phonetically referred to as "NBL") [00:24:38], now completes this exact task overnight on a single GPU, outputting cell designs that match or beat human efficiency in power and space.
  • PrefixRL: Tackling the carry-lookahead adder problem—a staple of computer science since the 1950s—PrefixRL treats circuit design like an Atari game [00:25:13]. It produced highly irregular layouts that bypassed traditional human symmetry but achieved 20-30% better performance metrics [00:25:34] against size and power constraints.
  • Proprietary LLM Engineering Mentors: NVIDIA trained internal models (Chip Nemo and Bug Nemo) [00:25:50] strictly on its proprietary RTL codebase, historical GPU architectures, and internal documentation. Chip Nemo acts as an infinitely patient mentor for junior engineers asking granular questions about legacy hardware like texture units, while Bug Nemo automates module attribution for active bug tracking.

Inference Dominance & Workload Bifurcation [00:16:00]

  • The Power Imbalance: Bill Dally emphatically states that inference is no longer an afterthought; by his estimate it now accounts for roughly 90% of the power consumed in data centers [00:16:46].
  • Stage-Specific Provisioning: Training and inference have divergent hardware needs, but inference itself is bifurcating. The "Prefill" stage heavily resembles training [00:17:38], acting mathematically dense and parallel. In contrast, the "Decode" stage relies on highly sequential, skinny matrix-vector operations that are severely starved for memory bandwidth.
  • Future Hardware SKU Granularity: Dally predicts a market split where data centers deploy three distinct silicon profiles: one optimized for Training/Prefill, one optimized for Attention Decode, and one optimized for Feed-Forward Decode [00:18:20].
  • Algorithmic Mitigation: Techniques like speculative decoding and diffusion LLMs help alleviate sequential constraints by decoding chunks (8 to hundreds of tokens) simultaneously [00:18:42] rather than relying on strict 1-by-1 generation.
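
The prefill/decode split above comes down to arithmetic intensity: how many operations are performed per byte fetched. The sketch below is a simplified roofline-style estimate for a single square weight matrix. The model width (8192), 4-bit weights, and 16-bit activations are illustrative assumptions, and `arithmetic_intensity` is a hypothetical helper, not an API from any library.

```python
def arithmetic_intensity(d_model: int, n_tokens: int,
                         bytes_per_weight: float = 0.5,
                         bytes_per_activation: float = 2.0) -> float:
    """FLOPs per byte moved when one d_model x d_model matrix is applied to n_tokens."""
    # One multiply and one add per weight, per token.
    flops = 2 * d_model * d_model * n_tokens
    # The weight matrix is streamed from memory once per pass.
    weight_bytes = d_model * d_model * bytes_per_weight
    # Input activations are read and output activations are written.
    activation_bytes = 2 * d_model * n_tokens * bytes_per_activation
    return flops / (weight_bytes + activation_bytes)


D_MODEL = 8192  # assumed model width, for illustration only

decode = arithmetic_intensity(D_MODEL, n_tokens=1)        # one token at a time
speculative = arithmetic_intensity(D_MODEL, n_tokens=8)   # verify 8 drafted tokens per pass
prefill = arithmetic_intensity(D_MODEL, n_tokens=2048)    # whole prompt in one pass

for name, value in [("decode", decode), ("speculative x8", speculative), ("prefill", prefill)]:
    print(f"{name:>15}: {value:8.1f} FLOPs/byte")
```

With these assumptions, single-token decode performs only about 4 operations per byte fetched, so it is limited by memory bandwidth, while prefill reuses each weight across thousands of tokens. Processing 8 tokens per pass, as speculative decoding does during verification, raises intensity nearly 8x without changing the hardware.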

Attention Constraints & Trillion-Token Contexts [00:19:34]

  • Quadratic Scaling Limits: While standard quadratic attention yields the highest quality responses, it becomes computationally devastating at scale. Modern optimizations seek to lower the constant factor by dividing context into smaller clusters or chunks (e.g., 128 tokens) [00:21:17] and only executing full attention on the relevant localized blocks.
  • The Trillion-Token Vision: Dean argues that one million tokens is insufficient; real-world capability demands models capable of "attending" to a user's entire life of data or the entire internet—upward of 1 trillion tokens [00:22:12].
  • Hierarchical Retrieval Architectures: To simulate trillion-token windows without quadratic computation doom, systems will use staged funnels: utilizing lightweight retrieval to winnow 1 trillion tokens down to 10,000 highly relevant documents (yielding 10 to 20 million tokens) [00:22:44], before passing that distilled batch into the final deep-attention context window of around 1 million tokens [00:22:54].
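
The staged funnel above can be sketched as a sequence of progressively costlier scorers, each keeping fewer candidates, followed by packing into a fixed token budget. This is a toy illustration, not the system Dean describes: the two scorers, the whitespace token count, and the names `staged_funnel`, `keyword_overlap`, and `overlap_density` are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def keyword_overlap(query: str, doc: str) -> float:
    """Cheap first-stage score: number of distinct query words found in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_density(query: str, doc: str) -> float:
    """Stand-in for a costlier second-stage scorer: overlap normalized by length."""
    words = doc.split()
    return keyword_overlap(query, doc) / len(words) if words else 0.0


def staged_funnel(query: str, corpus: Sequence[str],
                  stages: Sequence[Tuple[Scorer, int]],
                  token_budget: int) -> List[str]:
    """Narrow the corpus stage by stage, then pack survivors into the context budget."""
    candidates = list(corpus)
    for scorer, keep in stages:
        ranked = sorted(candidates, key=lambda d: scorer(query, d), reverse=True)
        candidates = ranked[:keep]
    context, used = [], 0
    for doc in candidates:
        cost = len(doc.split())  # whitespace tokens, a crude stand-in for a real tokenizer
        if used + cost <= token_budget:
            context.append(doc)
            used += cost
    return context


CORPUS = [
    "gpu memory bandwidth limits decode",
    "recipes for sourdough bread",
    "decode stage is memory bound on gpu",
    "torus networks connect adjacent chips",
]
QUERY = "gpu decode memory"
STAGES = [(keyword_overlap, 3), (overlap_density, 2)]

print(staged_funnel(QUERY, CORPUS, STAGES, token_budget=12))
```

In the scenario from the talk, the stage sizes would be on the order of 1 trillion tokens in, 10,000 documents after lightweight retrieval, and about 1 million tokens in the final attention window; only that last stage pays the quadratic attention cost.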

The Physics of Energy, Sparsity, & Data Movement [00:32:18]

  • The 1,000x Energy Penalty: Dally breaks down the extreme energy disparity in processing. Executing a modern NVFP4 multiply-add operation costs approximately 10 femtojoules [00:33:08]. However, fetching the required 4.5 bits from external HBM4 memory costs ~3 to 4 picojoules per bit [00:33:15], equating to roughly 15 picojoules total. Reading the operand from external memory therefore costs on the order of 1,000 times more energy than doing the math itself [00:33:22].
  • SRAM Alternatives: Conversely, reading from on-chip SRAM takes roughly 10 femtojoules [00:33:38], matching the compute cost and bypassing the external fetch penalty.
  • 3D Stacked DRAM (The Pachinko Machine): To fix this without the spatial constraints of SRAM, NVIDIA is aggressively pursuing vertically stacked DRAM positioned directly atop the compute die [00:34:41]. This enables processors to sense bits and pull them straight down into the GPU with an order of magnitude higher bandwidth and an order of magnitude lower energy per bit.
  • The Irregularity Tax of Sparsity: While 2:1 structured sparsity is widely deployed [00:35:34], pursuing deeper, more aggressive algorithmic sparsity destroys the physical regularity of data marching smoothly through the pipeline. Managing highly irregular data requires complex control-flow routing, often cannibalizing the efficiency gains you sought to achieve.
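
The energy figures above can be reproduced directly. In the sketch below, the constants are the values quoted in the talk; the reuse model in `energy_per_mac_fj` (one HBM fetch amortized over several SRAM-fed multiply-adds) is an added assumption to illustrate why avoiding data movement matters, not something stated by the speakers.

```python
# All energies in femtojoules (1 pJ = 1000 fJ).
MAC_ENERGY_FJ = 10.0               # NVFP4 multiply-add, from the talk
SRAM_READ_FJ = 10.0                # on-chip SRAM read, from the talk
BITS_PER_VALUE = 4.5               # bits fetched per NVFP4 number, from the talk
HBM_FJ_PER_BIT = (3000.0, 4000.0)  # 3 to 4 pJ per bit, from the talk


def hbm_read_fj(fj_per_bit: float, bits: float = BITS_PER_VALUE) -> float:
    """Energy to fetch one NVFP4 value from external HBM."""
    return bits * fj_per_bit


def energy_per_mac_fj(reuse: int, fj_per_bit: float = 3500.0) -> float:
    """Assumed model: one HBM fetch is reused `reuse` times, each use read from SRAM."""
    return hbm_read_fj(fj_per_bit) / reuse + SRAM_READ_FJ + MAC_ENERGY_FJ


low_ratio = hbm_read_fj(HBM_FJ_PER_BIT[0]) / MAC_ENERGY_FJ
high_ratio = hbm_read_fj(HBM_FJ_PER_BIT[1]) / MAC_ENERGY_FJ

print(f"HBM fetch vs multiply-add: {low_ratio:.0f}x to {high_ratio:.0f}x")
for reuse in (1, 10, 100, 1000):
    print(f"reuse={reuse:>4}: {energy_per_mac_fj(reuse):8.1f} fJ per multiply-add")
```

Taken literally, the quoted numbers give a ratio between 1,350x and 1,800x, so "1,000 times" is best read as an order-of-magnitude statement. Under the assumed reuse model, the fetch cost only stops dominating once each value is reused hundreds of times.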

System-Level Evolution: Tooling Bottlenecks & Network Topologies [00:31:02]

  • Internal Foresight: Dally notes that NVIDIA creates large foundational models internally (like Megatron and GR00T) [00:09:52] specifically to test scaling limits, allowing hardware engineers to anticipate compute constraints years before customers experience them.
  • Amdahl’s Law for Agent Tooling: As agent reasoning speeds up, the tools they interact with remain built for humans. An AI acting 50 times faster than a human [00:31:12] will become bottlenecked by mundane latencies like the startup time of a C compiler or a spreadsheet macro.
  • The "CUDA Tax" Hardware Strategy: Both Google (TPU) and NVIDIA utilize a strategy of baking speculative, highly experimental hardware optimizations into silicon 2-4 years ahead of expected ML trends. If successful, this "tax" in silicon area generates 10x-20x acceleration [00:39:17]; if not, the impact is minimal.
  • Dynamic Neural Topologies: As models grow and prune organically, hardware requires a minimum operation block size—estimated around 10,000 operations [00:41:18]—to maintain processing efficiency and avoid falling back to CPU-like utilization patterns.
  • Network Architectures: Direct-connect 2D/3D Torus networks (legacy of the Cray T3D) [00:46:12] excel at highly localized ML workloads spanning adjacent chips. Conversely, Mixture of Experts (MoE) workloads, which scatter queries widely across distant expert clusters, demand fully switched hierarchies (Fat Trees, Dragonfly networks) [00:46:35] to prevent multi-hop degradation.
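
The multi-hop degradation mentioned in the network bullet above can be quantified for a torus. The sketch below computes the average hop count between uniformly random chip pairs in a k-ary torus with wraparound links; the network sizes are illustrative assumptions, and uniform random traffic is used here only as a rough proxy for MoE-style scatter.

```python
def mean_ring_hops(k: int) -> float:
    """Average shortest-path distance on a ring of k nodes (wraparound allowed)."""
    return sum(min(d, k - d) for d in range(k)) / k


def mean_torus_hops(k: int, dims: int) -> float:
    """Average hops in a torus with `dims` dimensions of k nodes each.

    Dimensions are independent, so the per-dimension averages add up.
    """
    return dims * mean_ring_hops(k)


NEIGHBOR_HOPS = 1  # nearest-neighbor traffic always takes a single hop

for k in (4, 8, 16):
    chips = k ** 3
    print(f"{k}x{k}x{k} torus ({chips:>4} chips): "
          f"random traffic averages {mean_torus_hops(k, 3):.1f} hops, "
          f"neighbor traffic {NEIGHBOR_HOPS} hop")
```

Nearest-neighbor traffic stays at one hop regardless of size, while random traffic grows linearly with the side length k. A switched fabric instead bounds the hop count by the number of switch tiers, which is the property scattered expert routing benefits from.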

The Reference Vault

4. Data & Figures

Data Point | Value | Context | Timestamp
8th Grade Math Success Rate (Historical) | 40% - 50% | Success rate of leading models solving logic math just 3-4 years ago. | [00:01:07]
Typical LLM Layer Count | 50 - 200 Layers | The standard depth of modern LLMs requiring constant on-chip communication between stages. | [00:04:28]
On-Chip Signal Flight Speed | 2 millimeters / nanosecond | Target hardware routing velocity, free of queueing and arbitration delays. | [00:05:00]
Total Chip Traversal Time | 30 nanoseconds | Expected time to move data from one corner of a GPU/TPU to another. | [00:05:07]

5. Core Frameworks & Mental Models

  • Natural Language Neural Architecture Search (NL-NAS): Rather than writing code to define a search space, researchers now prompt agents in natural language to hypothesize, test, filter, and validate new architectures and distillation algorithms autonomously [00:08:08].
  • The Silicon "CUDA Tax" (Hardware Hedging): Hardware architects deliberately over-provision experimental silicon pathways based on ML trend predictions 2-4 years out. While this risks dead logic gates, a successful prediction yields a 10x-20x acceleration for emerging algorithms before standard hardware can pivot [00:39:17].
  • The Pachinko Data Model (Stacked 3D DRAM): To solve the 1,000:1 energy penalty of lateral data movement, architects conceptualize memory stacked vertically directly atop the processor die. Data bits are sensed and dropped straight down into compute logic—like a Pachinko machine—recovering an order of magnitude in energy efficiency and bandwidth [00:34:41].
  • Amdahl's Law for Autonomous Agents: In systems engineering, overall speedup is limited by the serial part of the system. In the age of agents acting 50x faster than humans, traditional software interactions (like C-compilers or file opening times) transition from negligible background processes to the absolute ceiling on productivity loops [00:31:02].
  • Hierarchical Attention Funnels: Quadratic attention cannot be applied to 1 trillion tokens within any realistic compute budget. The framework therefore splits the pipeline: lightweight, massive-scale retrieval funnels 1 trillion tokens down to roughly 10,000 relevant documents, leaving only the most vital 1 million tokens for the computationally heavy deep-attention pass [00:22:30].
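
The Amdahl's Law framing for agents above can be made concrete with the standard formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of loop time that gets accelerated and s is the acceleration factor. In the sketch below, the 50x factor comes from the talk; the 95% / 5% split between reasoning time and tool wait time is a hypothetical workload chosen for illustration.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only `accelerated_fraction` of the time gets `factor` faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / factor)


def tool_share_after(accelerated_fraction: float, factor: float) -> float:
    """Fraction of the new, shorter loop spent waiting on unaccelerated tools."""
    serial = 1.0 - accelerated_fraction
    return serial / (serial + accelerated_fraction / factor)


AGENT_FACTOR = 50.0        # from the talk: agents acting ~50x faster than humans
REASONING_FRACTION = 0.95  # hypothetical: 95% thinking, 5% waiting on tools

speedup = amdahl_speedup(REASONING_FRACTION, AGENT_FACTOR)
share = tool_share_after(REASONING_FRACTION, AGENT_FACTOR)

print(f"overall speedup: {speedup:.1f}x (ceiling {1 / (1 - REASONING_FRACTION):.0f}x)")
print(f"tool wait share of the loop: {share:.0%}")
```

Under this assumed split, a 50x faster agent delivers only about a 14.5x faster loop, and tool latency that was 5% of the human workflow becomes roughly 72% of the agent's. The ceiling of 20x holds no matter how fast the agent reasons.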

6. Anecdotes

  • The "Fred Has Four Rabbits" Era: Jeff Dean reminisces on the industry just 3-4 years ago, recalling how wildly excited engineers were when models could correctly solve basic 8th-grade text problems 40-50% of the time, framing a stark contrast to today's IMO-winning mathematics capabilities [00:01:07].
  • PrefixRL Playing Atari: Bill Dally highlights PrefixRL's approach to designing carry-lookahead adders (a problem studied since the 1950s). By treating circuit design like scoring points in an Atari video game, it found highly asymmetric, non-human designs that outstripped 70 years of engineering intuition by 20-30% [00:25:13].
  • The Infinitely Patient "Nemo" Mentor: NVIDIA senior engineers previously wasted countless hours explaining the intricacies of legacy Texture Units to junior hires. Now, by querying the "Chip Nemo" LLM—trained entirely on proprietary internal documents and logic trees—juniors receive infinite follow-up explanations, protecting the bandwidth of principal engineers [00:25:50].
  • The Calculator Analogy: To address fears of AI in education, Jeff Dean recalled the historical introduction of calculators in math classes. Instead of destroying students' ability to learn math, it removed calculation bottlenecks and allowed classes to move up to higher-level concepts more quickly, establishing a blueprint for how AI tutors should be integrated [00:51:26].
  • The GTC Dessert Incident (The Angel on the Shoulder): Discussing AI Health coaches, Bill Dally jokes that upon arriving at the GTC lounge, all the real food was gone but the desserts remained. He noted his "AI Angel" health coach would have successfully talked him out of his resultant dessert-only lunch, mapping micro-decisions to long-term biometric outcomes [00:53:16].
  • Scaling From a T-Mobile Store: Jeff Dean reflects on company culture, noting how Google scaled from an operation wedged above a T-Mobile store in Palo Alto to over 180,000 employees. He and Dally agree that every time a company doubles in headcount, previous communication norms fracture, requiring careful balancing of necessary bureaucracy versus startup-like community momentum [00:57:37].

7. References & Recommendations

  • Gemini: Google's multimodal LLM noted for achieving Gold Medals in the International Mathematical Olympiad (IMO) and ICPC coding competitions. [00:01:33]
  • AlphaChip: Google's Reinforcement Learning system for chip placement and routing, referenced heavily for its impact on TPU design and its foundational publication in the journal Nature. [00:23:43]
  • Megatron & GR00T: NVIDIA's foundational LLM and robotic models respectively, developed internally to anticipate future hardware requirements prior to widespread market adoption. [00:09:52]
  • NVCell, PrefixRL, Chip Nemo, Bug Nemo: NVIDIA's suite of internal AI agents responsible for automating standard cell libraries, circuit pathing, internal Q&A, and bug triaging. [00:24:38]
  • Groq: AI hardware startup (whose inference technology was licensed/acquired by NVIDIA in late 2025) referenced as an example of heavily optimized low-latency inference hardware. [00:16:20]
  • Blackwell & Rubin: NVIDIA's successive GPU architectural generations, referenced to illustrate the difficulty of using AI to port specialized hardware components across radically different architectures. [00:52:41]
  • Cray T3D & Black Widow: Historical supercomputer architectures cited by Dally as benchmarks for 3D Torus networking and ultra-low latency routing. [00:46:12]
  • AlphaGo: DeepMind's seminal RL model, referenced as a potential structural blueprint for future LLMs that learn by conversing with one another rather than static pre-training. [00:14:12]
  • ShapingAI.com: A website/paper led by Jeff Dean and co-authors analyzing the upcoming impacts of AI across seven societal verticals, including Education, Healthcare, Labor, Science, and Media generation. [00:49:10]
  • NVFP4 Format: NVIDIA's highly efficient 4-bit floating point format originally designed for inference but proving shockingly robust for training mathematics. [00:33:08]
  • Chinchilla Scaling Laws: DeepMind's seminal research on optimal compute-to-parameter ratios during pre-training, which Dean notes must be reconsidered when factoring in the lifetime cost of Inference. [00:11:44]

4. Data & Figures (continued)

Data Point | Value | Context | Timestamp
PHY Bandwidth Reduction for Latency | 400 Gbps ➔ 200 Gbps | Intentionally cutting per-pair bandwidth to bypass heavy DSP and error correction, saving critical nanoseconds. | [00:05:45]
Cray Black Widow Router Latency | < 50 nanoseconds | A historical 2006 hardware baseline Bill Dally believes we must return to. | [00:06:04]
Target Interactivity Throughput | 10k - 20k tokens/sec | The throughput necessary per user to support fluid, high-speed agentic execution. | [00:06:10]
Data Center Power Allocation | ~ 90% | The estimated share of modern data center power that goes to model inference rather than training. | [00:16:46]
Attention Sequence Chunking | 128 Tokens | Example chunk size used to restrict full attention to relevant blocks and limit quadratic cost. | [00:21:17]
Hierarchical Context Retrieval | 1 Trillion ➔ 10k docs (10-20M tokens) ➔ 1M window | Theoretical funnel allowing LLMs to "attend" to massive data lakes via lightweight front-end extraction. | [00:22:30]
Human Standard Cell Porting Time | 80 Person-Months | Legacy time cost: a team of 8 engineers working 10 months to port 2,500-3,000 standard cells to a new process. | [00:24:26]
NVCell RL Porting Time | 1 Night on 1 GPU | The modern time cost using reinforcement learning, with results that match or beat human designs in power and area. | [00:24:38]
PrefixRL Performance Gain | + 20% to 30% | Optimization margin found by RL algorithms ignoring human conventions on carry-lookahead adders. | [00:25:34]
NVFP4 Multiply-Add Energy | ~ 10 Femtojoules | Pure compute energy cost for lower-precision inference mathematics. | [00:33:08]
HBM4 Read Energy (4.5 bits) | ~ 15 Picojoules | The high physical energy cost (3-4 pJ per bit) of retrieving data from external memory. | [00:33:15]
Energy Disparity Ratio | ~ 1,000 : 1 | Fetching a number from external memory takes roughly one thousand times more energy than multiplying it. | [00:33:22]
SRAM Read Energy | ~ 10 Femtojoules | On-die localized memory access cost, comparable to the compute operation itself. | [00:33:38]
Minimum Hardware Op Block | ~ 10,000 Operations | The threshold of mathematical volume required to maintain efficiency on parallel dynamic architectures. | [00:41:18]
Educational Performance Lift | + 1 to 2 Std. Dev. | Increase in student outcomes that Dean cites for a fully personalized tutor versus group settings. | [00:50:12]