On this page

  • Speakers & Credentials
  • 1. Executive Summary
  • 2. Chronological Table of Contents
  • 3. Detailed Thematic Summary
  • The Reference Vault
  • 4. Data & Figures
  • 5. Core Frameworks & Mental Models
  • 6. Anecdotes
  • 7. References & Recommendations
  • 8. The Bottomline (by AI)

Technology/April 22, 2026/7 min read/youtu.be

Hardware-software codesign, with Clive Chan, Dylan Patel and Reiner Pope | Dylan Patel (SemiAnalysis) & MatX

"Co-design is when we can not only make the software better or the hardware better in isolation... we want to make both better and we might want to change the hardware in a way that makes some sacrifices today that will reap rewards in the future." - Clive Chan [00:01:08]

"It becomes co-design when you say 'actually some of these metrics are getting worse,' but that's because these are metrics run on the current generation Nvidia GPU." - Reiner Pope [00:02:12]

References

  1. Original source (youtu.be)

Disclaimer: Original content is owned by or sourced from third parties. It does not represent the views of the 'Nuggets' platform or its team. AI is used extensively across this platform, including for summaries. Accuracy is not guaranteed; there can be mistakes. No information or content on this platform is financial, legal, or investment advice. Do your own research. For complete disclosures, refer to: Terms of Use · Full Disclaimer

"I kind of think you want to make a chip that people do hate to some extent... People don't hate a chip if it's got a very wide basin of operating points where it can work well, and I think that actually means you've left some resources on the table." - Reiner Pope [00:08:49]

"The fact that pre-filled tokens are five times cheaper than decode tokens says something about what utilization or efficiency is in pre-fill versus decode." - Reiner Pope [00:24:42]

"The way I like to frame it is perplexity per picojoule—if you have the same perplexity, how many picojoules, how much energy did you take to generate this token?" - Clive Chan [00:30:22]


Speakers & Credentials

  • Dylan Patel (Host): Chief Analyst at SemiAnalysis; leading expert in semiconductor supply chains and AI hardware economics.
  • Reiner Pope (Guest): CEO at MatX. Previously a senior engineer at Google where he led inference stack optimization and co-authored the foundational PaLM (Pathways Language Model) paper [00:00:17].
  • Clive Chan (Guest): Co-founder at MatX. Specialist in hardware-software co-design, focusing on the physical and energetic constraints of AI silicon [00:00:29].

1. Executive Summary

  • True Co-design: The core thesis is that hardware-software co-design requires intentional "sacrifices" in current software metrics to enable massive efficiency gains on future architectures [00:01:23].
  • The GPU Trap: Standard ML research is often biased toward what runs fast on current Nvidia GPUs, ignoring more fundamental metrics like gate count and energy-per-op [00:02:18].
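  • Utilization Gaps: Current market pricing reveals a large inefficiency: pre-fill tokens are ~5x cheaper than decode tokens. If price tracks hardware time per token, decode reaches only about one fifth of pre-fill's utilization, which implies that roughly 80% of flops sit idle during inference sampling [00:24:42].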
  • Sampling-Heavy Era: Future hardware must optimize for the low arithmetic intensity of the reasoning/sampling era (predicted by Ilya Sutskever in 2022), which favors SRAM and memory bandwidth over raw TFLOPS [00:21:13].
  • New Efficiency Paradigm: The industry should transition from measuring pure perplexity to Perplexity per Picojoule, establishing a Pareto front of intelligence vs. energy cost [00:30:35].
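
The step from the 5x price gap to the 80% idle figure can be sketched as simple arithmetic. This is only an illustration under two assumptions that the summary implies but does not spell out: price per token is proportional to hardware time per token, and pre-fill runs close to full utilization. The function name and the prices are hypothetical.

```python
def implied_decode_utilization(prefill_price: float, decode_price: float) -> float:
    """Decode utilization relative to pre-fill, assuming price tracks compute time."""
    if prefill_price <= 0 or decode_price <= 0:
        raise ValueError("prices must be positive")
    return prefill_price / decode_price


# Illustrative prices only: any pair with a 1:5 ratio gives the same result.
utilization = implied_decode_utilization(prefill_price=1.0, decode_price=5.0)
idle_fraction = 1.0 - utilization

print(f"decode utilization ~ {utilization:.0%}, idle ~ {idle_fraction:.0%}")
```

With a 1:5 price ratio this gives about 20% utilization and 80% idle, matching the figure quoted in the discussion.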

2. Chronological Table of Contents

  • [00:00:00] - Introduction: PaLM Paper and Google's Infra Research.
  • [00:01:08] - Defining the "Sacrifice" in Co-design.
  • [00:02:48] - Activation Function Trade-offs (Swish vs. ReLU).
  • [00:04:30] - DeepSeek V3 and Alibaba Architecture Hacks.
  • [00:08:05] - The "Hated" Chip: TPU v6e (Trillium) Case Study.
  • [00:10:00] - Physicality: 2D Silicon Constraints and Dojo.
  • [00:13:20] - Determining the Correct "Grain Size" for Hardware.
  • [00:16:40] - Programming Pain: Pallas, JAX, and XLA on TPUs.
  • [00:20:15] - The Ilya Sutskever Effect: Test-Time Scaling.
  • [00:23:00] - Low-Latency Regimes: SRAM vs. HBM.
  • [00:25:23] - Disaggregation: Decoupling Attention from MLP.
  • [00:28:59] - The Quantization Trap and "EXFLOP" Metrics.

3. Detailed Thematic Summary

The Co-design Philosophy & Metric Shift [00:01:08]

  • Sacrificing Local Maxima: Co-design is distinct from ML research; it involves intentionally making a model look "worse" on current hardware to unlock architectural efficiencies on future silicon [00:01:23].
  • Activation Efficiency: Swish activations require exponential functions and lookup tables, consuming significant energy [00:03:04]. ReLU is cheap in hardware but tends to cost model quality [00:03:16].
  • Tensor Core Width: Hardware designers often widen tensor cores for more arithmetic throughput. Co-designers must tell researchers to make models wider specifically to saturate that silicon, even if it feels "inefficient" for pure parameter count [00:03:55].
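
To make the activation trade-off concrete, here is a minimal sketch of the two functions in scalar Python. It is not how any accelerator implements them; it only shows that ReLU is a single comparison, while Swish (taken here as x * sigmoid(x)) needs an exponential and a division per element, which is the part the speakers describe as costly in silicon.

```python
import math


def relu(x: float) -> float:
    # One comparison: cheap to build in hardware.
    return max(0.0, x)


def swish(x: float) -> float:
    # x * sigmoid(x), written in a numerically stable form.
    # Each call needs an exponential and a division.
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


for value in (-2.0, 0.0, 1.0):
    print(f"x={value:+.1f}  relu={relu(value):.4f}  swish={swish(value):.4f}")
```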

The Politics of Specialization: Why Researchers "Hate" Specialized Chips [00:08:05]

  • TPU v6e (Trillium/Ghostlight): This chip is viewed as "weird" because it pairs 1 Petaflop of compute with only 32GB of HBM [00:08:16]. This creates a narrow operating point that forces researchers into rigid model shapes [00:08:28].
  • The "Hate" Indicator: If researchers don't "hate" a chip, it likely has too much "slack" (unused resources), meaning it isn't optimally efficient. High utilization requires narrow operating basins [00:08:49].
  • 2D Silicon Reality: Chips are fundamentally two-dimensional. Systolic arrays thrive in 2D, while global L1/L2 cache hierarchies that attempt universal connectivity face massive physical and energy overheads [00:10:41].

The Sampling Era & Test-Time Scaling [00:20:15]

  • The Ilya Sutskever Prediction: As early as 2022, Sutskever identified that intelligence gains would shift from pre-training to test-time scaling (sampling/reasoning) [00:20:40].
  • Arithmetic Intensity Crisis: Pre-training is compute-bound, but reasoning/sampling is memory-bound (low arithmetic intensity). Hardware designed for 2022 pre-training is inefficient for the 2026 reasoning regime [00:21:13].
  • Solving the Idle Flops: Since 80% of flops sit idle during decode, a co-design solution is to crank up MLP size by 5x during the decode phase to utilize spare flops at zero extra time cost [00:25:12].
  • Disaggregation: Decoupling the MLP (weights held in SRAM/3D RAM) from the Attention mechanism (served from HBM) allows for heterogeneous hardware that matches the specific needs of each layer type [00:26:04].
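
The compute-bound versus memory-bound distinction above can be sketched with a roofline-style estimate for a single weight matrix. Everything numeric here is an assumption for illustration: the peak throughput and memory bandwidth describe a hypothetical accelerator rather than any chip named in the discussion, the layer is a square 8192 x 8192 matrix in 2-byte precision, and decode is modeled as one token at a time with no batching.

```python
def matmul_arithmetic_intensity(
    tokens: int, d_in: int, d_out: int, bytes_per_value: int = 2
) -> float:
    """FLOPs per byte moved for one (tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * tokens * d_in * d_out                    # multiply + add
    weight_bytes = d_in * d_out * bytes_per_value        # weights read once
    activation_bytes = tokens * (d_in + d_out) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


# Hypothetical accelerator, not a real product specification.
PEAK_FLOPS = 1.0e15        # FLOP/s
MEM_BANDWIDTH = 2.0e12     # bytes/s
machine_balance = PEAK_FLOPS / MEM_BANDWIDTH  # FLOPs per byte needed to stay busy

decode = matmul_arithmetic_intensity(tokens=1, d_in=8192, d_out=8192)
prefill = matmul_arithmetic_intensity(tokens=4096, d_in=8192, d_out=8192)

for label, intensity in (("decode", decode), ("pre-fill", prefill)):
    bound = "compute-bound" if intensity >= machine_balance else "memory-bound"
    print(f"{label}: {intensity:.1f} FLOPs/byte -> {bound}")
```

Under these assumptions the single-token decode step sits far below the machine balance point, while the 4096-token pre-fill sits above it. Real serving systems batch decode requests, which raises intensity, so the gap in practice depends on batch size.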

Quantization & Intelligence Metrics [00:28:59]

  • The Quantization Trap: Claims of 4-bit models being "97% as accurate" are often misleading; they typically result in a 70B model performing like an 8B model [00:29:05].
  • The 40% Overhead: To maintain full quality under quantization, models generally require a 40% increase in parameter count [00:30:08].
  • Perplexity per Picojoule: The ultimate efficiency metric. Clive's license plate "EXFLOP" and the technical joke of "pJ per Op" emphasize the shift from TFLOPS to energy-based intelligence metrics [00:30:22].
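
A Pareto front of perplexity against energy, as mentioned in the summary, can be sketched as follows. The configurations and their numbers are made up purely for illustration, the energy unit is arbitrary, and `Config` and `pareto_front` are hypothetical names rather than anything from the talk. A configuration is kept only if no other configuration is at least as good on both axes and strictly better on one.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    perplexity: float        # lower is better
    energy_per_token: float  # lower is better (arbitrary units)


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return the non-dominated configs, sorted by energy per token."""
    front = []
    for c in configs:
        dominated = any(
            o.perplexity <= c.perplexity
            and o.energy_per_token <= c.energy_per_token
            and (o.perplexity < c.perplexity
                 or o.energy_per_token < c.energy_per_token)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.energy_per_token)


# Made-up data points for illustration only.
candidates = [
    Config("small-quantized", perplexity=12.0, energy_per_token=1.0),
    Config("medium", perplexity=10.0, energy_per_token=3.0),
    Config("medium-wasteful", perplexity=10.5, energy_per_token=4.0),
    Config("large", perplexity=8.0, energy_per_token=9.0),
]

front = pareto_front(candidates)
for c in front:
    print(f"{c.name}: perplexity={c.perplexity}, energy={c.energy_per_token}")
```

Here "medium-wasteful" drops out because "medium" reaches a lower perplexity for less energy; the remaining three are the trade-off curve a designer would actually choose from.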

The Reference Vault

4. Data & Figures

Data Point | Value | Context | Timestamp
TPU v6e Compute | 1 Petaflop | Raw compute capability of Google's 'Trillium' chip. | [00:08:16]
TPU v6e HBM | 32 GB | High Bandwidth Memory capacity; 1/4th of Nvidia Hopper. | [00:08:16]
Pre-fill vs Decode | 1:5 | Cost ratio at model labs; decode tokens are 5x more expensive. | [00:24:42]
Quantization Tax | 40% | Required increase in model size to offset quantization quality loss. | [00:30:08]
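
The quantization tax can be put in terms of weight memory with a short calculation. This is a sketch under assumptions the source does not state: the unquantized baseline is taken to be 16 bits per parameter, the 70B parameter count is reused from the discussion only as an example size, and only weight storage is counted (no activations or KV cache).

```python
def weight_footprint_gb(params: float, bits_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


BASE_PARAMS = 70e9          # example model size
BASELINE_BITS = 16          # assumed unquantized precision
QUANT_BITS = 4
PARAM_OVERHEAD = 0.40       # the 40% "quantization tax" cited in the discussion

baseline_gb = weight_footprint_gb(BASE_PARAMS, BASELINE_BITS)
quantized_gb = weight_footprint_gb(BASE_PARAMS * (1 + PARAM_OVERHEAD), QUANT_BITS)

print(f"baseline:  {baseline_gb:.1f} GB")
print(f"quantized: {quantized_gb:.1f} GB "
      f"({quantized_gb / baseline_gb:.0%} of baseline)")
```

Under these assumptions the larger 4-bit model still needs only about a third of the baseline weight memory, so the tax shrinks the saving rather than erasing it.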

5. Core Frameworks & Mental Models

  • Intelligence per Picojoule: Evaluating AI systems by the Joules required to reach a specific perplexity target [00:30:22].
  • Grain Size Selection: The engineering decision of how "big" a hardware operation should be. FPGAs are too small; burning-in full Transformers is too rigid [00:13:23].
  • The "Hated Chip" Theory: A specialized chip must be rigid to be efficient. Lack of flexibility is a feature of high-utilization hardware [00:08:49].
  • Test-Time Scaling: The paradigm shift where compute is spent at the end (sampling/thinking) rather than just the beginning (training) [00:20:40].

6. Anecdotes

  • The PaLM Paper Effect: Reiner notes that the PaLM paper was the last "valuable" infrastructure paper Google published before closing up its internal research [00:00:38].
  • Trevor Dye’s Nvidia Wishlist: An example of direct co-design where Microsoft/OpenAI provided a specific architectural wishlist that Nvidia implemented in the Hopper series [00:06:06].
  • DeepSeek V3’s Hopper-Matching: DeepSeek designed their attention mechanism specifically so the arithmetic intensity matches Nvidia Hopper’s limits [00:04:30].
  • The EXFLOP License Plate: Clive Chan’s personal license plate serves as a reminder of the goal of hardware-software co-design: reaching the next order of magnitude in efficient compute [00:30:42].

7. References & Recommendations

  • Papers: PaLM (Pathways Language Model) [00:00:38], DeepSeek V3 [00:04:30].
  • Hardware: TPU v6e (Ghostlight/Trillium), Nvidia Hopper/Blackwell, Tesla Dojo, Groq, Cerebras [00:08:05].
  • Tools/Software: Pallas, JAX, XLA (Google programming stacks) [00:17:00].
  • Entities: Google DeepMind, OpenAI (Strawberry/Reasoning projects), Alibaba Labs [00:19:11, 00:04:54].

8. The Bottomline (by AI)

The era of generalized AI hardware is ending as the industry pivots from raw pre-training to test-time reasoning, which exposes a massive 80% compute utilization gap during sampling. To survive this transition, developers must abandon "Nvidia-first" optimization and embrace "Perplexity per Picojoule" as the only metric that matters for scaling intelligence economically. Watch for the emergence of disaggregated silicon that decouples MLP and Attention layers, effectively "burning in" the transformer architecture to reclaim the massive energy losses inherent in today’s GPU-dominated inference stacks.

Idle Flops | 80% | Percentage of compute sitting idle during sampling/decode. | [00:25:12]
Layer Optimization | 100 to 70 | Alibaba lab's change to layer count to achieve 20% faster inference. | [00:04:54]