Taalas HC1: The 17,000 Tokens/Second AI Chip That's Beating Nvidia GPUs

Verified by Essa Mamdani · 5 min read

The era of "waiting for AI to think" is over. A Toronto startup just hard-wired an entire LLM into silicon — and the numbers are insane.


The Chip That Changes Everything

In February 2026, a 2.5-year-old Toronto startup called Taalas launched the HC1 — and the AI world can't stop talking about it.

This isn't another GPU accelerator. This isn't a TPU clone. This is something entirely different: an entire large language model, hard-wired directly into silicon.

The results? 17,000+ tokens per second. That's not a prototype. That's not a projection. That's measured, verified, and already deployed in beta.


What Makes Taalas HC1 Different?

Traditional GPUs vs. Hard-Wired AI

Here's how your Nvidia H200 works:

  1. Load the LLM weights into High Bandwidth Memory (HBM)
  2. Transfer model parameters to GPU memory for each inference
  3. Compute tokens through thousands of parallel cores
  4. Repeat for every single request

Here's how Taalas HC1 works:

  1. The entire Llama 3.1 8B model is already in the silicon — all 8 billion parameters, hard-coded at the transistor level
  2. No HBM required. No dynamic loading. No memory transfer overhead.
  3. Just compute. Pure, unadulterated inference speed.
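
A rough way to see why step 2 dominates on GPUs: at batch size 1, every generated token requires streaming the full weight set out of HBM, so memory bandwidth sets a hard throughput ceiling. A back-of-the-envelope sketch (the precision and bandwidth figures are assumptions for illustration, not numbers from the article):

```python
# Back-of-the-envelope: batch-1 decoding on a GPU is memory-bound,
# because each token requires reading all model weights from HBM.
# Assumptions (not from the article): FP16 weights, H200-class HBM3e.

PARAMS = 8e9               # Llama 3.1 8B parameters
BYTES_PER_PARAM = 2        # FP16/BF16
HBM_BANDWIDTH = 4.8e12     # bytes/s, roughly H200-class

weight_bytes = PARAMS * BYTES_PER_PARAM    # bytes streamed per token
ceiling = HBM_BANDWIDTH / weight_bytes     # tokens/s upper bound

print(f"Weights streamed per token: {weight_bytes / 1e9:.0f} GB")
print(f"Batch-1 throughput ceiling: ~{ceiling:.0f} tokens/s")
```

That ~300 tokens/s ceiling is why GPU serving leans so heavily on large batches, and why baking the weights into the silicon removes the bottleneck entirely.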

"Unlike flexible GPUs or general-purpose ASICs, it embeds the full model, parameters, and weights into hardware, eliminating much of the overhead associated with loading and processing models dynamically." — Financial Express


The Benchmarks (These Are Real)

| Metric             | Taalas HC1         | Nvidia H200/B200 | Cerebras    | Winner              |
|--------------------|--------------------|------------------|-------------|---------------------|
| Tokens/second      | 14,357–16,960      | ~150–200         | ~1,400      | HC1                 |
| Cost per 1M tokens | $0.0075            | $0.20–$0.49      | N/A         | HC1 (100x cheaper)  |
| Power (per rack)   | 12–15 kW           | 120–600 kW       | 80 kW       | HC1 (10x less)      |
| Memory             | None (hard-wired)  | HBM3e required   | Wafer-scale | HC1 (no HBM)        |
| Cooling            | Air-cooled         | Liquid cooling   | Complex     | HC1 (simpler)       |
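
The 10x and 100x claims can be sanity-checked against the table's own throughput figures (using the lower end of the HC1 range):

```python
# Sanity-check the speedup multipliers from the table's throughput numbers.
hc1 = 14357        # tokens/s, lower end of the HC1 range
gpu = 150          # tokens/s, lower end of the H200/B200 range
cerebras = 1400    # tokens/s

print(f"vs Cerebras: ~{hc1 / cerebras:.0f}x")   # ~10x
print(f"vs GPU:      ~{hc1 / gpu:.0f}x")        # ~96x
```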

The Headlines

  • 10x faster than Cerebras Wafer-Scale Engine (previously the fastest)
  • 100x faster than Nvidia's best GPUs in comparable settings
  • 10x cheaper overall inference costs
  • 0.138 seconds to generate a month-by-month WWII history response

That's not a typo. That's sub-second latency for a full, multi-paragraph answer.
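
For scale, here's what 0.138 seconds buys at the quoted throughput (simple arithmetic, assuming generation runs at the steady-state rate the whole time):

```python
# Token count implied by 0.138 s at ~17,000 tokens/s
# (assumes steady-state generation for the entire response).
rate = 17000          # tokens/s
elapsed = 0.138       # s, the WWII history demo
print(f"~{rate * elapsed:.0f} tokens generated")   # ~2346 tokens
```

Over two thousand tokens in the time it takes to blink.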


Why This Matters: The End of "Thinking Time"

For years, AI users have been conditioned to wait:

  • "Thinking..."
  • "Let me work on that..."
  • Spinning wheels, loading indicators, 10-second delays for simple queries

Taalas HC1 eliminates all of that. The LLM is always ready. There's no "loading phase." There's no "initialization." The model is already in the chip, alive and waiting.

This changes the UX of AI entirely. We're moving from:

  • "Please process my request" → "Here's your answer now"

The Tech Specs

  • Process: TSMC 6nm
  • Size: 815 mm²
  • Transistors: 53 billion
  • Power: ~200W per card (2.5kW dual-socket server supports multiple cards)
  • Interface: PCIe 5.0
  • Cooling: Air-cooled (no exotic cooling needed)
  • Model Support: Llama 3.1 8B (current), with LoRA adapters for fine-tuning
  • Update Cycle: New model releases → hardened silicon in just 2 months (vs. 12-18 months for traditional ASICs)
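
Combining the spec sheet's power draw with the throughput figure gives a tokens-per-joule estimate. The GPU comparison numbers here are my own rough assumptions (a ~700 W H200-class card at the table's ~200 tokens/s), not figures from the article:

```python
# Energy-efficiency sketch: tokens generated per joule of card power.
# GPU figures are assumptions for comparison, not from the article.
hc1_tps, hc1_w = 17000, 200    # HC1: tokens/s, watts per card (spec sheet)
gpu_tps, gpu_w = 200, 700      # GPU: assumed batch-1 rate, assumed card power

hc1_tpj = hc1_tps / hc1_w      # tokens per joule
gpu_tpj = gpu_tps / gpu_w

print(f"HC1: {hc1_tpj:.0f} tok/J  GPU: {gpu_tpj:.2f} tok/J")
print(f"Ratio: ~{hc1_tpj / gpu_tpj:.0f}x")
```

Under these assumptions the gap is roughly two orders of magnitude, which is what makes air cooling plausible at these speeds.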

The Flexibility Trade-off

Yes, there's a catch. Each HC1 chip is hard-coded for one specific model. You can't swap out Llama 3.1 8B for Mistral on the same chip.

But Taalas addresses this with:

  • LoRA adapters for fine-tuning without changing the base model
  • Multiple SKUs as models evolve
  • Metal layer changes for model updates (much faster than full re-spin)

The Company

  • Founded: ~2.3 years ago (mid-2023)
  • Headquarters: Toronto, Canada
  • Funding: $200M+ total ($169M in recent round)
  • Development cost: ~$30 million
  • Team: Just 24 people for the first product
  • CEO: Ljubisa Bajic (co-founder, former Tenstorrent executive)

That's $30 million to build something that could disrupt a $500 billion GPU market. Not bad.


What's Next?

Taalas has an aggressive roadmap:

  • Spring 2026: Mid-sized reasoning LLM on HC1 silicon
  • Winter 2026: HC2 — second-generation platform for frontier-scale models (terabyte-class models), using multi-chip designs and 4-bit floating-point

Available Now

You can actually try it right now:

  • Chat demo: chatjimmy.ai (beta)
  • API access: Available for developers
  • Hardware: HC1 chips for sale

GPU vs. Hard-Wired AI: Which Is the Future?

When to Use GPUs (Nvidia, AMD)

  • Training — GPUs are unmatched for model training due to flexibility
  • Multi-model serving — running dozens of models on one infrastructure
  • Rapid iteration — new models weekly, can't re-siliconize every time
  • Research/experimentation — need latest models, weights, architectures

When to Use Hard-Wired AI (Taalas)

  • High-volume inference — millions of requests per day
  • Single-model dominance — one model serves 90%+ of traffic
  • Cost optimization — 10x cheaper per token adds up
  • Edge deployment — no HBM, no liquid cooling, air-cooled simplicity
  • Latency-critical apps — per-token latency measured in microseconds
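
The cost argument is easy to make concrete. A sketch with a hypothetical workload of one billion tokens per day, using the article's per-million-token prices (GPU price taken as the midpoint of the quoted range):

```python
# Cost at scale, using the article's per-1M-token prices.
# The daily volume is a made-up example workload.
daily_tokens = 1_000_000_000        # 1B tokens/day (hypothetical)
hc1_per_m = 0.0075                  # $/1M tokens (article)
gpu_per_m = 0.35                    # $/1M tokens, midpoint of $0.20-$0.49

hc1_daily = daily_tokens / 1e6 * hc1_per_m
gpu_daily = daily_tokens / 1e6 * gpu_per_m

print(f"HC1: ${hc1_daily:,.2f}/day  GPU: ${gpu_daily:,.2f}/day")
print(f"Annual difference: ${(gpu_daily - hc1_daily) * 365:,.0f}")
```

At that volume the per-token pennies compound into six figures a year, per billion daily tokens.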

The Verdict

Taalas HC1 isn't replacing Nvidia GPUs. Not yet. Maybe not ever for many use cases.

But it exposes a fundamental truth: the future of AI inference doesn't look like the past.

General-purpose GPUs were never the destination — they were a placeholder. As AI models stabilize and inference volume explodes, purpose-built silicon will eat the inference market the way GPUs ate the training market.

The question isn't whether hard-wired AI wins. The question is: who builds the best hard-wired chips?

Nvidia is already working on custom silicon. Groq has the LPU. Cerebras has the wafer-scale engine. And now Taalas has the model-hardened approach.

The AI silicon wars have officially begun.


Published: February 22, 2026
Sources: Financial Express, Taalas official benchmarks