Showing top 116 results for "AI token costs"

People also ask

What Are the Factors That Lower Token Cost?

Understanding how to optimize token cost requires looking at the equation for calculating cost per million tokens. In this equation, many enterprises evaluating AI infrastructure focus on the numerator: the cost per GPU per hour. For cloud deployments, this is the hourly rate paid to a cloud provider; for on-premises deployments, it’s the effective hourly cost derived from amortizing owned infrastructure. The real key to reducing token cost, however, lies in the denominator: maximizing the delivered token output. That denominator carries two business implications. Minimize token cost: When thi

Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters
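
The snippet above describes cost per million tokens as a ratio: hourly GPU cost in the numerator, delivered token output in the denominator. A minimal Python sketch of that relationship follows. The function name, the per-second throughput unit, and the example figures are illustrative assumptions, not values taken from the source article.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_gpu_per_second: float) -> float:
    """Return the cost (in the same currency as the hourly rate) of 1M tokens.

    Assumed form of the equation: hourly GPU cost divided by the tokens
    one GPU delivers in an hour, scaled to one million tokens.
    """
    if tokens_per_gpu_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_gpu_per_hour = tokens_per_gpu_per_second * 3600
    return gpu_cost_per_hour / tokens_per_gpu_per_hour * 1_000_000


# Hypothetical inputs: a $3.00/hour GPU delivering 1,000 tokens per second.
baseline = cost_per_million_tokens(3.00, 1_000)

# Same hourly rate, doubled delivered throughput (the denominator).
optimized = cost_per_million_tokens(3.00, 2_000)

print(f"baseline:  ${baseline:.4f} per 1M tokens")
print(f"optimized: ${optimized:.4f} per 1M tokens")
```

With the hourly rate held fixed, doubling delivered throughput halves the cost per million tokens, which is why the snippet points at the denominator as the lever.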

Why Does Cost per Token Matter Much More Than FLOPS per Dollar?

The following data for the DeepSeek-R1 AI model demonstrates the difference between theoretical and actual business outcomes. Looking at compute cost alone, the NVIDIA Blackwell platform appears to cost roughly 2x more than NVIDIA Hopper — but compute cost says nothing about the output that investment buys. An analysis of mere FLOPS per dollar suggests a 2x NVIDIA Blackwell advantage compared with the NVIDIA Hopper architecture. However, the actual outcome is orders of magnitude different: Blackwell delivers more than 50x greater token output per watt than Hopper, resulting in nearly 35x lower

Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters

What Is InferenceMAX v1 and Why Does It Matter for AI Economics?

InferenceMAX v1, a new benchmark from SemiAnalysis released Monday, is the latest to highlight Blackwell’s inference leadership. It runs popular models across leading platforms, measures performance for a wide range of use cases and publishes results anyone can verify. Why do benchmarks like this matter? Because modern AI isn’t just about raw speed — it’s about efficiency and economics at scale. As models shift from one-shot replies to multistep reasoning and tool use, they generate far more tokens per query, dramatically increasing compute demands. NVIDIA’s open-source collaborations with Ope

Telecommunications Archives

How Did NVIDIA Double Blackwell Performance Through Continuous Software Optimizations to Lower Token Cost?

NVIDIA doubled Blackwell performance through continuous software optimization, refining kernels, compiler paths, and inference runtimes so the same hardware delivers significantly more useful AI throughput over time. Initial gpt-oss-120b performance on an NVIDIA DGX Blackwell B200 system with the NVIDIA TensorRT LLM library was market-leading, but NVIDIA’s teams and the community have significantly optimized TensorRT LLM for open-source large language models. The TensorRT LLM v1.0 release is a major breakthrough in making large AI models faster and more responsive for everyone. Through advance

Telecommunications Archives

Anthropic confirms it’s been ‘adjusting’ Claude usage limits

…how the big AI providers treat subscribers on flat-rate plans. In the past, AI users on “plus,” “pro,” or “max” plans (which cost anywhere from $10-250 a month, depending on…

Mar 27, 2026 · By Ben Patterson

Gemini 3.5 Flash might be fast enough for gen AI to make sense

…Google now says that the companies using the most AI tokens could save a billion dollars per year by shifting to the more efficient Gemini 3.5 Flash. API pricing for the…

May 19, 2026 · Ryan Whitwam

My RTX 5090 can't keep up with Apple Silicon on the biggest local LLMs, and I hate to admit it

…Apple's M5 series does improve it, but the time-to-first-token on a 30,000-token prompt still feels markedly worse on the Mac even when generation speed afterwards is…

May 14, 2026 · Adam Conway

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai | NVIDIA Technical Blog

Joint benchmarking with Nebius shows that fractional GPUs significantly improve throughput and utilization for production LLM workloads. Feb…

Feb 18, 2026 · Boskey Savla

GitHub Copilot is moving to usage-based billing

…What’s changing: Starting June 1, premium request units (PRUs) will be replaced by GitHub AI Credits. Credits will be consumed based on token usage, including input, output, and cached tokens, according…

Apr 27, 2026 · Mario Rodriguez

Orchestrating AI Code Review at scale

…Kimi processes the most raw input tokens (11.7B) but costs “nothing” since it runs through Workers AI. The per-agent breakdown shows where the tokens actually go: Agent Input Output Cache…

Apr 20, 2026 · Ryan Skidmore

Full-Stack Optimizations for Agentic Inference with NVIDIA Dynamo | NVIDIA Technical Blog

Agentic AI / Generative AI · By Ishan Dhanani and Matej Kosec · Coding…

Apr 17, 2026 · Ishan Dhanani

Discussions and forums

Hacker News · u/tinyopsstudio · 1d ago

Show HN: AI agent token cost calculator for Codex and Claude Code loops

Hacker News · u/Robelkidin · 3w ago

Show HN: Token Usage Meter 12 Providers and Coding Agent

Here once again: a token usage meter for 12+ AI providers, including Anthropic, OpenAI, Google, Alibaba Qwen, Moonshot Kimi, MiniMax, ElevenLabs, Deepgram, and Perplexity. Qlaud.ai provides a token usage meter / AI billing layer. Also Ql…

r/openai · u/VegetablePen4755 · 4d ago

DeepSeek just popped the American AI bubble.

Not by killing AI. By killing the fantasy of unlimited AI pricing power. DeepSeek V4 Pro: input $0.435 per 1M tokens, output $0.87 per 1M tokens. OpenAI GPT-5.5: input $5.00 …

Hacker News · u/AdarshRao23 · 2w ago

Show HN: Torrix, self-hosted LLM observability (no Postgres, no Redis)

I work as an SAP integration consultant and built this as a side project. Friction point: most self-hosted LLM observability tools require Postgres, Redis, and nontrivial infrastructure. Teams just want to see what their …

Hacker News · u/cinooo · 4w ago

What I changed in how I use Claude Code after Anthropic's postmortem

After watching Anthropic's recent postmortem (anthropic.com/engineering/april-23-postmortem), I've been thinking about the way I approach Claude Code differently. They lowered default reasoning effort to fix latency, cal…

NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer | NVIDIA Technical Blog

…context phase of AI inference. It delivers up to 4x better training performance and up to 10x better inference performance per watt, and one-tenth the token cost relative to NVIDIA Blackwell…

Mar 16, 2026 · Rohil Bhargava

Broadcom And Google Benefit Mightily From Anthropic’s Meteoric Growth

…amortize, thereby lowering the cost of tokens. Google’s other option was to lose Anthropic as a customer and have it go off and create its own AI XPUs or do a…

Apr 7, 2026 · Timothy Prickett Morgan

Intel Arc Pro B70 Delivers A 80% Boost in MLPerf Inference v6.0, Existing Arc Pro B60 GPUs Get A 18% Boost Thanks To AI Optimizations

…running larger models. AI inference is increasingly defined not only by GPU throughput but also by CPU-accelerated system performance. The CPU, shaping overall cluster efficiency and total cost of ownership, is…

Apr 1, 2026 · Hassan Mujtaba
