Showing top 105 results for "AI costs & tokens"

All sources: blogs.nvidia.com (25), developer.nvidia.com (13), theregister.com (10), huggingface.co (10), amd.com (5), techcrunch.com (4), tomshardware.com (3), wccftech.com (3), pcworld.com (3), xda-developers.com (3), theverge.com (2), nextplatform.com (2)

People also ask

What Is InferenceMAX v1 and Why Does It Matter for AI Economics?

InferenceMAX v1, a new benchmark from SemiAnalysis released Monday, is the latest to highlight Blackwell’s inference leadership. It runs popular models across leading platforms, measures performance for a wide range of use cases and publishes results anyone can verify. Why do benchmarks like this matter? Because modern AI isn’t just about raw speed — it’s about efficiency and economics at scale. As models shift from one-shot replies to multistep reasoning and tool use, they generate far more tokens per query, dramatically increasing compute demands. NVIDIA’s open-source collaborations with Ope

Telecommunications Archives

How Is AI Shifting from Pilots to AI Factories and What’s Next?

AI is moving from pilots to AI factories — infrastructure that manufactures intelligence by turning data into tokens and decisions in real time. Open, frequently updated benchmarks help teams make informed platform choices, tune for cost per token, latency service-level agreements and utilization across changing workloads. Learn more about how to calculate lowest cost per token and how the NVIDIA Think SMART framework drives cost efficient inference.

Telecommunications Archives

What Are the Factors That Lower Token Cost?

Understanding how to optimize token cost requires looking at the equation for calculating cost per million tokens. In this equation, many enterprises evaluating AI infrastructure focus on the numerator: the cost per GPU per hour. For cloud deployments, this is the hourly rate paid to a cloud provider; for on-premises deployments, it’s the effective hourly cost derived from amortizing owned infrastructure. The real key to reducing token cost, however, lies in the denominator: maximizing the delivered token output. That denominator carries two business implications. Minimize token cost: When thi

Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters
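
The equation described in the snippet above can be sketched in a few lines. This is a minimal sketch, not the article's own formula: the snippet names only the numerator (cost per GPU per hour) and the denominator (delivered token output), so the exact form below, the function name, and the sample numbers are all assumptions for illustration.

```python
def cost_per_million_tokens(gpu_hourly_cost: float,
                            tokens_per_second_per_gpu: float) -> float:
    """Assumed form: hourly GPU cost divided by the tokens one GPU
    delivers in an hour, scaled to one million tokens."""
    if tokens_per_second_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical inputs: a $2.00/hour GPU at two delivered throughputs.
baseline = cost_per_million_tokens(2.0, 1000.0)
doubled = cost_per_million_tokens(2.0, 2000.0)

print(f"baseline:           ${baseline:.4f} per 1M tokens")
print(f"doubled throughput: ${doubled:.4f} per 1M tokens")
```

With the hourly rate held fixed, doubling delivered throughput halves the cost per million tokens, which is the snippet's point about the denominator.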

How Does Blackwell Achieve 15x Lower Cost Per Token and 10x Higher Efficiency?

Metrics like tokens per watt, cost per million tokens and TPS/user matter as much as throughput. In fact, for power-limited AI factories, Blackwell delivers 10x throughput per megawatt for mixture-of-experts models compared with the previous generation, which translates into higher token revenue. The cost per token is crucial for evaluating AI model efficiency, directly impacting operational expenses. The NVIDIA Blackwell architecture lowered cost per million tokens by 15x versus the previous generation, leading to substantial savings and fostering wider AI deployment and innovation.

Telecommunications Archives

Videos

Paper page - MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

…replaces the dense token-wise indexing in sparse attention with a routed mixture-of-experts approach that reduces computational cost while maintaining performance and handling long contexts effectively. AI-generated summary DeepSeek…

May 11, 2026

Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning | NVIDIA Technical Blog

…Latent MoE that calls 4x as many expert specialists for the same inference cost, by compressing tokens before they reach the experts. Multi-token prediction (MTP) that predicts multiple future tokens in…

Mar 11, 2026 · Chris Alexiuk

Dell Technologies Introduces Dell Deskside Agentic AI Series

…Why it matters As AI workflows shift toward agentic architectures, token usage compounds at an accelerating rate, driving cloud costs that can quickly become unsustainable despite falling token prices. Cloud-only strategies…

May 18, 2026

OpenClaw creator reveals he used over $1,300,000 of OpenAI tokens in a month

…And if you're running a fleet of AI agents processing millions of requests, you'd probably expect the token bill to be pretty high. In the case of OpenClaw developer Peter…

May 18, 2026 · Andy Edser

Paper page - Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

…based key-value cache eviction improves long-context reasoning by selectively retaining useful tokens while reducing memory usage. AI-generated summary The key-value (KV) cache is a major bottleneck in long…

May 12, 2026

NVIDIA Rubin CPX GPU Is Designed For Super AI Tasks Including Million-Token Coding & GenAI, Up To 128 GB GDDR7 Memory, 30 PFLOPs of FP4

…This enables AI systems to handle million-token software coding and generative video with groundbreaking speed and efficiency. Rubin CPX works hand in hand with NVIDIA Vera CPUs and Rubin GPUs inside…

Sep 9, 2025 · Hassan Mujtaba

Claude Pro is great, but here are 3 reasons why it'll never be the only subscription you'll need

…At the time of writing, Opus 4.7 costs $5 per million input tokens and $25 per million output tokens, placing it very well above Sonnet 4.6 at $3 and $15…

May 27, 2026 · Abhinav Raj
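
The per-million-token prices quoted in the snippet above translate into a per-request cost as sketched below. The prices come from the snippet; the request size (10,000 input tokens, 2,000 output tokens), the dictionary keys, and the function name are hypothetical and chosen only for illustration.

```python
# USD per one million tokens, as quoted in the snippet above.
PRICES = {
    "opus-4.7": {"input": 5.0, "output": 25.0},
    "sonnet-4.6": {"input": 3.0, "output": 15.0},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the quoted per-million-token rates."""
    price = PRICES[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000


# Hypothetical request: 10,000 input tokens and 2,000 output tokens.
opus_cost = request_cost("opus-4.7", 10_000, 2_000)
sonnet_cost = request_cost("sonnet-4.6", 10_000, 2_000)

print(f"Opus 4.7:   ${opus_cost:.2f}")
print(f"Sonnet 4.6: ${sonnet_cost:.2f}")
```

Because output tokens are priced at five times the input rate for both models in the snippet, workloads that generate long outputs are dominated by the output term.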

Discussions and forums

Hacker News · u/tinyopsstudio · 4d ago

Show HN: AI agent token cost calculator for Codex and Claude Code loops

Hacker News · u/Robelkidin · 3w ago

Show HN: Token Usage Meter 12 Providers and Coding Agent

Here once again: a token usage meter for 12+ AI providers, including Anthropic, OpenAI, Google, Alibaba Qwen, Moonshot Kimi, MiniMax, ElevenLabs, Deepgram, and Perplexity. Qlaud.ai provides a token usage meter / AI billing layer. Also Ql…

r/openai · u/VegetablePen4755 · 6d ago

DeepSeek just popped the American AI bubble.

DeepSeek just popped the American AI bubble. Not by killing AI. By killing the fantasy of unlimited AI pricing power. DeepSeek V4 Pro: Input: $0.435 per 1M tokens Output: $0.87 per 1M tokens OpenAI GPT-5.5: Input: $5.00 …

r/LocalLLaMA · u/Scared-Biscotti2287 · 2d ago

Zai replaced the network architecture running GLM-5.1 inference and the gains are pretty wild

Been following the infrastructure side of AI more lately and stumbled on this from Zai. They upgraded the network architecture on a thousand-GPU cluster running GLM-5.1 coding inference from the standard ROFT setup to so…

Hacker News · u/AdarshRao23 · 2w ago

Show HN: Torrix, self hosted, LLM Observability,(no Postgres, no Redis)

I work as a SAP Integration consultant and built this as a side project. Friction point: Most self hosted LLM observability tools require Postgres, Redis and non trivial infrastructure. Teams just want to see what their …

A Modder Repurposed a Used V100 For LLM Acceleration

…The adapter cost another $100, and the fan and local taxes another $35, but for under $250, Hardware Haven made a very serviceable AI accelerator. In the Ollama LLM Benchmark, this bootstrapped…

May 11, 2026 · Jon Martindale

Paper page - LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

…Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention…

May 12, 2026

Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform | NVIDIA Technical Blog

…Unlocking a new category of AI experiences on the Pareto frontier A practical way to visualize this tradeoff between performance and cost is the Pareto frontier, plotting user interactivity, measured in tokens…

Mar 16, 2026 · Kyle Aubrey

Top stories

ASUS Takes the Lead in Hybrid Agentic AI Infrastructure- Maximizing Performance While Reducing Inference Costs

AI cost crisis hits tech giants as employee 'tokenmaxxing' backfires, sparking corporate pullback at Microsoft, Meta, and Amazon — agentic AI eats up to 1000x more tokens than standard AI

Dell Launches Local ‘Deskside Agentic AI’ Workstations to Slash Cloud Token Costs

OpenClaw creator burned through $1.3 million in OpenAI API tokens in a single month — bill covered 603 billion tokens across 7.6 million requests and 100 coding agents
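
The figures in the headline above imply a blended rate and an average request size, worked out below. This is a rough sketch that treats the rounded headline numbers ($1.3 million, 603 billion tokens, 7.6 million requests, 100 agents) as exact; the variable names are illustrative, and the result is a blended average across input and output tokens rather than any published price.

```python
# Rounded figures taken from the headline; treated as exact here.
total_cost_usd = 1_300_000
total_tokens = 603_000_000_000
total_requests = 7_600_000
agents = 100

# Blended cost per one million tokens (input and output combined).
blended_per_million = total_cost_usd / total_tokens * 1_000_000

# Average size and cost of a single request.
tokens_per_request = total_tokens / total_requests
cost_per_request = total_cost_usd / total_requests

# Average number of requests issued by each agent over the month.
requests_per_agent = total_requests / agents

print(f"blended rate:       ${blended_per_million:.2f} per 1M tokens")
print(f"tokens per request: {tokens_per_request:,.0f}")
print(f"cost per request:   ${cost_per_request:.3f}")
print(f"requests per agent: {requests_per_agent:,.0f}")
```

Under these assumptions the bill works out to a little over $2 per million tokens and roughly 79,000 tokens per request.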
