Search

Showing top 102 results for "GPU needs for LLMs"

Videos

Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM | NVIDIA Technical Blog

…GPU fractions with bin packing for multiple small models on a GPU. Many NIM workloads, like embeddings, rerankers, and small LLMs, rarely need an entire GPU. When used with GPU fractions, NVIDIA…

Feb 27, 2026 · Shwetha Krishnamurthy

Home Assistant's local LLM support outperforms Gemini for Home, and Google knows it

…You’re at the mercy of the model Google chooses and have to wait for an update if it falls short. Running LLMs locally frees you from those constraints — of course, you still need…

Apr 28, 2026 · Samir Makwana

I tested Nvidia's flagship GPUs for gaming, and the RTX 5090 wasn't the winner

…pay for the extra fps, but with GPU prices being what they are now, the gap is much smaller than it was at launch for the $10,000 Pro graphics card. Related…

May 8, 2026 · Joe Rice-Jones

This hidden Proxmox setting may sound cursed, but it’s really useful for coding and DIY projects

…But I occasionally need to work with AI-accelerated workloads on my dev VM. Since I’ve already enabled GPU passthrough long ago (which is a lot easier than you think), I…

May 9, 2026 · Ayush Pande

I use this local AI tool to turn boring documents into cool narrations

…I recently started integrating local LLMs with my arsenal of free and open-source tools, and they’ve been a game-changer for my productivity needs…

May 17, 2026 · Ayush Pande

Contemplating Meta’s Homegrown MTIA Compute Engine Roadmap

…The reason why the future MTIAs as well as the current MTIA 300, which has been deployed for R&R training workloads, need to look like GPUs and AI XPUs because they…

Apr 8, 2026 · Timothy Prickett Morgan

Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA Technical Blog

…Better GPU utilization: Separating stages lets each saturate its target resource (compute for prefill, memory bandwidth for decode) rather than alternating between both. Frameworks like NVIDIA Dynamo and llm-d implement this…

Mar 23, 2026 · Anish Maddipoti

Discussions and forums

r/homelab · u/AntifaAustralia · 2w ago

My first 10 inch rack with local LLM! No more Spotify, Google Home, Netflix, ChatGPT...

I'm pretty new to homelabbing and this is my first mini rack! Started with the Beelink ME Mini and then just kinda grew from there (it's always the way hey haha). It idles at 70 watts (not too shabby for how much is goin…

r/LocalLLaMA · u/APFrisco · 2w ago

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec

As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and al…

r/LocalLLaMA · u/janvitos · 2w ago

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec w…

r/LocalLLaMA · u/ex-arman68 · 3w ago

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

2026-05-07 edit: I have updated the hardware based recommendations with more focus on quality. I do not recommend q4_0 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants,…

r/selfhosted · u/lazycodewiz · 2w ago



Top stories

Trying to self-host LLMs made me realize local AI has a friction problem, not a quality problem

AMD just dropped a compact AI workstation that makes discrete GPUs look outdated for running LLMs

I added a second GPU just for local AI workloads, and it cost less than upgrading my main one

13 years later, the GTX Titan is still the most important GPU Nvidia ever made


Discussions and forums


services with actually generous free tiers for open-source projects. my list, what would you add?

Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer | NVIDIA Technical Blog

ASUS Unveils Infrastructure Breakthroughs at SC24

Ollama is still the easiest way to start local LLMs, but it's the worst way to keep running them