Search

Showing top 110 results for "GPU needs for LLMs"

Videos

I turned my phone into a local LLM server, and it handles vision, voice, and tool calls

…You do need to compile from master and not a tagged release, though, as support for Gemma 4 E4B's audio is pretty recent. Gemma 4 E4B itself takes a bit more…

Apr 21, 2026 · Adam Conway

How to Build In-Vehicle AI Agents with NVIDIA: From Cloud to Car | NVIDIA Technical Blog

…TensorRT Edge-LLM is the NVIDIA inference framework for autoregressive models including LLMs, VLMs, and VLAs on embedded platforms. It is designed specifically for the needs of an embedded context: low latency…

May 5, 2026 · Felix Friedmann

AMD Says Agentic AI Could Put More CPUs Than GPUs in Compute Nodes

…model (LLM) to complete tasks. For example, agents can autonomously review code, implement changes, wait for compilation, and fix any new bugs on their own. There's almost no need for human…

May 6, 2026

Build a Retrieval-Augmented Generation (RAG) Agent with NVIDIA Nemotron | NVIDIA Technical Blog

…v2 LLM from the NVIDIA API Catalog. These APIs are useful for evaluating many models, quick experimentation, and getting started is free. However, for the unlimited performance and control needed in production…

Sep 23, 2025 · Edward Li

ZenDNN 5.2: Accelerating vLLM V1 Engine and Recommender Systems Inference on AMD EPYC™ CPUs

…And for good reason - GPUs aren’t going anywhere. Their massive parallel processing power remains the gold standard for heavy-lift workloads like high-throughput LLM inferencing. However, the CPU is no…

Mar 13, 2026 · Shailen Sobhee

Intel Delivers Open, Scalable AI Performance in MLPerf Inference v6.0

…such as AMX and AVX512 allow workloads like LLM inference, fine tuning, and classical machine learning to run efficiently without the need for dedicated accelerator hardware. More Context: MLPerf Inference v6.0…

Apr 1, 2026 · Daniela Morescalchi

New in llama.cpp: Model Management

…https://gitea.com/gnusupport/LLM-Helpers/src/branch/main/bin/rcd-llm-dmenu-launcher.sh It's a win for sure. thanks for the update! does it now behave like ollama? Thank…

Dec 11, 2025

Discussions and forums

r/homelab · u/AntifaAustralia · 2w ago

My first 10 inch rack with local LLM! No more Spotify, Google Home, Netflix, ChatGPT...

I'm pretty new to homelabbing and this is my first mini rack! Started with the Beelink ME Mini and then just kinda grew from there (it's always the way hey haha). It idles at 70 watts (not too shabby for how much is goin…

r/LocalLLaMA · u/APFrisco · 2w ago

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec

As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and al…

r/LocalLLaMA · u/janvitos · 2w ago

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec w…

r/LocalLLaMA · u/ex-arman68 · 2w ago

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

2026-05-07 edit: I have updated the hardware based recommendations with more focus on quality. I do not recommend q4_0 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants,…

r/selfhosted · u/lazycodewiz · 1w ago

Top stories

I built my own Googlebook with a Raspberry Pi, local LLMs, and old hardware

I added a second GPU just for local AI workloads, and it cost less than upgrading my main one

13 years later, the GTX Titan is still the most important GPU Nvidia ever made

My RTX 5090 can't keep up with Apple Silicon on the biggest local LLMs, and I hate to admit it

Discussions and forums

services with actually generous free tiers for open-source projects. my list, what would you add?

Tiny Corp Begins Accepting Pre-Orders For Their $10M Exabox

Accelerating the Future With Ready AI Solutions

I ran Gemma 4 (26B) on a 10-year-old-GPU, and it's reliable enough to replace the cloud