I turned my phone into a local LLM server, and it handles vision, voice, and tool calls
…You do need to compile from master and not a tagged release, though, as support for Gemma 4 E4B's audio is pretty recent. Gemma 4 E4B itself takes a bit more…
…TensorRT Edge-LLM is the NVIDIA inference framework for autoregressive models including LLMs, VLMs, and VLAs on embedded platforms. It is designed specifically for the needs of an embedded context: low latency…
…model (LLM) to complete tasks. For example, agents can autonomously review code, implement changes, wait for compilation, and fix any new bugs on their own. There's almost no need for human…
…v2 LLM from the NVIDIA API Catalog. These APIs are useful for evaluating many models and for quick experimentation, and getting started is free. However, for the unlimited performance and control needed in production…
…And for good reason - GPUs aren’t going anywhere. Their massive parallel processing power remains the gold standard for heavy-lift workloads like high-throughput LLM inference. However, the CPU is no…
…such as AMX and AVX-512 allow workloads like LLM inference, fine-tuning, and classical machine learning to run efficiently without the need for dedicated accelerator hardware. More Context: MLPerf Inference v6.0…
…https://gitea.com/gnusupport/LLM-Helpers/src/branch/main/bin/rcd-llm-dmenu-launcher.sh It's a win for sure. Thanks for the update! Does it now behave like Ollama? Thank…
I'm pretty new to homelabbing and this is my first mini rack! Started with the Beelink ME Mini and then just kinda grew from there (it's always the way hey haha). It idles at 70 watts (not too shabby for how much is goin…
As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and al…
Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec w…
2026-05-07 edit: I have updated the hardware based recommendations with more focus on quality. I do not recommend q4_0 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants,…
Been in the weeds shipping an OSS side project for the past few weeks (social media publishing API). Real launch post is coming, this isn't that. Along the way I kept a list of services that actually have usable free tie…
…The Exabox is expected to come in both "red" and "green" models depending upon your preference of AMD or NVIDIA GPUs, respectively. The Exabox will reportedly ship in a ready state for…
…With generative AI and LLMs, the need for more compute performance is growing exponentially for both training and inference. GPUs are at the center of enabling generative AI, and today AMD Instinct…
…my GPU, so I started self-hosting LLMs with it. I self-support my GPU now because Nvidia won't. I went with the Vulkan variant of llama.cpp for this project…