Hardware Options for Running Local LLMs 2025

Explore the optimal hardware for running large language models (LLMs) locally, from entry-level edge devices to enterprise-grade datacenter systems.

List of hardware suitable for running local large language models (LLMs)


Entry-Level Edge Devices

  • Raspberry Pi AI Kit with Hailo-8L Accelerator

    • M.2 HAT+ & Hailo-8L Module: Pre-installed 13 TOPS AI accelerator for Raspberry Pi 5
    • Performance: 13 tera-operations per second (INT8)
    • Price: $70
  • Hailo-8 M.2 AI Acceleration Kit (26 TOPS)

    • 26 TOPS Performance: Double the compute power of Hailo-8L
    • PCIe Gen3 x4 Interface: 4-lane connection for maximum bandwidth
    • Thermal Design: Supports -40°C to 85°C operation with optional cooling fan
    • Compatibility: Requires PCIe-to-M.2 adapter for Raspberry Pi 5 (included in kit)
    • Framework Support: TensorFlow, PyTorch, ONNX, Keras
    • Use Cases: Multi-model inference, high-resolution video analysis
    • Price: $140 (kit includes M.2 module, adapter, thermal pad)
    • Key Features:
      • 2.5W typical power consumption (8.65W max)
      • CE/FCC certified for industrial deployment
      • Supports simultaneous 4K@60fps video processing
    • Comparison vs. 13 TOPS Model:
      • 100% higher AI performance
      • 50% faster model inference times
      • 2× larger neural network capacity
  • NVIDIA Jetson Orin Nano Super

  1. Base Model (8GB)
  • Price: $249 (Developer Kit)
  • AI Performance: 40 TOPS (67 TOPS in Super Mode)
  • GPU: 1024-core Ampere (1.02 GHz Max)
  • CPU: 6-core Arm Cortex-A78AE @ 1.7 GHz
  • Memory: 8GB LPDDR5 (102 GB/s bandwidth in Super Mode)
  • Power: 7-25W configurable
  2. 4GB Lite Version
  • Price: $199
  • AI Performance: 34 TOPS (Max)
  • GPU: 512-core Ampere
  • Memory: 4GB LPDDR5 (51 GB/s)

Key Features:

  • Overclocking Capability: Existing owners can unlock "Super Mode" via software update (JetPack 6.2+)
  • Camera Support: 8x MIPI CSI-2 lanes for 4K60 video input
  • Real-World Performance:
    • 13B parameter LLM @ 18 tokens/sec (4-bit quantized)
    • 70% faster than previous-gen Jetson Nano

Specialized & Alternative Accelerators

  • Google Coral Edge TPU, Intel Neural Compute Stick 2
    • For tiny models and edge inference, but limited for LLMs
    • Approximate price: $40–$100

Performance Comparison Table (Entry-Level)

| Feature | Hailo-8 (26T) + RPi 5 | Hailo-8L (13T) + RPi 5 | NVIDIA Jetson Orin Nano Super |
|---|---|---|---|
| Price | $140 | $70 | $249 |
| AI Performance (TOPS) | 26 | 13 | 67 |
| Power Consumption | 2.5–8.65W | 1–5W | 7–25W |
| LLM Support | 3–5B params | 1–3B params | 7–13B params |
| Camera Streams | 4x 4K@30fps | 2x 4K@30fps | 8x 4K@30fps |
| Thermal Design | Active cooling option | Passive only | Active cooling required |

Mid-Range Laptops & Mini-PCs

  • AMD Ryzen 7 8745HS with 64GB RAM

    • 8 cores / 16 threads, Zen 4 (Hawk Point), 3.8–4.9 GHz
    • Integrated Radeon 780M (12 CUs, up to 4 TFLOPS)
    • Supports up to 256GB DDR5/LPDDR5x (dual channel, up to 120GB/s bandwidth)
    • 35–54W configurable TDP
    • Can run quantized LLMs up to 7B–13B parameters; performance is limited by memory bandwidth and iGPU speed, but 64GB RAM allows for larger context windows and models
    • For experimentation, local chatbots, and small-scale inference
    • NOTE: Linux driver support is currently incomplete
    • Approximate price (mini-PC): $440–$549 (barebones or with 16GB RAM; expect $700–$900 with 64GB RAM)

    [Figure: AMD Ryzen 7 8745HS with 64GB RAM performance results]

  • AMD Ryzen AI 9 HX 370 Systems

    • Architecture: Zen 5 (4 cores) + Zen 5c (8 cores), 24 threads
    • Clock Speeds: 2.0 GHz base, up to 5.1 GHz boost
    • NPU: XDNA 2 with 50 TOPS AI performance (80 TOPS combined with CPU/GPU)
    • GPU: Radeon 890M (16 RDNA 3.5 CUs @ 2.9 GHz)
    • Memory: LPDDR5x-8000 (up to 256GB)
    • TDP: 28W configurable (15-54W range)
    • Performance Highlights
      • Handles 13B-30B parameter LLMs with 4-bit quantization
      • 80 TOPS total AI processing for real-time inference
  • AMD Ryzen AI Max+ 395 Systems

    • Architecture: Zen 5 (16 cores/32 threads)
    • Clock Speeds: 3.0 GHz base, up to 5.1 GHz boost
    • NPU: XDNA 2 with 50 TOPS AI performance (126 TOPS combined with CPU/GPU)
    • GPU: Radeon 8060S (40 RDNA 3.5 CUs @ 2.9 GHz)
    • Memory: LPDDR5x-8000 (up to 128GB, 96GB allocatable as VRAM)
    • TDP: 45-120W configurable
    • PCIe Support: 4.0 x16 lanes
    • Thermal Design: Honeywell PTM7958 phase-change material + 6 heatpipes
    • Performance Highlights
      • Handles 70B+ parameter LLMs with 4-bit quantization
      • 29.7 TFLOPS GPU performance (rivals RTX 4070 Mobile)
      • 273 GB/s memory bandwidth via 256-bit bus
  • Apple M1/M2/M3/M4 (with 32–64GB RAM)

    • Unified memory, strong neural engine
    • Efficient for 7B–13B parameter models with frameworks like MLX (see the sketch below)
    • Approximate price: $1,800–$3,500 (varies by model and RAM)
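
On Apple Silicon, the MLX stack is the idiomatic route. A minimal mlx-lm sketch, assuming the mlx-lm package and a 4-bit community model (the model ID below is an example, not a recommendation, and the generate signature may vary slightly across mlx-lm releases):

```python
# Minimal mlx-lm sketch for Apple Silicon (pip install mlx-lm).
# The model ID is an example from the mlx-community hub.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Why does unified memory help local LLM inference?",
    max_tokens=200,
)
print(text)
```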

Performance Comparison Table (Mid-Range)

| Feature | AMD Ryzen 7 8745HS | AMD Ryzen AI 9 HX 370 | AMD Ryzen AI Max+ 395 | Apple M3 Systems |
|---|---|---|---|---|
| Price Range | $700–$900 | $850–$1,500 | $1,699–$3,699 | $1,800–$3,500 |
| CPU Cores | 8 Zen 4 | 12 (4 Zen 5 + 8 Zen 5c) | 16 Zen 5 | 8–16 (M3 to M3 Max) |
| NPU TOPS | N/A | 50 | 50 | 18 |
| GPU Performance | 4 TFLOPS (780M) | 12 TFLOPS (890M) | 29.7 TFLOPS | 20 TFLOPS (M3 GPU) |
| Max RAM | 256GB | 256GB | 128GB | 128GB |
| LLM Support | 7B–13B | 13B–30B | 70B+ | 7B–13B |
| Power Consumption | 35–54W | 15–54W | 45–120W | 20–40W |
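
The "LLM Support" rows above follow a simple rule of thumb: a quantized model needs roughly params × bits / 8 bytes for its weights, plus headroom for the KV cache and activations. A minimal sketch (the 20% overhead factor is an assumption, not a measurement):

```python
# Rough memory footprint for a quantized model: weights at bits/8 bytes
# per parameter, plus ~20% assumed overhead for KV cache and activations.
def est_memory_gb(params_billions: float, bits: int = 4, overhead: float = 1.2) -> float:
    return params_billions * bits / 8 * overhead

for p in (7, 13, 30, 70):
    print(f"{p}B @ 4-bit ≈ {est_memory_gb(p):.1f} GB")
# 7B ≈ 4.2, 13B ≈ 7.8, 30B ≈ 18.0, 70B ≈ 42.0 GB — consistent with the
# LLM Support column above once OS and framework overhead are counted.
```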

High-End Workstations

  • Apple Mac Studio with M3 Ultra (512GB Unified Memory)
    • M3 Ultra chip: up to 32-core CPU, 80-core GPU, 32-core Neural Engine
    • 512GB unified memory, 819GB/s memory bandwidth
    • Can load and run extremely large models, including 4-bit quantized LLMs exceeding 600B parameters
    • Real-world use: DeepSeek R1 (671B) runs locally using ~404GB storage and ~448GB RAM, at 17–18 tokens/sec, under 200W power
    • Ideal for researchers, developers, and organizations needing local, private AI processing for sensitive data
    • Robust I/O: Thunderbolt 5, HDMI 2.1, 10Gb Ethernet, Wi-Fi 6E, Bluetooth 5.3
    • Approximate price: Starts at $3,999; fully configured (32-core CPU, 80-core GPU, 512GB RAM, 16TB SSD): $14,099

Consumer GPUs (Desktops)

  • NVIDIA RTX 4060 Ti (16GB)

    • Entry-level for desktop LLMs, suitable for 7B–13B models
    • Approximate price: $449.99
    • ❌ With only 34 SMs, the max-autotune GEMM optimization mode is not available (see the 68 SM threshold section below)
  • NVIDIA RTX 3080 (12GB)

    • Good for 13B–22B parameter models (with quantization)
    • Approximate price (3080 new): $1,158; used: $400–$600
  • NVIDIA RTX 3090 (24GB)

    • Good for 13B–22B parameter models (with quantization)
    • Approximate price (3090 new): $1,400–$1,800; ✅ used: $800–$1,200
  • NVIDIA RTX 4080 (16GB)

    • Can handle 30B–40B parameter models with high performance
    • Approximate price (4080): $1,199–$1,299
  • NVIDIA RTX 4090 (24GB)

    • Can handle 30B–40B parameter models with high performance
    • Approximate price (4090): $1,599–$2,099 (varies by brand and availability)
  • NVIDIA RTX 5090 (32GB)

    • Best consumer choice for a local LLM setup; 32GB of VRAM comfortably fits 30B-class models at 4-bit (see the table below)
    • Approximate price: $2,500–$3,000

Representative models and approximate throughput on the RTX 5090 (a usage sketch follows this list):

| Model | Params* | VRAM Use | Tokens/sec | Notable Use |
|---|---|---|---|---|
| DeepSeek-R1:32B | 32B | 20GB (4-bit) | ~45 | Reasoning, long-form |
| Qwen2.5:32B | 32B | 20GB (4-bit) | ~45 | Coding, multilingual |
| Qwen3:30B | 30B | ~19GB (4-bit) | ~40–45* | Agentic tasks, strong reasoning, multi-turn, multilingual |
| LLaMA 3:27B | 27B | 17GB (4-bit) | ~47 | Chat, coding, agent |
| QWQ:32B | 32B | 20GB (4-bit) | ~57 | AI agents, experimental |
| Gemma3:27B | 27B | ~18GB (4-bit) | ~45 | Multimodal: text, image, summarization, QA |
| Gemma3:12B | 12B | ~9GB (4-bit) | ~75 | Multimodal: text, image; fast responses |
| Gemma3n:e4b | ~2B† | ~7.5GB | ~120+ | Efficient, instruction-tuned, general text tasks |
| Gemma3n:e2b | ~2B | ~5.6GB | ~180+ | Fastest Gemma 3n, daily device compatibility |
| gpt-oss:20b | 20B | 14GB (4-bit) | ~45 | Fast chat, coding, agentic tasks |
  • NVIDIA RTX A6000 (48GB)
    • For professionals needing large context windows or fine-tuning
    • Approximate price: $4,000–$4,500 (varies by vendor)
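
Models like those in the throughput table are typically served with Ollama. A minimal sketch using the official ollama Python client, assuming a local Ollama server with the model already pulled (e.g. via `ollama pull qwen2.5:32b`):

```python
# Minimal Ollama client sketch (pip install ollama; server must be running).
import ollama

response = ollama.chat(
    model="qwen2.5:32b",  # any tag from the table above
    messages=[{"role": "user", "content": "Summarize PCIe 4.0 vs 5.0 in two sentences."}],
)
print(response["message"]["content"])
```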

Enterprise & Datacenter GPUs

  • NVIDIA A100 (40GB/80GB HBM2e)

    • Designed for 13B–70B parameter models at high speed
    • Approximate price (40GB): $23,699–$32,449 (varies by model and vendor)
    • Approximate price (80GB): $37,699
  • NVIDIA H100 (80GB HBM3)

    • Up to 2× A100 performance, 250–300 tokens/sec on large models
    • Approximate price: Starts at ~$25,000 per GPU; multi-GPU setups can exceed $400,000
  • AMD Instinct MI250 (128GB HBM2e)

    • High memory capacity, competitive for large LLMs
    • Approximate price: $14,364

NVIDIA DGX Systems

NVIDIA DGX systems are purpose-built AI supercomputers designed for the most demanding machine learning and large language model workloads.

  • NVIDIA DGX A100

    • 8× NVIDIA A100 GPUs (up to 640GB total GPU memory)
    • 6× NVIDIA NVSwitches for high-speed GPU interconnect
    • Dual 64-core AMD CPUs, 1TB system memory
    • 15TB Gen4 NVMe SSD storage
    • Approximate price: $149,000 (was $289,000 at launch)
  • NVIDIA DGX B200

    • 8× Blackwell B200 GPUs (each with ~180GB HBM3e, 1.44TB total GPU memory)
    • NVLink/NVSwitch for up to 1.8TB/s per GPU link
    • 2× Intel 8570 CPUs (112 CPU cores), 4TB DDR5 RAM
    • Approximate price: $515,410
  • NVIDIA DGX Station

    • Desktop AI supercomputer with NVIDIA GB300 Grace Blackwell Ultra chip
    • Up to 784GB unified system memory, 1× NVIDIA Blackwell Ultra GPU
    • Approximate price: $50,000–$70,000 (estimate; varies by configuration)
  • NVIDIA DGX Spark

    • Compact AI workstation built on NVIDIA GB10 Grace Blackwell Superchip
    • 20-core Arm CPU, Blackwell GPU architecture, 128GB unified system memory
    • NOTE: The 273 GB/s memory bandwidth may limit effective utilization of the 1,000 AI TOPS compute capability for bandwidth-bound workloads (see the estimate after this list)
    • Approximate price: $3,000–$4,000 (estimate; not widely available retail)
  • NVIDIA DGX Cloud
    • Fully managed AI platform providing DGX infrastructure as a service
    • Approximate price: Cloud subscription, typically $10–$30 per GPU/hour (varies by provider and GPU type)
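
Why does the DGX Spark note above single out memory bandwidth? During single-stream decoding, each generated token must stream roughly the full set of model weights from memory, so bandwidth sets a hard ceiling on tokens/sec regardless of TOPS. A back-of-envelope sketch (this simple model ignores KV-cache traffic and batching):

```python
# Bandwidth-bound decode ceiling: assume each token reads all weights once.
def decode_ceiling_tok_s(params_billions: float, bits: int, bandwidth_gb_s: float) -> float:
    weight_bytes = params_billions * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / weight_bytes

# 70B dense model at 4-bit: DGX Spark (273 GB/s) vs. Mac Studio M3 Ultra (819 GB/s)
for name, bw in [("DGX Spark", 273), ("M3 Ultra", 819)]:
    print(f"{name}: ~{decode_ceiling_tok_s(70, 4, bw):.0f} tokens/sec ceiling")
# DGX Spark: ~8, M3 Ultra: ~23. MoE models like DeepSeek R1 activate far
# fewer parameters per token, which is how the 671B model reaches the
# 17–18 tokens/sec reported earlier on the M3 Ultra despite its size.
```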

NVIDIA DGX systems are widely used in enterprise AI, research labs, and cloud data centers to accelerate LLM development, fine-tuning, and deployment at scale.


NVIDIA DGX Spark Price

The NVIDIA DGX Spark has undergone significant price adjustments since its initial announcement:

Base Configuration

  • 1TB Storage Model: Originally announced at $3,000 under the "Project Digits" codename, this entry-level configuration is now rarely available directly from NVIDIA but may still be offered by third-party OEMs like ASUS or Dell.
  • 4TB Storage Model: Post-rebranding to DGX Spark, the primary retail configuration now starts at $3,999 for the 4TB NVMe SSD version. This reflects a 33% price increase compared to the original Project Digits pricing.

Market Variability

  • Retail Markups: Due to limited availability and high demand, reseller prices often exceed NVIDIA's MSRP. Current market rates range from $4,200–$4,500 for the 4TB model.
  • Clustered Configurations: A pre-configured two-unit cluster with 200GbE RDMA networking and dual 4TB storage is priced at $8,500–$9,000, though this is not widely available to consumers.

Key Factors Influencing Cost

Technical Specifications

The DGX Spark’s pricing reflects its unique hardware:

  • GB10 Grace Blackwell Superchip: Combines a 20-core Arm CPU (10 Cortex-X925 + 10 Cortex-A725) and Blackwell GPU.
  • 128GB Unified LPDDR5x Memory: Provides 273 GB/s bandwidth, critical for running models up to 200 billion parameters.
  • ConnectX-7 Networking: Dual-port 200GbE enables scalable clustering for distributed AI workloads (see the sketch below).
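
For a rough sense of what two-unit clustering looks like at the software level, here is a minimal torch.distributed sketch; the address, port, and NCCL backend choice are illustrative assumptions, not NVIDIA's documented cluster setup:

```python
# Minimal two-node all-reduce sketch: run one process per unit over the
# 200GbE link. The rank-0 address/port below are hypothetical.
import os

import torch
import torch.distributed as dist

def main() -> None:
    rank = int(os.environ.get("RANK", "0"))
    dist.init_process_group(
        backend="nccl",                          # GPU-aware collectives
        init_method="tcp://192.168.1.10:29500",  # example rank-0 address
        rank=rank,
        world_size=2,
    )
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)                 # sums the tensor across both units
    print(f"rank {rank}: {t.item()}")  # 2.0 once both ranks have joined
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```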

Competitive Positioning

  • Compared to consumer GPUs like the RTX 5090 ($1,999 MSRP), the DGX Spark targets developers needing enterprise-grade features (e.g., NVLink-C2C, preinstalled DGX OS).
  • Apple’s M3 Ultra Mac Studio (512GB Unified Memory, $14,099) outperforms the DGX Spark in raw memory bandwidth (819 GB/s) but lacks clustering capabilities.

Regional Pricing Examples

  • United Kingdom: Scan.co.uk lists the 4TB model at £3,699.98 (~$4,700 USD including VAT).
  • European Union: Nextron Denmark offers configurations starting at €4,200 (~$4,550 USD).

Long-Term Value Considerations

  • Energy Efficiency: At 170W TDP, the DGX Spark operates at 1/3 the power consumption of comparable desktop GPUs like the RTX 4090 (450W).
  • Software Stack: Includes NVIDIA’s full AI toolkit (CUDA, TensorRT, Triton), valued at $10,000+ if licensed separately for enterprise use.

Detailing the 68 SM Threshold for PyTorch Inductor & GPU Compatibility

Starting in 2025, PyTorch's inductor backend offers a key optimization mode, max-autotune GEMM, that accelerates the matrix multiplications used in large language model and transformer workloads. However, this mode is only activated on NVIDIA GPUs with 68 or more Streaming Multiprocessors (SMs).

Why 68 SMs?

  • PyTorch’s max-autotune mode leverages enhanced parallel kernel selection, but it’s only enabled for GPUs with >= 68 SMs (a design decision made for optimal benchmarking and kernel diversity).
  • This threshold is hard-coded into modern PyTorch releases and cannot be overridden by user configuration.
  • GPUs below 68 SMs will still function, but will not take advantage of the fastest possible kernel implementations.
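
You can check where a given card falls with a few lines of PyTorch. The 68-SM constant below mirrors the threshold described above; PyTorch's internal check lives in the inductor utilities and may change between releases:

```python
# Query the GPU's SM count and compare it against the 68-SM threshold
# that gates inductor's max-autotune GEMM kernels.
import torch

def meets_max_autotune_threshold(device: int = 0, min_sms: int = 68) -> bool:
    props = torch.cuda.get_device_properties(device)
    print(f"{props.name}: {props.multi_processor_count} SMs")
    return props.multi_processor_count >= min_sms

if torch.cuda.is_available() and meets_max_autotune_threshold():
    # Eligible: request the mode explicitly when compiling a module
    layer = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.half)
    fast_layer = torch.compile(layer, mode="max-autotune")
```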

What is an SM?

  • Streaming Multiprocessors (SMs) are the core compute units of an NVIDIA GPU.
  • More SMs = higher parallelism = better AI/matrix performance, especially for LLMs.

GPUs with >= 68 SMs (Compatible List for Max-Autotune Mode)

| GPU Model | SM Count | Typical VRAM | PyTorch Max-Autotune Mode |
|---|---|---|---|
| RTX 4090 | 128 | 24GB | ✅ Enabled |
| RTX 4080 | 76 | 16GB | ✅ Enabled |
| RTX 4070 Ti SUPER | 66 | 16GB | ❌ (Below threshold) |
| RTX 3090 | 82 | 24GB | ✅ Enabled |
| RTX 3080 10GB | 68 | 10GB | ✅ Enabled |
| RTX 3080 12GB | 70 | 12GB | ✅ Enabled |
| RTX 2080 Ti | 68 | 11GB | ✅ Enabled |
| Tesla A100 | 108 | 40/80GB | ✅ Enabled |
| H100 | 132 | 80GB | ✅ Enabled |
| RTX A6000 | 84 | 48GB | ✅ Enabled |
| DGX Spark | 92 | 128GB | ✅ Enabled |

Note: Some models, like the RTX 3080, have different SM counts depending on VRAM configuration (68 for 10GB, 70 for 12GB).

GPUs below 68 SMs (max-autotune mode not available)

  • RTX 4070 (46 SMs)
  • RTX 3070 (46 SMs)
  • RTX 3060/3060 Ti (28–38 SMs)
  • RTX 4060 Ti (34 SMs)
  • Tesla T4 (40 SMs)
  • Most GeForce 16/20/30/40 series cards below the xx80 tier

Real-World Impact

  • On GPUs with < 68 SMs, PyTorch falls back to default ATen or Triton optimizations. All LLM workflows still function, but users will see slower matrix multiplication speeds—especially for very large models.
  • For maximum LLM performance with PyTorch, select a GPU with 68 or more SMs.

References

  • SM counts: See the NVIDIA CUDA Compute Capability documentation and each card's spec sheet, plus PyTorch source code and open-source community notes.

Tip: If you have a card like RTX 4070 Ti SUPER (66 SMs) or anything below, you will not benefit from max-autotune mode. For workloads where LLM throughput or model training time is crucial, the SM count is now just as important as VRAM.


Note: The CUDA SM counts and the PyTorch threshold are summarized from official NVIDIA and PyTorch documentation and reviews. For detailed specs, see the CUDA Wikipedia page and NVIDIA's official product pages.



Notes on AMD Ryzen 7 8745HS for LLMs

  • The Ryzen 7 8745HS is a modern laptop processor with a strong CPU and integrated Radeon 780M GPU.
  • With 64GB RAM, it can load and run quantized LLMs up to 13B parameters, though performance is limited by memory bandwidth and iGPU speed.
  • The architecture supports DDR5/LPDDR5x, which helps with bandwidth, but the iGPU is still much slower than dedicated GPUs for LLM inference.
  • For best results, use optimized frameworks (such as llama.cpp with a Vulkan or ROCm backend) and focus on quantized models; a sketch follows this list
  • This configuration is for portable, low-power, or experimental LLM use, but not for high-throughput or large-scale deployments.
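
A minimal llama-cpp-python sketch for this class of hardware, assuming a build with the Vulkan or ROCm backend and a locally downloaded GGUF file (the model path is a placeholder):

```python
# Minimal llama-cpp-python sketch (pip install llama-cpp-python, compiled
# with the Vulkan or ROCm backend). The GGUF path below is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen2.5-7b-instruct-q4_k_m.gguf",  # 4-bit quantized
    n_gpu_layers=-1,  # offload all layers to the Radeon 780M iGPU
    n_ctx=8192,       # 64GB of system RAM leaves room for a large context
)
out = llm("Explain why memory bandwidth limits iGPU LLM inference.", max_tokens=256)
print(out["choices"][0]["text"])
```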

All price values are approximate and reflect typical retail or direct purchase prices in the United States as of May 2025.

Actual prices may vary by region, configuration, and market conditions.

Published on 5/26/2025