Local Models
Run AI models locally on your own hardware using HIVE Protocol. Local models provide privacy, cost savings, and offline capability. This guide covers setup with Ollama, hardware requirements, and optimization tips.
Why Use Local Models?
┌─────────────────────────────────────────────────────────────────┐
│ Local vs Cloud Models │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LOCAL MODELS CLOUD MODELS │
│ ──────────── ──────────── │
│ + Complete privacy + No hardware needed │
│ + No per-token costs + Always latest models │
│ + Works offline + Highest capability │
│ + Data never leaves device + Managed infrastructure │
│ - Requires hardware - Per-token costs │
│ - Limited model selection - Requires internet │
│ - Self-managed - Data sent to provider │
│ │
└─────────────────────────────────────────────────────────────────┘
When to Use Local Models
| Scenario | Local | Cloud |
|---|---|---|
| Sensitive data processing | Yes | Evaluate privacy policies |
| Offline operation required | Yes | No |
| High volume, cost-sensitive | Yes | Consider budget |
| Highest quality outputs needed | Usually no | Yes |
| No GPU available | Possible but slow | Yes |
| Development and testing | Yes | Yes |
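The table above reduces to a simple rule: privacy or offline constraints force local, while top-line quality (or missing hardware) pushes cloud. A hypothetical helper encoding that rule — the function and its parameters are illustrative, not a HIVE API:

```python
def choose_backend(sensitive_data: bool, needs_offline: bool,
                   needs_top_quality: bool, has_gpu: bool = True) -> str:
    """Pick 'local' or 'cloud' following the decision table above."""
    if sensitive_data or needs_offline:
        return "local"   # hard constraints always win
    if needs_top_quality or not has_gpu:
        return "cloud"   # quality or hardware gaps favor cloud
    return "local"       # otherwise default to local for cost savings
```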
Ollama Setup
Ollama is the recommended way to run local models with HIVE Protocol. It provides a simple interface and supports many popular open-source models.
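Under the hood, HIVE talks to Ollama over its REST API on port 11434; `GET /api/tags` returns the installed models. A minimal sketch of that check in Python, assuming a default local install (the sample payload is abbreviated for illustration):

```python
import json
from urllib.request import urlopen

# Default Ollama endpoint -- the same one HIVE connects to.
OLLAMA_URL = "http://localhost:11434"

def parse_models(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def fetch_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Query a running Ollama server for its installed models."""
    with urlopen(f"{base_url}/api/tags") as resp:
        return parse_models(json.load(resp))

# Abbreviated example of an /api/tags response:
sample = {"models": [{"name": "llama3.1:8b", "size": 4_700_000_000},
                     {"name": "mistral:7b", "size": 4_100_000_000}]}
```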
Installation
macOS
# Install with Homebrew
brew install ollama
# Or download the installer from ollama.com/download
Linux
# Download and install (the install script targets Linux)
curl -fsSL https://ollama.com/install.sh | sh
Windows
# Download from ollama.com/download
# Run the installer
# Ollama will start automatically
Starting Ollama
# Start the Ollama service
ollama serve
# The service runs at http://localhost:11434
Pulling Models
Download models before using them:
# Pull Llama 3.1 (recommended)
ollama pull llama3.1
# Pull specific size
ollama pull llama3.1:8b
ollama pull llama3.1:70b
# Pull other popular models
ollama pull mistral
ollama pull codellama
ollama pull qwen2.5
Testing Ollama
# Quick test
ollama run llama3.1 "Hello, how are you?"
# Interactive mode
ollama run llama3.1
# List installed models
ollama list
Connecting to HIVE Protocol
Configuration
- Go to Settings > Integrations > Local Models
- Enter Ollama endpoint: http://localhost:11434
- Click Test Connection
- Select available models
┌─────────────────────────────────────────────────────────────────┐
│ Settings > Local Models │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Ollama Configuration │
│ │
│ Endpoint: http://localhost:11434 │
│ │
│ Status: [*] Connected │
│ │
│ Available Models: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ [x] llama3.1:8b (4.7 GB) │ │
│ │ [x] mistral:7b (4.1 GB) │ │
│ │ [x] codellama:13b (7.4 GB) │ │
│ │ [ ] qwen2.5:72b (41 GB) - Insufficient RAM │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ [Test Connection] [Save] │
└─────────────────────────────────────────────────────────────────┘
Network Configuration
For remote Ollama servers:
# On the Ollama server, allow external connections
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# In HIVE, configure the remote endpoint
http://your-server-ip:11434
Supported Local Models
Llama 3.1 (Recommended)
Meta's latest open-source model, excellent for general tasks.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| llama3.1:8b | 4.7 GB | 8 GB | Fast responses, general use |
| llama3.1:70b | 40 GB | 48 GB | Higher quality, complex tasks |
ollama pull llama3.1:8b
Strengths:
- Strong reasoning capabilities
- Good instruction following
- Multilingual support
- Excellent for code
Mistral
Fast and capable model from Mistral AI.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| mistral:7b | 4.1 GB | 8 GB | Quick tasks, chat |
| mixtral:8x7b | 26 GB | 32 GB | Complex reasoning |
ollama pull mistral
Strengths:
- Very fast inference
- Good at following instructions
- Strong for European languages
Code Llama
Specialized for coding tasks.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| codellama:7b | 3.8 GB | 8 GB | Code completion |
| codellama:13b | 7.4 GB | 16 GB | Code generation |
| codellama:34b | 19 GB | 24 GB | Complex code tasks |
ollama pull codellama:13b
Strengths:
- Excellent code completion
- Multiple language support
- Code explanation and review
Qwen 2.5
Alibaba's powerful open model.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| qwen2.5:7b | 4.4 GB | 8 GB | General tasks |
| qwen2.5:14b | 8.9 GB | 16 GB | Higher quality |
| qwen2.5:72b | 41 GB | 48 GB | Best quality |
ollama pull qwen2.5:14b
Strengths:
- Strong multilingual (especially Asian languages)
- Good at math and reasoning
- Large context window support
Phi-3
Microsoft's compact but capable model.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| phi3:mini | 2.2 GB | 4 GB | Limited hardware |
| phi3:medium | 7.9 GB | 12 GB | Better quality |
ollama pull phi3:mini
Strengths:
- Runs on minimal hardware
- Surprisingly capable for size
- Fast responses
Hardware Requirements
Minimum Requirements
CPU-Only Operation:
- Modern multi-core CPU (8+ cores recommended)
- 16 GB RAM minimum
- SSD storage
- Expect slow responses (10-60 seconds)
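The RAM guidelines in this section follow a rough rule of thumb: a model needs memory for its weights plus working overhead for the runtime and KV cache. A back-of-envelope estimate — the 1.2 factor and 1 GB context allowance are heuristics, not exact figures:

```python
def estimated_ram_gb(model_file_gb: float, overhead_factor: float = 1.2,
                     context_overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate: weights, ~20% runtime overhead, plus KV cache.

    The overhead factor and context allowance are assumptions for
    illustration; real usage varies with context length and runtime.
    """
    return model_file_gb * overhead_factor + context_overhead_gb

# A 4.7 GB model (llama3.1:8b) lands around 6-7 GB, fitting the 8 GB
# guideline; a 40 GB model needs roughly 49 GB, near the 48 GB guideline.
```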
For 7B Parameter Models:
- 8 GB RAM
- 5 GB disk space
- CPU or entry-level GPU
Recommended for Performance
GPU-Accelerated Operation:
- NVIDIA GPU with 8+ GB VRAM
- 32 GB system RAM
- NVMe SSD
- CUDA 11.8+ drivers
For 13B-34B Parameter Models:
- 16-24 GB VRAM
- 32 GB system RAM
- 20-50 GB disk space
GPU Comparison
| GPU | VRAM | Max Model Size | Performance |
|---|---|---|---|
| RTX 3060 | 12 GB | 7B comfortably | ~20 tokens/sec |
| RTX 3080 | 10 GB | 7B models | ~30 tokens/sec |
| RTX 4080 | 16 GB | 13B models | ~40 tokens/sec |
| RTX 4090 | 24 GB | 34B models | ~50 tokens/sec |
| A100 | 40-80 GB | 70B models | ~100 tokens/sec |
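Throughput figures translate directly into response latency: generation time is roughly output tokens divided by tokens per second (ignoring prompt processing). A quick sketch:

```python
def response_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate generation time; ignores prompt-processing time."""
    return output_tokens / tokens_per_sec

# A 500-token answer at ~20 tok/s (RTX 3060) takes ~25 s;
# at ~50 tok/s (RTX 4090), ~10 s.
```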
Model Size vs Performance
┌─────────────────────────────────────────────────────────────────┐
│ Model Size vs Quality/Speed Tradeoff │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Quality │
│ ^ │
│ │ 70B │
│ │ * * * │
│ │ 34B │
│ │ * * * │
│ │ 13B │
│ │ * * * │
│ │ 7B │
│ │ * * * │
│ │ 3B │
│ │ * * │
│ └─────────────────────────────────────────────────────> Speed │
│ │
│ Smaller models = Faster but lower quality │
│ Larger models = Slower but higher quality │
│ │
└─────────────────────────────────────────────────────────────────┘
Performance Optimization
GPU Configuration
Enable GPU acceleration:
# Check GPU detection
ollama ps
# Force GPU usage
export OLLAMA_NUM_GPU=999
# Limit GPU memory (useful for sharing)
export OLLAMA_GPU_MEMORY=6g
Memory Optimization
# Reduce memory footprint
export OLLAMA_NUM_PARALLEL=1
# Use quantized models (smaller, faster)
ollama pull llama3.1:8b-q4_0
# Keep models loaded for faster response
export OLLAMA_KEEP_ALIVE=60m
Quantization Options
| Quantization | Size Reduction | Quality Impact |
|---|---|---|
| Q8_0 | ~50% | Minimal |
| Q6_K | ~60% | Very slight |
| Q5_K_M | ~65% | Slight |
| Q4_K_M | ~70% | Noticeable |
| Q4_0 | ~75% | More noticeable |
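These reductions follow from bits per weight: FP16 stores 2 bytes per parameter, while Q4 stores roughly half a byte. A quick size estimate, ignoring per-block scale overhead:

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB: parameters x bits / 8."""
    return params_billions * bits_per_weight / 8

# An 8B model: ~16 GB at FP16, ~4 GB at Q4 -- roughly the ~75%
# reduction listed for Q4_0 above.
```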
# Pull quantized version
ollama pull llama3.1:8b-q4_K_M
Using Local Models in HIVE
Agent Configuration
Configure agents to use local models:
Agent: Local Research Assistant
Framework: ollama
Model: llama3.1:8b
Settings:
temperature: 0.7
context_length: 4096
num_predict: 2048
Mixed Model Setup
Use local models for some agents, cloud for others:
Swarm: Hybrid Pipeline
Agents:
- name: Data Processor
framework: ollama
model: llama3.1:8b
# Local for privacy and cost
- name: Quality Reviewer
framework: anthropic
model: claude-sonnet-4-20250514
# Cloud for highest quality
- name: Code Assistant
framework: ollama
model: codellama:13b
# Local for code tasks
Other Local Solutions
LM Studio
GUI-based local model runner.
1. Download from lmstudio.ai
2. Install and launch
3. Download models from built-in browser
4. Start local server
5. Configure HIVE with endpoint: http://localhost:1234/v1
Text Generation WebUI
Advanced interface with many features.
# Clone repository
git clone https://github.com/oobabooga/text-generation-webui
# Run with API
python server.py --api
# Connect HIVE to: http://localhost:5000
llama.cpp
Direct C++ implementation for maximum performance.
# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Run the server (newer llama.cpp builds name the binary llama-server)
./server -m model.gguf --port 8080
Troubleshooting
Ollama Not Starting
Symptom: "Error: could not connect to ollama app"
Solutions:
1. Check if Ollama is running:
ps aux | grep ollama
2. Start Ollama service:
ollama serve
3. Check logs:
journalctl -u ollama (Linux)
cat ~/.ollama/logs/server.log
Model Download Fails
Symptom: "Error pulling model"
Solutions:
1. Check disk space:
df -h
2. Check network connection:
curl -I https://ollama.com
3. Retry the pull (use --insecure only for a private registry without TLS):
ollama pull llama3.1:8b --insecure
Slow Performance
Symptom: Very slow responses
Solutions:
1. Verify GPU is being used:
ollama ps
nvidia-smi
2. Use smaller or quantized model:
ollama pull llama3.1:8b-q4_0
3. Reduce context length in agent settings
4. Check for other GPU processes:
nvidia-smi
Out of Memory
Symptom: "CUDA out of memory" or system freeze
Solutions:
1. Use smaller model variant
2. Use quantized version
3. Reduce context length
4. Close other applications
5. Enable GPU memory limit:
export OLLAMA_GPU_MEMORY=6g
Cost Comparison
Local vs Cloud Costs
Assuming 10 million tokens per month:
| Solution | Initial Cost | Monthly Cost |
|---|---|---|
| Gemini Flash (Cloud) | $0 | $19 |
| GPT-4o Mini (Cloud) | $0 | $38 |
| Claude Haiku (Cloud) | $0 | $75 |
| RTX 4080 + Llama (Local) | $1,200 | $10 (electricity) |
| GPT-4o (Cloud) | $0 | $625 |
| Cloud GPU Rental | $0 | $200-500 |
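These monthly figures set up a break-even comparison: hardware cost divided by monthly savings over the cloud bill. A small helper, assuming the $10/month electricity estimate from the table:

```python
def break_even_months(hardware_cost: float, cloud_monthly: float,
                      local_monthly: float = 10.0) -> float:
    """Months until a local GPU pays for itself versus a cloud bill."""
    savings = cloud_monthly - local_monthly
    if savings <= 0:
        raise ValueError("cloud is already cheaper per month")
    return hardware_cost / savings

# $1,200 GPU vs $38/month cloud: ~43 months.
# $1,200 GPU vs $380/month cloud: ~3 months.
```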
Break-Even Analysis
For GPT-4o Mini equivalent quality:
Cloud cost: $38/month
Local GPU: $1,200 upfront + $10/month electricity
Break-even: ~43 months
For higher volume (100M tokens/month):
Cloud cost: $380/month
Break-even: ~3 months
Best Practices
1. Start Small
Begin with:
1. Small 7B parameter model
2. Test on your hardware
3. Gradually increase size
4. Find your sweet spot
2. Use Appropriate Models
Match model to task:
- Simple tasks: phi3:mini (fast, small)
- General tasks: llama3.1:8b (balanced)
- Code tasks: codellama:13b (specialized)
- Complex tasks: llama3.1:70b (quality)
3. Keep Models Updated
# Update installed models
ollama pull llama3.1:8b
# Review installed models and their last-modified dates
ollama list
4. Monitor Resources
# GPU monitoring
watch -n1 nvidia-smi
# System resources
htop
Related Documentation
- [Supported Models](/docs/models/supported-models): Compare with cloud models
- [API Keys](/docs/models/api-keys): Configure cloud providers
- [Model Configuration](/docs/models/model-configuration): Tune model parameters