Local Models
Run AI models locally on your own hardware using HIVE Protocol. Local models provide privacy, cost savings, and offline capability. This guide covers setup with Ollama, hardware requirements, and optimization tips.
Why Use Local Models?
┌─────────────────────────────────────────────────────────────────┐
│ Local vs Cloud Models │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LOCAL MODELS CLOUD MODELS │
│ ──────────── ──────────── │
│ + Complete privacy + No hardware needed │
│ + No per-token costs + Always latest models │
│ + Works offline + Highest capability │
│ + Data never leaves device + Managed infrastructure │
│ - Requires hardware - Per-token costs │
│ - Limited model selection - Requires internet │
│ - Self-managed - Data sent to provider │
│ │
└─────────────────────────────────────────────────────────────────┘
When to Use Local Models
| Scenario | Local | Cloud |
|---|---|---|
| Sensitive data processing | Yes | Evaluate privacy policies |
| Offline operation required | Yes | No |
| High volume, cost-sensitive | Yes | Consider budget |
| Highest quality outputs needed | Usually no | Yes |
| No GPU available | Possible but slow | Yes |
| Development and testing | Yes | Yes |
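The table above reduces to a simple rule: privacy or offline constraints force local, while top-line quality (or missing hardware) pushes cloud. A hypothetical helper encoding that rule — the function and its parameters are illustrative, not a HIVE API:

```python
def choose_backend(sensitive_data: bool, needs_offline: bool,
                   needs_top_quality: bool, has_gpu: bool = True) -> str:
    """Pick 'local' or 'cloud' following the decision table above."""
    if sensitive_data or needs_offline:
        return "local"   # hard constraints always win
    if needs_top_quality or not has_gpu:
        return "cloud"   # quality or hardware gaps favor cloud
    return "local"       # otherwise default to local for cost savings
```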
Ollama Setup
Ollama is the recommended way to run local models with HIVE Protocol. It provides a simple interface and supports many popular open-source models.
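Under the hood, HIVE talks to Ollama over its REST API on port 11434; `GET /api/tags` returns the installed models. A minimal sketch of that check in Python, assuming a default local install (the sample payload is abbreviated for illustration):

```python
import json
from urllib.request import urlopen

# Default Ollama endpoint -- the same one HIVE connects to.
OLLAMA_URL = "http://localhost:11434"

def parse_models(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def fetch_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Query a running Ollama server for its installed models."""
    with urlopen(f"{base_url}/api/tags") as resp:
        return parse_models(json.load(resp))

# Abbreviated example of an /api/tags response:
sample = {"models": [{"name": "llama3.1:8b", "size": 4_700_000_000},
                     {"name": "mistral:7b", "size": 4_100_000_000}]}
```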
Installation
macOS
# Install with Homebrew
brew install ollama
# Or download the installer from ollama.com/download
Linux
# Download and install (the install script targets Linux)
curl -fsSL https://ollama.com/install.sh | sh
Windows
# Download from ollama.com/download
# Run the installer
# Ollama will start automatically
Starting Ollama
# Start the Ollama service
ollama serve
# The service runs at http://localhost:11434
Pulling Models
Download models before using them:
# Pull Llama 3.1 (recommended)
ollama pull llama3.1
# Pull specific size
ollama pull llama3.1:8b
ollama pull llama3.1:70b
# Pull other popular models
ollama pull mistral
ollama pull codellama
ollama pull qwen2.5
Testing Ollama
# Quick test
ollama run llama3.1 "Hello, how are you?"
# Interactive mode
ollama run llama3.1
# List installed models
ollama list
Connecting to HIVE Protocol
Configuration
- Go to Settings > Integrations > Local Models
- Enter Ollama endpoint: http://localhost:11434
- Click Test Connection
- Select available models
┌─────────────────────────────────────────────────────────────────┐
│ Settings > Local Models │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Ollama Configuration │
│ │
│ Endpoint: http://localhost:11434 │
│ │
│ Status: [*] Connected │
│ │
│ Available Models: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ [x] llama3.1:8b (4.7 GB) │ │
│ │ [x] mistral:7b (4.1 GB) │ │
│ │ [x] codellama:13b (7.4 GB) │ │
│ │ [ ] qwen2.5:72b (41 GB) - Insufficient RAM │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ [Test Connection] [Save] │
└─────────────────────────────────────────────────────────────────┘
Network Configuration
For remote Ollama servers:
# On the Ollama server, allow external connections
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# In HIVE, configure the remote endpoint
http://your-server-ip:11434
Supported Local Models
Llama 3.1 (Recommended)
Meta's latest open-source model, excellent for general tasks.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| llama3.1:8b | 4.7 GB | 8 GB | Fast responses, general use |
| llama3.1:70b | 40 GB | 48 GB | Higher quality, complex tasks |
ollama pull llama3.1:8b
Strengths:
- Strong reasoning capabilities
- Good instruction following
- Multilingual support
- Excellent for code
Mistral
Fast and capable model from Mistral AI.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| mistral:7b | 4.1 GB | 8 GB | Quick tasks, chat |
| mixtral:8x7b | 26 GB | 32 GB | Complex reasoning |
ollama pull mistral
Strengths:
- Very fast inference
- Good at following instructions
- Strong for European languages
Code Llama
Specialized for coding tasks.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| codellama:7b | 3.8 GB | 8 GB | Code completion |
| codellama:13b | 7.4 GB | 16 GB | Code generation |
| codellama:34b | 19 GB | 24 GB | Complex code tasks |
ollama pull codellama:13b
Strengths:
- Excellent code completion
- Multiple language support
- Code explanation and review
Qwen 2.5
Alibaba's powerful open model.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| qwen2.5:7b | 4.4 GB | 8 GB | General tasks |
| qwen2.5:14b | 8.9 GB | 16 GB | Higher quality |
| qwen2.5:72b | 41 GB | 48 GB | Best quality |
ollama pull qwen2.5:14b
Strengths:
- Strong multilingual (especially Asian languages)
- Good at math and reasoning
- Large context window support
Phi-3
Microsoft's compact but capable model.
| Variant | Size | RAM Required | Best For |
|---|---|---|---|
| phi3:mini | 2.2 GB | 4 GB | Limited hardware |
| phi3:medium | 7.9 GB | 12 GB | Better quality |
ollama pull phi3:mini
Strengths:
- Runs on minimal hardware
- Surprisingly capable for size
- Fast responses
Hardware Requirements
Minimum Requirements
CPU-Only Operation:
- Modern multi-core CPU (8+ cores recommended)
- 16 GB RAM minimum
- SSD storage
- Expect slow responses (10-60 seconds)
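The RAM guidelines in this section follow a rough rule of thumb: a model needs memory for its weights plus working overhead for the runtime and KV cache. A back-of-envelope estimate — the 1.2 factor and 1 GB context allowance are heuristics, not exact figures:

```python
def estimated_ram_gb(model_file_gb: float, overhead_factor: float = 1.2,
                     context_overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate: weights, ~20% runtime overhead, plus KV cache.

    The overhead factor and context allowance are assumptions for
    illustration; real usage varies with context length and runtime.
    """
    return model_file_gb * overhead_factor + context_overhead_gb

# A 4.7 GB model (llama3.1:8b) lands around 6-7 GB, fitting the 8 GB
# guideline; a 40 GB model needs roughly 49 GB, near the 48 GB guideline.
```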
For 7B Parameter Models:
- 8 GB RAM
- 5 GB disk space
- CPU or entry-level GPU
Recommended for Performance
GPU-Accelerated Operation:
- NVIDIA GPU with 8+ GB VRAM
- 32 GB system RAM
- NVMe SSD
- CUDA 11.8+ drivers
For 13B-34B Parameter Models:
- 16-24 GB VRAM
- 32 GB system RAM
- 20-50 GB disk space
GPU Comparison
| GPU | VRAM | Max Model Size | Performance |
|---|---|---|---|
| RTX 3060 | 12 GB | 7B comfortably | ~20 tokens/sec |
| RTX 3080 | 10 GB | 7B models | ~30 tokens/sec |
| RTX 4080 | 16 GB | 13B models | ~40 tokens/sec |
| RTX 4090 | 24 GB | 34B models | ~50 tokens/sec |
| A100 | 40-80 GB | 70B models | ~100 tokens/sec |
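Throughput figures translate directly into response latency: generation time is roughly output tokens divided by tokens per second (ignoring prompt processing). A quick sketch:

```python
def response_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate generation time; ignores prompt-processing time."""
    return output_tokens / tokens_per_sec

# A 500-token answer at ~20 tok/s (RTX 3060) takes ~25 s;
# at ~50 tok/s (RTX 4090), ~10 s.
```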
Model Size vs Performance
┌─────────────────────────────────────────────────────────────────┐
│ Model Size vs Quality/Speed Tradeoff │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Quality │
│ ^ │
│ │ 70B │
│ │ * * * │
│ │ 34B │
│ │ * * * │
│ │ 13B │
│ │ * * * │
│ │ 7B │
│ │ * * * │
│ │ 3B │
│ │ * * │
│ └─────────────────────────────────────────────────────> Speed │
│ │
│ Smaller models = Faster but lower quality │
│ Larger models = Slower but higher quality │
│ │
└─────────────────────────────────────────────────────────────────┘
Performance Optimization
GPU Configuration
Enable GPU acceleration:
# Check GPU detection
ollama ps
# Force GPU usage
export OLLAMA_NUM_GPU=999
# Limit GPU memory (useful for sharing)
export OLLAMA_GPU_MEMORY=6g
Memory Optimization
# Reduce memory footprint
export OLLAMA_NUM_PARALLEL=1
# Use quantized models (smaller, faster)
ollama pull llama3.1:8b-q4_0
# Keep models loaded for faster response
export OLLAMA_KEEP_ALIVE=60m
Quantization Options
| Quantization | Size Reduction | Quality Impact |
|---|---|---|
| Q8_0 | ~50% | Minimal |
| Q6_K | ~60% | Very slight |
| Q5_K_M | ~65% | Slight |
| Q4_K_M | ~70% | Noticeable |
| Q4_0 | ~75% | More noticeable |
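These reductions follow from bits per weight: FP16 stores 2 bytes per parameter, while Q4 stores roughly half a byte. A quick size estimate, ignoring per-block scale overhead:

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB: parameters x bits / 8."""
    return params_billions * bits_per_weight / 8

# An 8B model: ~16 GB at FP16, ~4 GB at Q4 -- roughly the ~75%
# reduction listed for Q4_0 above.
```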
# Pull quantized version
ollama pull llama3.1:8b-q4_K_M
Using Local Models in HIVE
Agent Configuration
Configure agents to use local models:
Agent: Local Research Assistant
Framework: ollama
Model: llama3.1:8b
Settings:
temperature: 0.7
context_length: 4096
num_predict: 2048
Mixed Model Setup
Use local models for some agents, cloud for others:
Swarm: Hybrid Pipeline
Agents:
- name: Data Processor
framework: ollama
model: llama3.1:8b
# Local for privacy and cost
- name: Quality Reviewer
framework: anthropic
model: claude-sonnet-4-20250514
# Cloud for highest quality
- name: Code Assistant
framework: ollama
model: codellama:13b
# Local for code tasks
Other Local Solutions
LM Studio
GUI-based local model runner.
1. Download from lmstudio.ai
2. Install and launch
3. Download models from built-in browser
4. Start local server
5. Configure HIVE with endpoint: http://localhost:1234/v1
Text Generation WebUI
Advanced interface with many features.
# Clone repository
git clone https://github.com/oobabooga/text-generation-webui
# Run with API
python server.py --api
# Connect HIVE to: http://localhost:5000
llama.cpp
Direct C++ implementation for maximum performance.
# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Run the server (newer llama.cpp builds name the binary llama-server)
./server -m model.gguf --port 8080
Troubleshooting
Ollama Not Starting
Symptom: "Error: could not connect to ollama app"
Solutions:
1. Check if Ollama is running:
ps aux | grep ollama
2. Start Ollama service:
ollama serve
3. Check logs:
journalctl -u ollama (Linux)
cat ~/.ollama/logs/server.log
Model Download Fails
Symptom: "Error pulling model"
Solutions:
1. Check disk space:
df -h
2. Check network connection:
curl -I https://ollama.com
3. Retry the pull (use --insecure only for a private registry without TLS):
ollama pull llama3.1:8b --insecure
Slow Performance
Symptom: Very slow responses
Solutions:
1. Verify GPU is being used:
ollama ps
nvidia-smi
2. Use smaller or quantized model:
ollama pull llama3.1:8b-q4_0
3. Reduce context length in agent settings
4. Check for other GPU processes:
nvidia-smi
Out of Memory
Symptom: "CUDA out of memory" or system freeze
Solutions:
1. Use smaller model variant
2. Use quantized version
3. Reduce context length
4. Close other applications
5. Enable GPU memory limit:
export OLLAMA_GPU_MEMORY=6g
Cost Comparison
Local vs Cloud Costs
Assuming 10 million tokens per month:
| Solution | Initial Cost | Monthly Cost |
|---|---|---|
| Gemini Flash (Cloud) | $0 | $19 |
| GPT-4o Mini (Cloud) | $0 | $38 |
| Claude Haiku (Cloud) | $0 | $75 |
| RTX 4080 + Llama (Local) | $1,200 | $10 (electricity) |
| GPT-4o (Cloud) | $0 | $625 |
| Cloud GPU Rental | $0 | $200-500 |
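These monthly figures set up a break-even comparison: hardware cost divided by monthly savings over the cloud bill. A small helper, assuming the $10/month electricity estimate from the table:

```python
def break_even_months(hardware_cost: float, cloud_monthly: float,
                      local_monthly: float = 10.0) -> float:
    """Months until a local GPU pays for itself versus a cloud bill."""
    savings = cloud_monthly - local_monthly
    if savings <= 0:
        raise ValueError("cloud is already cheaper per month")
    return hardware_cost / savings

# $1,200 GPU vs $38/month cloud: ~43 months.
# $1,200 GPU vs $380/month cloud: ~3 months.
```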
Break-Even Analysis
For GPT-4o Mini equivalent quality:
Cloud cost: $38/month
Local GPU: $1,200 upfront + $10/month electricity
Break-even: ~43 months
For higher volume (100M tokens/month):
Cloud cost: $380/month
Break-even: ~3 months
Best Practices
1. Start Small
Begin with:
1. Small 7B parameter model
2. Test on your hardware
3. Gradually increase size
4. Find your sweet spot
2. Use Appropriate Models
Match model to task:
- Simple tasks: phi3:mini (fast, small)
- General tasks: llama3.1:8b (balanced)
- Code tasks: codellama:13b (specialized)
- Complex tasks: llama3.1:70b (quality)
3. Keep Models Updated
# Update installed models
ollama pull llama3.1:8b
# Review installed models and their last-modified dates
ollama list
4. Monitor Resources
# GPU monitoring
watch -n1 nvidia-smi
# System resources
htop
Related Documentation
- [Supported Models](/docs/models/supported-models): Compare with cloud models
- [API Keys](/docs/models/api-keys): Configure cloud providers
- [Model Configuration](/docs/models/model-configuration): Tune model parameters