Local Models

Run AI models locally on your own hardware using HIVE Protocol. Local models provide privacy, cost savings, and offline capability. This guide covers setup with Ollama, hardware requirements, and optimization tips.

Why Use Local Models?

┌─────────────────────────────────────────────────────────────────┐
│                    Local vs Cloud Models                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  LOCAL MODELS                    CLOUD MODELS                   │
│  ────────────                    ────────────                   │
│  + Complete privacy              + No hardware needed           │
│  + No per-token costs            + Always latest models        │
│  + Works offline                 + Highest capability          │
│  + Data never leaves device      + Managed infrastructure      │
│  - Requires hardware             - Per-token costs             │
│  - Limited model selection       - Requires internet           │
│  - Self-managed                  - Data sent to provider       │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

When to Use Local Models

Scenario                         Local               Cloud
Sensitive data processing        Yes                 Evaluate privacy policies
Offline operation required       Yes                 No
High volume, cost-sensitive      Yes                 Consider budget
Highest quality outputs needed   Usually no          Yes
No GPU available                 Possible but slow   Yes
Development and testing          Yes                 Yes

Ollama Setup

Ollama is the recommended way to run local models with HIVE Protocol. It provides a simple interface and supports many popular open-source models.

Installation

macOS

# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Or use Homebrew
brew install ollama

Linux

# Download and install
curl -fsSL https://ollama.com/install.sh | sh

Windows

# Download from ollama.com/download
# Run the installer
# Ollama will start automatically

Starting Ollama

# Start the Ollama service
ollama serve

# The service runs at http://localhost:11434

Pulling Models

Download models before using them:

# Pull Llama 3.1 (recommended)
ollama pull llama3.1

# Pull specific size
ollama pull llama3.1:8b
ollama pull llama3.1:70b

# Pull other popular models
ollama pull mistral
ollama pull codellama
ollama pull qwen2.5

Testing Ollama

# Quick test
ollama run llama3.1 "Hello, how are you?"

# Interactive mode
ollama run llama3.1

# List installed models
ollama list

Connecting to HIVE Protocol

Configuration

  1. Go to Settings > Integrations > Local Models
  2. Enter Ollama endpoint: http://localhost:11434
  3. Click Test Connection
  4. Select available models

┌─────────────────────────────────────────────────────────────────┐
│  Settings > Local Models                                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Ollama Configuration                                            │
│                                                                  │
│  Endpoint: http://localhost:11434                                │
│                                                                  │
│  Status: [*] Connected                                          │
│                                                                  │
│  Available Models:                                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ [x] llama3.1:8b          (4.7 GB)                       │    │
│  │ [x] mistral:7b           (4.1 GB)                       │    │
│  │ [x] codellama:13b        (7.4 GB)                       │    │
│  │ [ ] qwen2.5:72b          (41 GB) - Insufficient RAM     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│                                      [Test Connection] [Save]   │
└─────────────────────────────────────────────────────────────────┘

Network Configuration

For remote Ollama servers:

# On the Ollama server, allow external connections
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# In HIVE, configure the remote endpoint
http://your-server-ip:11434
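
On Linux installs where Ollama runs as a systemd service, the environment variable can be set persistently with a drop-in override instead of launching ollama serve by hand (a sketch; the file path follows systemd's standard override location, typically created with sudo systemctl edit ollama):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Run sudo systemctl daemon-reload && sudo systemctl restart ollama afterwards for the change to take effect.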

Supported Local Models

Llama 3.1

Meta's latest open-source model, excellent for general tasks.

Variant        Size     RAM Required   Best For
llama3.1:8b    4.7 GB   8 GB           Fast responses, general use
llama3.1:70b   40 GB    48 GB          Higher quality, complex tasks

ollama pull llama3.1:8b

Strengths:

  • Strong reasoning capabilities
  • Good instruction following
  • Multilingual support
  • Excellent for code

Mistral

Fast and capable model from Mistral AI.

Variant        Size     RAM Required   Best For
mistral:7b     4.1 GB   8 GB           Quick tasks, chat
mixtral:8x7b   26 GB    32 GB          Complex reasoning

ollama pull mistral

Strengths:

  • Very fast inference
  • Good at following instructions
  • Strong for European languages

Code Llama

Specialized for coding tasks.

Variant         Size     RAM Required   Best For
codellama:7b    3.8 GB   8 GB           Code completion
codellama:13b   7.4 GB   16 GB          Code generation
codellama:34b   19 GB    24 GB          Complex code tasks

ollama pull codellama:13b

Strengths:

  • Excellent code completion
  • Multiple language support
  • Code explanation and review

Qwen 2.5

Alibaba's powerful open model.

Variant       Size     RAM Required   Best For
qwen2.5:7b    4.4 GB   8 GB           General tasks
qwen2.5:14b   8.9 GB   16 GB          Higher quality
qwen2.5:72b   41 GB    48 GB          Best quality

ollama pull qwen2.5:14b

Strengths:

  • Strong multilingual (especially Asian languages)
  • Good at math and reasoning
  • Large context window support

Phi-3

Microsoft's compact but capable model.

Variant       Size     RAM Required   Best For
phi3:mini     2.2 GB   4 GB           Limited hardware
phi3:medium   7.9 GB   12 GB          Better quality

ollama pull phi3:mini

Strengths:

  • Runs on minimal hardware
  • Surprisingly capable for size
  • Fast responses

Hardware Requirements

Minimum Requirements

CPU-Only Operation:
  - Modern multi-core CPU (8+ cores recommended)
  - 16 GB RAM minimum
  - SSD storage
  - Expect slow responses (10-60 seconds)

For 7B Parameter Models:
  - 8 GB RAM
  - 5 GB disk space
  - CPU or entry-level GPU

Recommended Configuration

GPU-Accelerated Operation:
  - NVIDIA GPU with 8+ GB VRAM
  - 32 GB system RAM
  - NVMe SSD
  - CUDA 11.8+ drivers

For 13B-34B Parameter Models:
  - 16-24 GB VRAM
  - 32 GB system RAM
  - 20-50 GB disk space

GPU Comparison

GPU        VRAM       Max Model Size   Performance
RTX 3060   12 GB      7B comfortably   ~20 tokens/sec
RTX 3080   10 GB      7B models        ~30 tokens/sec
RTX 4080   16 GB      13B models       ~40 tokens/sec
RTX 4090   24 GB      34B models       ~50 tokens/sec
A100       40-80 GB   70B models       ~100 tokens/sec
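
The memory a model needs can be estimated from its parameter count: roughly bytes-per-weight times parameters (in billions), plus overhead for the KV cache and runtime buffers. A rough sketch (the ~0.6 bytes/weight figure for 4-bit quantization and the 1.5 GB overhead are ballpark assumptions, not exact values):

```shell
# Rough VRAM/RAM estimate: params (billions) x bytes per weight + overhead.
# Bytes per weight: ~2.0 for fp16, ~0.6 for 4-bit quantization (assumption).
estimate_vram() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f GB\n", p * b + 1.5 }'
}

estimate_vram 8 0.6    # llama3.1:8b, 4-bit
estimate_vram 70 0.6   # llama3.1:70b, 4-bit
estimate_vram 8 2.0    # llama3.1:8b, unquantized fp16
```

The 70B estimate lands in the same ballpark as the 48 GB RAM requirement in the Llama 3.1 table above.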

Model Size vs Performance

┌─────────────────────────────────────────────────────────────────┐
│  Model Size vs Quality/Speed Tradeoff                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Quality                                                         │
│    ^                                                             │
│    │                                          70B               │
│    │                                       * * *                │
│    │                                   34B                       │
│    │                              * * *                         │
│    │                         13B                                 │
│    │                     * * *                                  │
│    │                7B                                          │
│    │           * * *                                            │
│    │      3B                                                    │
│    │  * *                                                       │
│    └─────────────────────────────────────────────────────> Speed │
│                                                                  │
│  Smaller models = Faster but lower quality                      │
│  Larger models = Slower but higher quality                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Performance Optimization

GPU Configuration

Enable GPU acceleration:

# Check GPU detection
ollama ps

# Force GPU usage
export OLLAMA_NUM_GPU=999

# Limit GPU memory (useful for sharing)
export OLLAMA_GPU_MEMORY=6g

Memory Optimization

# Reduce memory footprint
export OLLAMA_NUM_PARALLEL=1

# Use quantized models (smaller, faster)
ollama pull llama3.1:8b-q4_0

# Keep models loaded for faster response
export OLLAMA_KEEP_ALIVE=60m

Quantization Options

Quantization   Size Reduction   Quality Impact
Q8_0           ~50%             Minimal
Q6_K           ~60%             Very slight
Q5_K_M         ~65%             Slight
Q4_K_M         ~70%             Noticeable
Q4_0           ~75%             More noticeable

# Pull quantized version
ollama pull llama3.1:8b-q4_K_M
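
The reduction figures translate directly into download sizes. A quick sketch (the 16 GB fp16 baseline for an 8B model is an assumption: 8 billion weights at 2 bytes each):

```shell
# On-disk size after quantization: fp16 size x (1 - reduction).
quantized_size() {
  awk -v full="$1" -v cut="$2" 'BEGIN { printf "%.1f GB\n", full * (1 - cut) }'
}

# llama3.1:8b is ~16 GB at fp16; Q4_K_M cuts roughly 70%:
quantized_size 16 0.70
```

This is consistent with the ~4.7 GB size of the default llama3.1:8b tag, which is itself a 4-bit quantization.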

Using Local Models in HIVE

Agent Configuration

Configure agents to use local models:

Agent: Local Research Assistant
Framework: ollama
Model: llama3.1:8b
Settings:
  temperature: 0.7
  context_length: 4096
  num_predict: 2048
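
These settings correspond to Ollama's generation options. A request against Ollama's /api/generate endpoint would carry them in the options object (a sketch: the prompt is invented, and note that Ollama names the context window num_ctx):

```json
{
  "model": "llama3.1:8b",
  "prompt": "Summarize the findings in three bullet points.",
  "options": {
    "temperature": 0.7,
    "num_ctx": 4096,
    "num_predict": 2048
  }
}
```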

Mixed Model Setup

Use local models for some agents, cloud for others:

Swarm: Hybrid Pipeline
Agents:
  - name: Data Processor
    framework: ollama
    model: llama3.1:8b
    # Local for privacy and cost

  - name: Quality Reviewer
    framework: anthropic
    model: claude-sonnet-4-20250514
    # Cloud for highest quality

  - name: Code Assistant
    framework: ollama
    model: codellama:13b
    # Local for code tasks

Other Local Solutions

LM Studio

GUI-based local model runner.

1. Download from lmstudio.ai
2. Install and launch
3. Download models from built-in browser
4. Start local server
5. Configure HIVE with endpoint: http://localhost:1234/v1

Text Generation WebUI

Advanced interface with many features.

# Clone repository
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui

# Run with API
python server.py --api

# Connect HIVE to: http://localhost:5000

llama.cpp

Direct C++ implementation for maximum performance.

# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Run server
./server -m model.gguf --port 8080

Troubleshooting

Ollama Not Starting

Symptom: "Error: could not connect to ollama app"

Solutions:
1. Check if Ollama is running:
   ps aux | grep ollama

2. Start Ollama service:
   ollama serve

3. Check logs:
   journalctl -u ollama (Linux)
   cat ~/.ollama/logs/server.log

Model Download Fails

Symptom: "Error pulling model"

Solutions:
1. Check disk space:
   df -h

2. Check network connection:
   curl -I https://ollama.com

3. Try alternative model source:
   ollama pull llama3.1:8b --insecure

Slow Performance

Symptom: Very slow responses

Solutions:
1. Verify GPU is being used:
   ollama ps
   nvidia-smi

2. Use smaller or quantized model:
   ollama pull llama3.1:8b-q4_0

3. Reduce context length in agent settings

4. Check for other GPU processes:
   nvidia-smi

Out of Memory

Symptom: "CUDA out of memory" or system freeze

Solutions:
1. Use smaller model variant
2. Use quantized version
3. Reduce context length
4. Close other applications
5. Enable GPU memory limit:
   export OLLAMA_GPU_MEMORY=6g

Cost Comparison

Local vs Cloud Costs

Assuming 10 million tokens per month:

Solution                   Initial Cost   Monthly Cost
Gemini Flash (Cloud)       $0             $19
GPT-4o Mini (Cloud)        $0             $38
Claude Haiku (Cloud)       $0             $75
GPT-4o (Cloud)             $0             $625
Cloud GPU Rental           $0             $200-500
RTX 4080 + Llama (Local)   $1,200         $10 (electricity)

Break-Even Analysis

For GPT-4o Mini equivalent quality:

Cloud cost: $38/month
Local GPU: $1,200 upfront + $10/month electricity

Break-even: ~43 months

For higher volume (100M tokens/month):
Cloud cost: $380/month
Break-even: ~3 months
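
The arithmetic generalizes to any price point: months to break even = upfront hardware cost / (cloud monthly cost - local electricity cost). As a quick sketch:

```shell
# Months until a local GPU pays for itself versus a cloud API bill.
break_even_months() {
  awk -v up="$1" -v cloud="$2" -v elec="$3" \
    'BEGIN { printf "%.0f\n", up / (cloud - elec) }'
}

break_even_months 1200 38 10    # vs GPT-4o Mini at 10M tokens/month
break_even_months 1200 380 10   # vs GPT-4o Mini at 100M tokens/month
```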

Best Practices

1. Start Small

Begin with:
1. Small 7B parameter model
2. Test on your hardware
3. Gradually increase size
4. Find your sweet spot

2. Use Appropriate Models

Match model to task:
- Simple tasks: phi3:mini (fast, small)
- General tasks: llama3.1:8b (balanced)
- Code tasks: codellama:13b (specialized)
- Complex tasks: llama3.1:70b (quality)

3. Keep Models Updated

# Update installed models
ollama pull llama3.1:8b

# Check for updates regularly
ollama list

4. Monitor Resources

# GPU monitoring
watch -n1 nvidia-smi

# System resources
htop

Related Documentation

  • [Supported Models](/docs/models/supported-models): Compare with cloud models
  • [API Keys](/docs/models/api-keys): Configure cloud providers
  • [Model Configuration](/docs/models/model-configuration): Tune model parameters
