Ollama

platform
https://ollama.com

Ollama makes local LLM deployment simple. Install it, pull a model, and you have a running inference server. It supports dozens of open-source models including Llama, Mistral, Qwen, DeepSeek, and Gemma.
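That workflow can be sketched in a few commands (the model name `llama3.2` is just an example, and the snippet is guarded so it no-ops on machines without Ollama installed):

```shell
# Quickstart: pull a model and query it. The `if` guard makes this a
# no-op when the `ollama` binary is not on PATH.
if command -v ollama >/dev/null 2>&1; then
  ollama pull llama3.2                         # download the model weights
  ollama run llama3.2 "Why is the sky blue?"   # one-shot generation
fi
OLLAMA_AVAILABLE=$(command -v ollama >/dev/null 2>&1 && echo yes || echo no)
```

Installing Ollama also sets up a background server; `ollama serve` starts it manually if needed.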

The tool handles model management (downloading, updating, quantization), GPU acceleration (NVIDIA CUDA, Apple Metal), and serving via both a native API and an OpenAI-compatible endpoint. This makes it easy to integrate with existing tools and frameworks.

For agent systems, Ollama provides a privacy-preserving, zero-cost-per-token option for simpler tasks. Combined with cloud models for complex reasoning, it enables hybrid architectures that balance cost, privacy, and capability.
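One way such a hybrid setup might route requests (the model names and the routing heuristic here are illustrative assumptions, not part of Ollama):

```python
# Hypothetical routing policy for a hybrid local/cloud agent setup:
# privacy-sensitive or routine tasks stay on the local Ollama model,
# complex reasoning goes to a hosted frontier model.
LOCAL_MODEL = "llama3.2"        # served locally by Ollama, zero cost per token
CLOUD_MODEL = "frontier-model"  # placeholder name for a hosted API model

def route(task: str, *, sensitive: bool = False,
          complex_reasoning: bool = False) -> str:
    """Pick a model for a task: keep data local unless the task both
    needs frontier-level reasoning and is safe to send to the cloud."""
    if sensitive or not complex_reasoning:
        return LOCAL_MODEL
    return CLOUD_MODEL
```

The heuristic could just as well be a token-budget check or a classifier; the point is that the dispatch decision is a single, testable function.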

Key Features

  • One-command model download and serving
  • Native and OpenAI-compatible APIs
  • GPU acceleration (NVIDIA, Apple Silicon)
  • Model customization via Modelfiles
  • Supports dozens of open-source models
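Customization works through a Modelfile, a small config file layered on top of a base model. A minimal sketch (base model, parameter value, and system prompt are example choices):

```
FROM llama3.2
PARAMETER temperature 0.3
SYSTEM """You are a concise assistant for code review."""
```

Building and running it: `ollama create reviewer -f Modelfile`, then `ollama run reviewer`.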

Pros

  • Complete data privacy
  • Zero per-token cost
  • No internet dependency
  • Simple CLI interface

Cons

  • Requires capable hardware
  • Local models trail cloud frontier models in capability
  • Tool-use reliability is limited
