Unified MCP server for managing local model runtimes (Ollama, LM Studio, and more): provider-agnostic discovery, lifecycle, hardware-fit, and delegated inference.
An operations-first control plane for the models on your own machine. It discovers and inspects local runtimes, checks whether models fit your hardware, and manages them over their local HTTP APIs, exposing one consistent tool surface across them. It runs over stdio only and acts purely as a client to your runtimes; it never opens a network listener of its own.
The complete and embed tools delegate (offload) inference to a local model for cost control and privacy, keeping tokens and data on your hardware instead of sending them to a hosted API. They are inference primitives, not a conversational chat surface.
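As a rough illustration of what delegated inference involves at the HTTP level, the sketch below builds requests for the OpenAI-compatible `/v1/completions` and `/v1/embeddings` endpoints that Ollama and LM Studio both expose. The type and function names, the model name, and the `max_tokens` default are hypothetical; this is not the server's actual implementation, only one plausible shape for it.

```typescript
// Hypothetical sketch: names and defaults here are illustrative assumptions.
interface DelegatedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Strip trailing slashes so paths can be appended safely.
function baseUrl(host: string): string {
  return host.replace(/\/+$/, "");
}

// Build a non-streaming text completion request (OpenAI-compatible API).
function buildCompletionRequest(
  host: string,
  model: string,
  prompt: string,
  maxTokens: number = 256, // assumed default, not taken from the project
): DelegatedRequest {
  return {
    url: baseUrl(host) + "/v1/completions",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, max_tokens: maxTokens, stream: false }),
  };
}

// Build an embeddings request for one or more input strings.
function buildEmbeddingRequest(
  host: string,
  model: string,
  input: string | string[],
): DelegatedRequest {
  return {
    url: baseUrl(host) + "/v1/embeddings",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input }),
  };
}

// "llama3.2" is only an example model name.
const req = buildCompletionRequest("http://localhost:11434/", "llama3.2", "Summarize this text.");
console.log(req.url);
```

The resulting object can be handed to any HTTP client. Because the request only ever targets a local host, the prompt and the generated tokens stay on the machine.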
Each runtime is an adapter behind a single Provider interface. Every tool takes an optional provider argument; omit it and the tool operates across all detected runtimes.
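The adapter pattern described above can be sketched as follows. The `Provider` shape, the `ModelInfo` type, and the helper names are assumptions made for illustration; the project's real interface may differ. The point is only to show how an optional `provider` argument narrows the operation to one runtime, while omitting it fans out across every detected runtime.

```typescript
// Hypothetical shapes for illustration; the real Provider interface may differ.
interface ModelInfo {
  name: string;
  sizeBytes: number;
}

interface Provider {
  id: string;
  listModels(): Promise<ModelInfo[]>;
}

// Use one provider when `provider` is given; otherwise use all detected ones.
function selectProviders(detected: Provider[], provider?: string): Provider[] {
  if (provider === undefined) return detected;
  const match = detected.filter((p) => p.id === provider);
  if (match.length === 0) throw new Error("Unknown provider: " + provider);
  return match;
}

// Collect installed models per provider id.
async function listModelsAcross(
  detected: Provider[],
  provider?: string,
): Promise<Record<string, ModelInfo[]>> {
  const targets = selectProviders(detected, provider);
  const lists = await Promise.all(targets.map((p) => p.listModels()));
  const byProvider: Record<string, ModelInfo[]> = {};
  targets.forEach((p, i) => {
    byProvider[p.id] = lists[i];
  });
  return byProvider;
}

// Stub adapters standing in for real runtime clients.
const stubOllama: Provider = {
  id: "ollama",
  listModels: async () => [{ name: "example-model-a", sizeBytes: 2e9 }],
};
const stubLmStudio: Provider = {
  id: "lmstudio",
  listModels: async () => [{ name: "example-model-b", sizeBytes: 4e9 }],
};
const detectedProviders = [stubOllama, stubLmStudio];

listModelsAcross(detectedProviders).then((all) => console.log(Object.keys(all)));
```

Keeping the selection logic separate from each adapter means a new runtime only has to implement the interface; the fan-out behavior comes for free.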
| Adapter | Default host | Notes |
|---|---|---|
| Ollama | http://localhost:11434 | Native REST + OpenAI-compatible /v1; load/unload via keep_alive. |
| LM Studio | http://localhost:1234 | REST /api/v0 + OpenAI-compatible; lms CLI for load/unload/pull when present. |
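
For the Ollama row, loading and unloading through `keep_alive` can be sketched as below. Ollama's native API loads a model when `/api/generate` receives a request with no prompt, and unloads it when `keep_alive` is `0`. The function names and the `"5m"` default are choices made for this example, and whether this server's adapter issues exactly these calls is an assumption.

```typescript
// Sketch of Ollama's documented keep_alive behavior; helper names are hypothetical.
type KeepAlive = number | string; // seconds, or a duration string such as "10m"

interface OllamaRequest {
  url: string;
  method: "POST";
  body: string;
}

function ollamaGenerateUrl(host: string): string {
  return host.replace(/\/+$/, "") + "/api/generate";
}

// A generate request with no prompt loads the model and keeps it
// resident for the given keep_alive period.
function ollamaLoadRequest(
  host: string,
  model: string,
  keepAlive: KeepAlive = "5m", // example default chosen for this sketch
): OllamaRequest {
  return {
    url: ollamaGenerateUrl(host),
    method: "POST",
    body: JSON.stringify({ model, keep_alive: keepAlive }),
  };
}

// keep_alive: 0 asks Ollama to evict the model immediately.
function ollamaUnloadRequest(host: string, model: string): OllamaRequest {
  return {
    url: ollamaGenerateUrl(host),
    method: "POST",
    body: JSON.stringify({ model, keep_alive: 0 }),
  };
}

// "llama3.2" is only an example model name.
const load = ollamaLoadRequest("http://localhost:11434", "llama3.2", "10m");
const unload = ollamaUnloadRequest("http://localhost:11434", "llama3.2");
console.log(load.body);
console.log(unload.body);
```

A negative `keep_alive` value keeps the model loaded indefinitely, which can be useful before a batch of delegated requests.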
- Runtimes, host, live status, capabilities.
- Installed models across providers.
- Models resident in memory.
- Detailed model metadata.
- Download a model (multiple GB).
- Delete a model; requires confirm.
- Load a model into memory.
- Evict a model from memory.
- Liveness and version per provider.
- RAM, CPU, and GPU/VRAM.
- Does a model fit in VRAM or RAM?
- Latency and tokens/sec.
- Search a curated model catalog.
- Recommend by task and hardware fit.
- Delegate a completion (offload).
- Delegate embedding generation.
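
To make the hardware-fit question concrete, here is a minimal sketch under a deliberately simple assumption: a model needs its file size plus a fixed fractional overhead (for the KV cache and runtime buffers), and that figure is compared first against free VRAM and then against free RAM. The 20% overhead, the type names, and the function name are all hypothetical; the server's actual fit logic is not described in this document and may be more detailed.

```typescript
// Hypothetical fit heuristic; the overhead factor is an assumption, not a measured value.
type FitVerdict = "vram" | "ram" | "none";

interface Hardware {
  freeVramBytes: number;
  freeRamBytes: number;
}

function estimateFit(modelBytes: number, hw: Hardware, overhead: number = 0.2): FitVerdict {
  const needed = modelBytes * (1 + overhead);
  if (needed <= hw.freeVramBytes) return "vram"; // fully GPU-resident
  if (needed <= hw.freeRamBytes) return "ram"; // falls back to system memory
  return "none";
}

const GIB = 1024 * 1024 * 1024;
const hw: Hardware = { freeVramBytes: 8 * GIB, freeRamBytes: 32 * GIB };

console.log(estimateFit(4 * GIB, hw)); // needs 4.8 GiB -> "vram"
console.log(estimateFit(20 * GIB, hw)); // needs 24 GiB -> "ram"
console.log(estimateFit(40 * GIB, hw)); // needs 48 GiB -> "none"
```

A real check would likely also account for context length and partial GPU offload, which this sketch ignores.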
npx @tmhs/local-ai-mcp
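Configure via environment variables: OLLAMA_HOST and LMSTUDIO_HOST set the runtime hosts, and LOCAL_AI_REQUEST_TIMEOUT_MS and LOCAL_AI_DETECT_TIMEOUT_MS set the request and detection timeouts in milliseconds.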
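The sketch below shows one way such variables might be read, with fallbacks. The host fallbacks are the default hosts from the adapter table; the timeout fallbacks (30000 ms and 2000 ms) are placeholders invented for this example, since the actual defaults are not stated here. The `LocalAiConfig` type and `readConfig` function are likewise hypothetical.

```typescript
// Hypothetical config reader; timeout fallbacks are placeholders, not project defaults.
interface LocalAiConfig {
  ollamaHost: string;
  lmStudioHost: string;
  requestTimeoutMs: number;
  detectTimeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Parse a positive number of milliseconds, or fall back when unset or invalid.
function parseMs(raw: string | undefined, fallback: number): number {
  const n = Number(raw);
  return raw !== undefined && isFinite(n) && n > 0 ? n : fallback;
}

function readConfig(env: Env): LocalAiConfig {
  return {
    ollamaHost: env.OLLAMA_HOST || "http://localhost:11434",
    lmStudioHost: env.LMSTUDIO_HOST || "http://localhost:1234",
    requestTimeoutMs: parseMs(env.LOCAL_AI_REQUEST_TIMEOUT_MS, 30000), // placeholder fallback
    detectTimeoutMs: parseMs(env.LOCAL_AI_DETECT_TIMEOUT_MS, 2000), // placeholder fallback
  };
}

const config = readConfig({
  OLLAMA_HOST: "http://192.168.1.20:11434", // example address only
  LOCAL_AI_REQUEST_TIMEOUT_MS: "60000",
});
console.log(config);
```

In Node.js the same function could be called with `process.env`. Taking the environment as a parameter keeps the parsing easy to test.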