Local AI MCP

Unified MCP server for managing local model runtimes (Ollama, LM Studio, and more): provider-agnostic discovery, lifecycle, hardware-fit, and delegated inference.

npm

What it is

An operations-first control plane for the models on your own machine. It discovers, inspects, fits, and manages local runtimes over their local HTTP APIs, exposing one consistent tool surface across them. It runs over stdio only and is a client to your runtimes — it never opens a network listener of its own. Hosted providers (such as Moonshot AI) are an explicit opt-in behind the same tool surface.

Inference is delegation, not chat

The complete and embed tools delegate (offload) inference to a local model for cost control and privacy — keeping tokens and data on your hardware instead of a hosted API. They are inference primitives, not a conversational chat surface. complete streams tokens via MCP progress notifications when the client supplies a progress token.

Provider-adapter model

Each runtime is an adapter behind a single Provider interface. Every tool takes an optional provider argument; omit it and the tool operates across all detected runtimes.

Adapter	Default host	Notes
Ollama	`http://localhost:11434`	Native REST + OpenAI-compatible `/v1`; load/unload via `keep_alive`.
LM Studio	`http://localhost:1234`	REST `/api/v0` + OpenAI-compatible; `lms` CLI for load/unload/pull when present.
llama.cpp	`http://localhost:8080`	Native `/health` `/props` `/slots` + OpenAI `/v1`.
OpenAI-compat	unset	Opt-in via `OPENAI_COMPAT_HOST` (vLLM, Jan, etc.).
Moonshot AI (Kimi)	`https://api.moonshot.ai/v1`	Hosted OpenAI-compatible; requires `MOONSHOT_API_KEY`. `complete`/`listModels` only.

Tools (16)

list_providers

Runtimes, host, live status, capabilities.

list_models

Installed models across providers.

list_loaded

Models resident in memory.

model_info

Detailed model metadata.

pull_model heavy

Download a model (multiple GB).

remove_model destructive

Delete a model; requires confirm.

load_model

Load a model into memory.

unload_model

Evict a model from memory.

health_check

Liveness and version per provider.

system_resources

RAM, CPU, and GPU/VRAM.

fit_check

Does weight + KV cache fit in VRAM or RAM?

benchmark heavy

Latency and tokens/sec.

search_available

Search a curated model catalog.

suggest_model

Recommend by task and hardware fit.

complete

Delegate a completion (streaming).

embed

Delegate embedding generation.

Install

npx @tmhs/local-ai-mcp

Environment variables

OLLAMA_HOST — default http://localhost:11434
LMSTUDIO_HOST — default http://localhost:1234
LLAMACPP_HOST — default http://localhost:8080
OPENAI_COMPAT_HOST — optional generic OpenAI-compatible /v1 host
OPENAI_COMPAT_API_KEY — optional Bearer for OpenAI-compat
MOONSHOT_API_KEY — required to enable the Moonshot adapter
MOONSHOT_HOST — default https://api.moonshot.ai/v1
LOCAL_AI_REQUEST_TIMEOUT_MS — default 120000
LOCAL_AI_DETECT_TIMEOUT_MS — default 1500
LOCAL_AI_PULL_TIMEOUT_MS — default 3600000 (1 hour)