feat: add lightrag-mcp MCP server + agent tooling

- Add AGENTS.md with repo guidelines
- Add lightrag-mcp: FastMCP server exposing insert_documents() + query_documents() to LLM agents via stdio transport, talks to LightRAG REST API
- Add scripts/patch-vllm-cpu.py for CPU inference patching
- Add .env.vllm for vLLM configuration
- Update flake.nix with expanded dev shell
- Update .env.lightrag
- Remove CLAUDE.md (replaced by AGENTS.md)
2026-04-19 21:46:47 +08:00
parent c5dc2cf637
commit 4495a3cc62
14 changed files with 3510 additions and 100 deletions
@@ -1,18 +1,19 @@
-# LLM via Ollama
-LLM_BINDING=ollama
-LLM_MODEL=qwen3:0.6b
-LLM_BINDING_HOST=http://localhost:11434
+LLM_BINDING=openai
+LLM_MODEL=minimax/minimax-m2.7
+LLM_BINDING_HOST=https://openrouter.ai/api/v1
+LLM_BINDING_API_KEY=your-openrouter-api-key-here

-# Embeddings via Ollama
+# Embeddings via Ollama (port 11434)
 EMBEDDING_BINDING=ollama
-EMBEDDING_MODEL=qwen3-embedding:0.6b
+EMBEDDING_MODEL=qwen3-embedding:4b
 EMBEDDING_BINDING_HOST=http://localhost:11434
-EMBEDDING_DIM=1024
+EMBEDDING_API_KEY=
+EMBEDDING_DIM=2560

 # Storage (local files)
 RAG_DIR=./rag_storage

-# Timeouts (in seconds) — increase for large local models
+# Timeouts (in seconds)
 EMBEDDING_TIMEOUT=60
 TIMEOUT=60
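
Review note: this hunk raises `EMBEDDING_DIM` from 1024 to 2560 alongside the switch to `qwen3-embedding:4b`, and the value has to match the width of the vectors the model actually returns. Below is a minimal sketch of that check, assuming the Ollama `/api/embed` response is a JSON object whose `embeddings` field holds a list of vectors (the shape the one-liner in the removed CLAUDE.md relies on). `embedding_dim` and `dim_matches` are hypothetical helper names, and the sample response is synthetic.

```python
def embedding_dim(response: dict) -> int:
    """Return the length of the first vector in an /api/embed style response."""
    vectors = response.get("embeddings") or []
    if not vectors:
        raise ValueError("response contains no embeddings")
    return len(vectors[0])


def dim_matches(response: dict, expected_dim: int) -> bool:
    """True when the model output width equals the configured EMBEDDING_DIM."""
    return embedding_dim(response) == expected_dim


if __name__ == "__main__":
    # Synthetic stand-in; a real check would POST to /api/embed and pass r.json().
    fake = {"embeddings": [[0.0] * 2560]}
    print(dim_matches(fake, 2560))
```

Keeping the comparison separate from the HTTP call makes it usable both in a startup script and in an offline test.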

@@ -0,0 +1,13 @@
+# vllm server configuration
+# Used by: nix run .#vllm-start-llm  and  nix run .#vllm-start-embed
+
+# Force CPU backend — no CUDA/ROCm GPU on this machine
+VLLM_TARGET_DEVICE=cpu
+
+VLLM_LLM_MODEL=Qwen/Qwen3-0.6B
+VLLM_LLM_PORT=8000
+# VLLM_LLM_EXTRA_ARGS=--dtype bfloat16 --max-model-len 4096
+
+VLLM_EMBED_MODEL=Qwen/Qwen3-Embedding-0.6B
+VLLM_EMBED_PORT=8001
+# VLLM_EMBED_EXTRA_ARGS=--dtype bfloat16
@@ -0,0 +1,66 @@
+# RAGS
+
+Private learning tool. Ingest study materials → knowledge graph → query → export Anki flashcards.
+
+Two systems:
+- **LightRAG** (`lightrag/`) — graph-based RAG server (primary interface)
+- **Graphiti** (`graphiti/`) — temporal knowledge graph library (Python library only, needs Neo4j)
+
+## Quick Start
+
+```sh
+# Ollama must be running first on :11434 with:
+#   qwen3-embedding:4b  (embeddings)
+# The LLM is reached through OpenRouter (see .env.lightrag)
+
+# Start LightRAG only (embeddings served by Ollama, LLM by OpenRouter)
+nix run .#start
+# → http://localhost:9621/webui  (React frontend)
+# → http://localhost:9621/docs   (Swagger API)
+
+# Graphiti needs Neo4j running first
+nix run .#neo4j-start   # separate terminal
+nix develop .#graphiti
+```
+
+**Always enter via `nix develop` from repo root** — never activate venvs directly. The shellHook sources `.env.lightrag` and sets `LD_LIBRARY_PATH`.
+
+## Configuration
+
+### `.env.lightrag`
+**Restart LightRAG after changes.**
+
+| Var | Value |
+|-----|-------|
+| `LLM_BINDING` | `openai` |
+| `LLM_MODEL` | `minimax/minimax-m2.7` |
+| `LLM_BINDING_HOST` | `https://openrouter.ai/api/v1` |
+| `EMBEDDING_BINDING` | `ollama` |
+| `EMBEDDING_MODEL` | `qwen3-embedding:4b` |
+| `EMBEDDING_DIM` | `2560` |
+
+Verify embedding works:
+```sh
+curl -s http://localhost:11434/api/embed \
+  -H "Content-Type: application/json" \
+  -d '{"model":"qwen3-embedding:4b","input":"test"}'
+```
+
+**Critical:** If `EMBEDDING_DIM` changes, delete `rag_storage/` before restarting — old vectors are incompatible.
+
+## LightRAG Storage
+File-based by default (`JsonKVStorage`, `NanoVectorDBStorage`, `NetworkXStorage`). All data in `rag_storage/` (gitignored). Safe to delete to reset.
+
+## Nix / NixOS Notes
+- `UV_PYTHON` pinned to nix-provided Python 3.12 (system has 3.14)
+- `LD_LIBRARY_PATH` set in shellHook for native wheels
+- LightRAG installs with `--extra api --extra offline-llm`
+- WebUI (React/Bun) built on first shell entry if `lightrag/lightrag/api/webui/` missing
+
+## Known Issue: Pipeline Stuck
+
+After config changes, pipeline may show `busy: true` with pending async locks. Symptoms:
+- `GET /documents/pipeline_status` returns `busy: true`, `request_pending: true`
+- New inserts stay at `status: pending`
+
+Fix: delete `rag_storage/`, restart. Or `POST /documents/cancel_pipeline`.
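
Review note: the stuck-pipeline symptoms in AGENTS.md can be checked mechanically. A single `busy: true` reading is normal while documents are being indexed, so this sketch only reports a stall when the symptom persists across several polls. It assumes each sample is the parsed JSON body of `GET /documents/pipeline_status` with boolean `busy` and `request_pending` fields, as described above. `looks_stuck` and `min_samples` are hypothetical names chosen for illustration.

```python
def looks_stuck(samples: list[dict], min_samples: int = 3) -> bool:
    """Heuristic stall check over consecutive pipeline_status readings.

    Returns True only if there are at least `min_samples` readings and
    every one of them shows both busy and request_pending set.
    """
    if len(samples) < min_samples:
        return False
    return all(
        bool(s.get("busy")) and bool(s.get("request_pending")) for s in samples
    )


if __name__ == "__main__":
    # Synthetic readings; real ones would come from polling the endpoint.
    stalled = [{"busy": True, "request_pending": True}] * 3
    print(looks_stuck(stalled))
```

How far apart the polls should be is left to the caller, since reasonable indexing times depend on the models in use.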
@@ -1,80 +0,0 @@
-# CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
-## Purpose
-
-Private learning tool. Ingest study materials → build a knowledge graph → query concepts → export flashcards to Anki.
-
-Two systems:
- **LightRAG** (`lightrag/` submodule) — graph-based RAG server. Ingests documents, builds a knowledge graph, answers queries. Primary interface.
- **Graphiti** (`graphiti/` submodule) — temporal knowledge graph library. Tracks *when* concepts were learned and how understanding evolves. Used as a Python library, not a server.
-
-Both run fully local via Ollama. No cloud dependencies.
-
-## Running Things
-
-**Always enter via `nix develop` from the repo root — never activate the venv directly.** The shellHook sources `.env.lightrag` / `.env.graphiti` and sets `LD_LIBRARY_PATH` needed for native wheels on NixOS.
-
-```sh
-# LightRAG server (API + WebUI)
-nix develop .#lightrag
-lightrag-server
-# → http://localhost:9621/webui  (React frontend)
-# → http://localhost:9621/docs   (Swagger API)
-
-# Graphiti (library, no server)
-nix run .#neo4j-start   # required first, separate terminal
-nix develop .#graphiti
-
-# Neo4j management
-nix run .#neo4j-start
-nix run .#neo4j-stop
-```
-
-## Current Models (Ollama)
-
-| Role | Model | Dim |
-|------|-------|-----|
-| LLM | `qwen3:0.6b` | — |
-| Embeddings | `qwen3-embedding:0.6b` | 1024 |
-
-**Critical:** if the embedding model or `EMBEDDING_DIM` changes, `rag_storage/` must be deleted before restarting — old vectors are incompatible.
-
-## Configuration
-
-`.env.lightrag` is sourced by the shellHook and read by `lightrag-server` at startup. **Changes require a server restart** — the server does not hot-reload env vars.
-
-Key vars:
- `LLM_MODEL` / `EMBEDDING_MODEL` — Ollama model tags
- `EMBEDDING_DIM` — must exactly match what the embedding model outputs (verify with `curl -s http://localhost:11434/api/embed -d '{"model":"<name>","input":"test"}' | python3 -c "import sys,json; d=json.load(sys.stdin); print(len(d['embeddings'][0]))"`)
- `EMBEDDING_TIMEOUT` / `TIMEOUT` — in seconds; worker execution timeout is `2× EMBEDDING_TIMEOUT`
- `RAG_DIR` — resolved relative to where `lightrag-server` is invoked (inside `lightrag/` subdir due to shellHook `cd`)
-
-## Infrastructure Notes
-
-### Nix / NixOS
- Impure devShells: Nix provides Python 3.12 + uv; `uv sync` installs PyPI deps into `lightrag/.venv` or `graphiti/.venv` at shell entry.
- `LD_LIBRARY_PATH` is set in shellHook for `libstdc++.so.6` — required for numpy and other native wheels on NixOS.
- `UV_PYTHON` is pinned to the nix-provided Python 3.12 binary to prevent uv from picking up the system Python (3.14 on this machine).
- LightRAG installs with `--extra api --extra offline-llm` (the `ollama` Python package lives in `offline-llm`, not `api`).
- WebUI (React/Bun) is built on first shell entry if `lightrag/lightrag/api/webui/` doesn't exist.
-
-### Ollama
- Configured in `~/nix-config/machines/n1n1/services/ollama.nix`
- Uses `pkgs.ollama-rocm` (AMD ROCm) — iGPU is detected and used by default
- `OLLAMA_NUM_GPU=0` is set in NixOS config to force CPU-only mode (iGPU was consuming shared RAM)
- Ollama CORS origin includes `http://127.0.0.1:8080` (open-webui) and `https://ollama.jibai.dev`
-
-### LightRAG Storage
-File-based by default (`JsonKVStorage`, `NanoVectorDBStorage`, `NetworkXStorage`). All data lives in `rag_storage/` (gitignored). Safe to delete entirely to reset.
-
-## Known Issues / Active Debugging
-
-**LightRAG pipeline getting stuck**: After a server restart following config changes, the pipeline shows `busy: true` with pending async locks but doesn't process documents. Symptoms:
- `GET /documents/pipeline_status` returns `busy: true`, `request_pending: true`
- `keyed_locks.pending_async_cleanup` > 0
- New inserts stay at `status: pending` indefinitely
- `POST /documents/cancel_pipeline` may be needed to unblock
-
-The root cause is not yet determined. Suspicion: stale lock state inherited from previous failed runs persisted in `rag_storage/` JSON files. Try deleting `rag_storage/` and restarting the server fresh.
@@ -3,11 +3,17 @@

  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

-  outputs = { self, nixpkgs }:
+  outputs =
+    { self, nixpkgs }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system};

+      stdLibs = pkgs.lib.makeLibraryPath [
+        pkgs.stdenv.cc.cc
+        pkgs.zlib
+      ];
+
      startNeo4j = pkgs.writeShellScript "start-neo4j" ''
        set -e
        : "''${RAGS_ROOT:=$PWD}"
@@ -41,18 +47,77 @@
        ${pkgs.neo4j}/bin/neo4j stop
      '';

-    in {
+      startAll = pkgs.writeShellScript "start-all" ''
+        set -e
+        : "''${RAGS_ROOT:=$PWD}"
+
+        if [ -f "$RAGS_ROOT/.env.lightrag" ]; then
+          set -a; source "$RAGS_ROOT/.env.lightrag"; set +a
+        fi
+
+        LIGHTRAG_BIN="$RAGS_ROOT/lightrag/.venv/bin/lightrag-server"
+        LOG_DIR="$RAGS_ROOT/logs"
+        mkdir -p "$LOG_DIR"
+
+        LIGHTRAG_PID=""
+        cleanup() {
+          echo ""
+          echo "Shutting down..."
+          [ -n "$LIGHTRAG_PID" ] && kill "$LIGHTRAG_PID" 2>/dev/null || true
+          wait 2>/dev/null || true
+        }
+        trap cleanup EXIT INT TERM
+
+        echo "Starting LightRAG server..."
+        "$LIGHTRAG_BIN" >> "$LOG_DIR/lightrag.log" 2>&1 &
+        LIGHTRAG_PID=$!
+
+        wait_for() {
+          local label=$1 url=$2 tries=0
+          printf "  Waiting for %s" "$label"
+          while ! ${pkgs.curl}/bin/curl -so /dev/null --max-time 2 "$url" 2>/dev/null; do
+            tries=$((tries+1))
+            [ $tries -ge 300 ] && { echo " TIMEOUT — check logs/$label.log"; exit 1; }
+            printf "."
+            sleep 1
+          done
+          echo " ready"
+        }
+
+        wait_for "lightrag" "http://localhost:9621/docs"
+
+        echo ""
+        echo "All services up:"
+        echo "  LightRAG webui: http://localhost:9621/webui"
+        echo "  LightRAG API:   http://localhost:9621/docs"
+        echo "  LLM:            ''${LLM_BINDING_HOST:-unset} (external)"
+        echo "  Embeddings:     ''${EMBEDDING_BINDING_HOST:-unset} (external)"
+        echo "  logs:           $LOG_DIR/"
+        echo ""
+        echo "Ctrl+C to stop everything."
+        echo ""
+
+        tail -f "$LOG_DIR/lightrag.log"
+      '';
+
+    in
+    {
      devShells.${system} = {

        lightrag = pkgs.mkShell {
-          packages = [ pkgs.uv pkgs.python312 pkgs.curl pkgs.bun ];
+          packages = [
+            pkgs.uv
+            pkgs.python312
+            pkgs.curl
+            pkgs.bun
+          ];

          shellHook = ''
            RAGS_ROOT="$PWD"
            export VIRTUAL_ENV="$RAGS_ROOT/lightrag/.venv"
            export UV_PROJECT_ENVIRONMENT="$VIRTUAL_ENV"
            export UV_PYTHON="${pkgs.python312}/bin/python3.12"
-            export LD_LIBRARY_PATH="${pkgs.lib.makeLibraryPath [ pkgs.stdenv.cc.cc pkgs.zlib ]}:$LD_LIBRARY_PATH"
+            export LD_LIBRARY_PATH="${stdLibs}:$LD_LIBRARY_PATH"

            echo "Syncing lightrag venv..."
            (cd "$RAGS_ROOT/lightrag" && uv sync --extra api --extra offline-llm --quiet)
@@ -69,22 +134,27 @@

            echo ""
            echo "LightRAG shell ready."
-            echo "  start:  lightrag-server"
+            echo "  start server:  lightrag-server"
+            echo "  start all:     nix run .#start"
            echo "  config:        $RAGS_ROOT/.env.lightrag"
-            echo "  needs:  ollama with qwen3:0.6b + qwen3-embedding:0.6b"
            echo ""
          '';
        };

        graphiti = pkgs.mkShell {
-          packages = [ pkgs.uv pkgs.python312 pkgs.neo4j pkgs.curl ];
+          packages = [
+            pkgs.uv
+            pkgs.python312
+            pkgs.neo4j
+            pkgs.curl
+          ];

          shellHook = ''
            RAGS_ROOT="$PWD"
            export VIRTUAL_ENV="$RAGS_ROOT/graphiti/.venv"
            export UV_PROJECT_ENVIRONMENT="$VIRTUAL_ENV"
            export UV_PYTHON="${pkgs.python312}/bin/python3.12"
-            export LD_LIBRARY_PATH="${pkgs.lib.makeLibraryPath [ pkgs.stdenv.cc.cc pkgs.zlib ]}:$LD_LIBRARY_PATH"
+            export LD_LIBRARY_PATH="${stdLibs}:$LD_LIBRARY_PATH"
            cd "$RAGS_ROOT/graphiti"

            echo "Syncing graphiti venv..."
@@ -99,7 +169,6 @@
            echo "Graphiti shell ready."
            echo "  neo4j:  nix run .#neo4j-start   (in another terminal, run first)"
            echo "  config: $RAGS_ROOT/.env.graphiti"
-            echo "  needs:  ollama with qwen3:0.6b + qwen3-embedding:0.6b"
            echo ""
          '';
        };
@@ -107,8 +176,18 @@
      };

      apps.${system} = {
-        neo4j-start = { type = "app"; program = "${startNeo4j}"; };
-        neo4j-stop  = { type = "app"; program = "${stopNeo4j}"; };
+        start = {
+          type = "app";
+          program = "${startAll}";
+        };
+        neo4j-start = {
+          type = "app";
+          program = "${startNeo4j}";
+        };
+        neo4j-stop = {
+          type = "app";
+          program = "${stopNeo4j}";
+        };
      };
    };
 }
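
Review note: the `wait_for` helper in `startAll` polls the LightRAG URL once per second and gives up after 300 failed attempts. The same loop is sketched here in Python for anyone who wants the readiness check inside the test scripts added by this commit. The `probe` and `sleep` parameters are hypothetical injection points (not part of the flake) so the loop can be exercised without a running server.

```python
import time
from typing import Callable


def wait_for(
    probe: Callable[[], bool],
    max_tries: int = 300,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll probe() until it returns True; return the number of failed attempts.

    Raises TimeoutError once max_tries attempts have failed, which plays the
    role of the shell loop's `exit 1`.
    """
    tries = 0
    while not probe():
        tries += 1
        if tries >= max_tries:
            raise TimeoutError(f"gave up after {tries} attempts")
        sleep(delay)
    return tries


if __name__ == "__main__":
    # Fake probe that succeeds on the third call; a real one might issue an
    # HTTP GET against http://localhost:9621/docs and report whether it answered.
    answers = iter([False, False, True])
    print(wait_for(lambda: next(answers), sleep=lambda _: None))
```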
@@ -0,0 +1,3 @@
+OPENAI_API_KEY=your-openai-api-key-here
+LIGHTRAG_WORKING_DIR=./lightrag_workspace
+LIGHTRAG_EMBEDDING_MODEL=text-embedding-3-small
@@ -0,0 +1 @@
+3.10
@@ -0,0 +1,57 @@
+import os
+import httpx
+from fastmcp import FastMCP
+
+LIGHTRAG_URL = os.getenv("LIGHTRAG_URL", "http://localhost:9621")
+
+mcp = FastMCP("LightRAG")
+
+
+@mcp.tool
+async def insert_documents(documents: list[str]) -> str:
+    """Insert text documents into LightRAG for indexing.
+
+    Args:
+        documents: List of document strings to index. Each string is treated as a separate document.
+
+    Returns:
+        Tracking ID for the insertion operation.
+    """
+    async with httpx.AsyncClient(timeout=120.0) as client:
+        r = await client.post(
+            f"{LIGHTRAG_URL}/documents/texts",
+            json={"texts": documents},
+        )
+        r.raise_for_status()
+        data = r.json()
+        return data.get("track_id", data.get("message", "unknown"))
+
+
+@mcp.tool
+async def query_documents(query: str, mode: str = "mix", top_k: int = 60) -> dict:
+    """Query LightRAG and retrieve relevant context without LLM generation.
+
+    Args:
+        query: The search query string.
+        mode: Retrieval mode - "local", "global", "hybrid", "naive", "mix" (default: "mix").
+        top_k: Number of top results to retrieve (default: 60).
+
+    Returns:
+        Structured retrieval data including entities, relationships, and text chunks.
+    """
+    async with httpx.AsyncClient(timeout=120.0) as client:
+        r = await client.post(
+            f"{LIGHTRAG_URL}/query/data",
+            json={
+                "query": query,
+                "mode": mode,
+                "only_need_context": True,
+                "top_k": top_k,
+            },
+        )
+        r.raise_for_status()
+        return r.json()
+
+
+if __name__ == "__main__":
+    mcp.run()
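
Review note: `query_documents` accepts any string for `mode` and any integer for `top_k`, so a typo from the calling agent is only caught, if at all, once the request reaches LightRAG. A minimal sketch of validating before the HTTP call follows. `VALID_MODES` and `build_query_payload` are hypothetical names; the mode list is copied from the tool's docstring, and the payload keys match the ones `main.py` already sends.

```python
VALID_MODES = ("local", "global", "hybrid", "naive", "mix")


def build_query_payload(query: str, mode: str = "mix", top_k: int = 60) -> dict:
    """Validate arguments and build the JSON body for POST /query/data."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if top_k < 1:
        raise ValueError(f"top_k must be positive, got {top_k}")
    return {
        "query": query,
        "mode": mode,
        "only_need_context": True,
        "top_k": top_k,
    }


if __name__ == "__main__":
    print(build_query_payload("What is Python?", mode="local"))
```

Raising `ValueError` inside the tool gives the agent an error message that names the allowed modes, which may be easier to recover from than a raw HTTP status error.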
@@ -0,0 +1,11 @@
+[project]
+name = "lightrag-mcp"
+version = "0.1.0"
+description = "FastMCP server exposing LightRAG insert and query tools over stdio"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "fastmcp>=3.2.4",
+    "httpx>=0.28.1",
+    "lightrag-hku>=1.4.15",
+]
@@ -0,0 +1,83 @@
+import asyncio
+import json
+import os
+from mcp import ClientSession, StdioServerParameters
+from mcp.client.stdio import stdio_client
+
+
+async def main():
+    server = StdioServerParameters(
+        command="uv",
+        args=[
+            "run",
+            "--directory",
+            "/home/df/projects/rags/lightrag-mcp",
+            "python",
+            "main.py",
+        ],
+    )
+
+    async with stdio_client(server) as (read, write):
+        async with ClientSession(read, write) as session:
+            await session.initialize()
+
+            print("--- INSERT ---")
+            result = await session.call_tool(
+                "insert_documents",
+                arguments={
+                    "documents": [
+                        "Python is a high-level programming language known for its simplicity and readability.",
+                        "JavaScript was created in 1995 by Brendan Eich at Netscape.",
+                        "Machine learning is a subset of artificial intelligence that enables systems to learn from data.",
+                    ]
+                },
+            )
+            print(f"Insert result: {result.content[0].text[:200]}")
+
+            print("\n--- QUERY (mix) ---")
+            result = await session.call_tool(
+                "query_documents",
+                arguments={
+                    "query": "Tell me about programming languages",
+                    "mode": "mix",
+                    "top_k": 60,
+                },
+            )
+
+            data = json.loads(result.content[0].text)
+            d = data.get("data", {})
+            print(f"Entities: {len(d.get('entities', []))}")
+            print(f"Relationships: {len(d.get('relationships', []))}")
+            print(f"Chunks: {len(d.get('chunks', []))}")
+            for c in d.get("chunks", [])[:2]:
+                print(f"  - {c.get('content', '')[:100]}")
+
+            print("\n--- QUERY (local) ---")
+            result = await session.call_tool(
+                "query_documents",
+                arguments={"query": "What is Python?", "mode": "local", "top_k": 60},
+            )
+            data = json.loads(result.content[0].text)
+            d = data.get("data", {})
+            print(f"Entities: {len(d.get('entities', []))}")
+            print(f"Chunks: {len(d.get('chunks', []))}")
+
+            print("\n--- QUERY (global) ---")
+            result = await session.call_tool(
+                "query_documents",
+                arguments={
+                    "query": "What topics are covered?",
+                    "mode": "global",
+                    "top_k": 60,
+                },
+            )
+            data = json.loads(result.content[0].text)
+            d = data.get("data", {})
+            print(f"Entities: {len(d.get('entities', []))}")
+            print(f"Relationships: {len(d.get('relationships', []))}")
+
+    print("\nDone!")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
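
Review note: the three query blocks in this script repeat the same parse-and-count steps. One way to fold them is a small helper like the one below. It assumes, as the script itself does, that the tool result text is a JSON document with a top-level `data` object holding `entities`, `relationships`, and `chunks` lists. `summarize` is a hypothetical name and the sample payload is synthetic.

```python
import json

SECTIONS = ("entities", "relationships", "chunks")


def summarize(raw_text: str) -> dict:
    """Count the items in each retrieval section of a /query/data style payload."""
    payload = json.loads(raw_text)
    data = payload.get("data", {})
    # Missing sections count as zero rather than raising.
    return {name: len(data.get(name, [])) for name in SECTIONS}


if __name__ == "__main__":
    sample = json.dumps({"data": {"entities": [{}, {}], "chunks": [{"content": "x"}]}})
    for name, count in summarize(sample).items():
        print(f"{name}: {count}")
```

Each query block would then reduce to one `call_tool` plus `summarize(result.content[0].text)`.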
@@ -0,0 +1,77 @@
+import httpx
+import asyncio
+
+
+async def main():
+    base_url = "http://localhost:9621"
+
+    async with httpx.AsyncClient(timeout=120.0) as client:
+        print("--- INSERT ---")
+        docs = [
+            "Python is a high-level programming language known for its simplicity and readability.",
+            "JavaScript was created in 1995 by Brendan Eich at Netscape.",
+            "Machine learning is a subset of artificial intelligence that enables systems to learn from data.",
+            "LightRAG combines knowledge graph and vector retrieval for enhanced RAG applications.",
+            "FastMCP is a framework for building MCP servers in Python.",
+        ]
+        r = await client.post(f"{base_url}/documents/texts", json={"texts": docs})
+        r.raise_for_status()
+        print(f"Inserted: {r.json()}")
+
+        print("\n--- QUERY (mix mode) ---")
+        r = await client.post(
+            f"{base_url}/query/data",
+            json={
+                "query": "Tell me about programming languages",
+                "mode": "mix",
+                "only_need_context": True,
+                "top_k": 60,
+            },
+        )
+        r.raise_for_status()
+        result = r.json()
+        print(f"mode=mix keys: {list(result.keys())}")
+        if "chunks" in result.get("data", {}):
+            print(f"  chunks: {len(result['data']['chunks'])} returned")
+            for c in result["data"]["chunks"][:2]:
+                print(f"    - {c.get('content', '')[:100]}...")
+
+        print("\n--- QUERY (local mode) ---")
+        r = await client.post(
+            f"{base_url}/query/data",
+            json={
+                "query": "What is Python?",
+                "mode": "local",
+                "only_need_context": True,
+                "top_k": 60,
+            },
+        )
+        r.raise_for_status()
+        result = r.json()
+        print(f"mode=local keys: {list(result.keys())}")
+        if "chunks" in result.get("data", {}):
+            print(f"  chunks: {len(result['data']['chunks'])} returned")
+
+        print("\n--- QUERY (global mode) ---")
+        r = await client.post(
+            f"{base_url}/query/data",
+            json={
+                "query": "What topics are covered?",
+                "mode": "global",
+                "only_need_context": True,
+                "top_k": 60,
+            },
+        )
+        r.raise_for_status()
+        result = r.json()
+        print(f"mode=global keys: {list(result.keys())}")
+        if "entities" in result.get("data", {}):
+            print(f"  entities: {len(result['data']['entities'])} returned")
+        if "relationships" in result.get("data", {}):
+            print(f"  relationships: {len(result['data']['relationships'])} returned")
+
+    print("\nDone!")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+"""
+Patch vllm's cpu_platform_plugin to respect VLLM_TARGET_DEVICE=cpu.
+
+The upstream CUDA build only activates the CPU platform on macOS or when
+the version string contains "cpu" (source builds). This patch adds a third
+condition: if VLLM_TARGET_DEVICE=cpu is set in the environment.
+
+Run after every `uv pip install vllm` — idempotent.
+"""
+import pathlib
+import sys
+
+venv = pathlib.Path(__file__).parent.parent / "vllm" / ".venv"
+target = venv / "lib" / "python3.12" / "site-packages" / "vllm" / "platforms" / "__init__.py"
+
+if not target.exists():
+    print(f"vllm not installed at {target}, skipping patch")
+    sys.exit(0)
+
+content = target.read_text()
+
+if "VLLM_TARGET_DEVICE" in content:
+    print("patch already applied")
+    sys.exit(0)
+
+old = '''\
+        if not is_cpu:
+            import sys
+
+            is_cpu = sys.platform.startswith("darwin")
+            if is_cpu:
+                logger.debug(
+                    "Confirmed CPU platform is available because the machine is MacOS."
+                )'''
+
+new = old + '''
+
+        if not is_cpu:
+            import os
+
+            is_cpu = os.environ.get("VLLM_TARGET_DEVICE", "").lower() == "cpu"
+            if is_cpu:
+                logger.debug("Confirmed CPU platform is available because VLLM_TARGET_DEVICE=cpu.")'''
+
+if old not in content:
+    print("ERROR: patch target not found — vllm version may have changed", file=sys.stderr)
+    sys.exit(1)
+
+target.write_text(content.replace(old, new, 1))
+print("patched cpu_platform_plugin to respect VLLM_TARGET_DEVICE=cpu")
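
Review note: the script's idempotency rests on two string checks, a marker that signals the patch is already present and an exact match on the block being extended. The sketch below isolates that decision as a pure function so it can be tested without a vllm install. `apply_patch` and its status strings are hypothetical names, and the sample text is synthetic rather than real vllm source.

```python
def apply_patch(content: str, old: str, addition: str, marker: str) -> tuple[str, str]:
    """Return (new_content, status) without touching the filesystem.

    status is "already-applied" if the marker is present, "target-missing"
    if the anchor block cannot be found, and "patched" otherwise.
    """
    if marker in content:
        return content, "already-applied"
    if old not in content:
        return content, "target-missing"
    # Append the addition right after the first occurrence of the anchor block.
    return content.replace(old, old + addition, 1), "patched"


if __name__ == "__main__":
    source = "if not is_cpu:\n    check_darwin()\n"
    extra = "if not is_cpu:\n    check_env('VLLM_TARGET_DEVICE')\n"
    once, first = apply_patch(source, source, extra, "VLLM_TARGET_DEVICE")
    twice, second = apply_patch(once, source, extra, "VLLM_TARGET_DEVICE")
    print(first, second, once == twice)
```

The marker check has to come first: after a successful run the anchor block is still present, so testing only for `old` would append the addition again on every run.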