Gotcha: Nemotron-3-Nano requires specific vLLM tool parser + reasoning plugin¶

Symptom¶

Model responds with text like "We need to verify system health. Let me call check_health..." but never produces a structured tool_calls response. finish_reason: stop or finish_reason: length, never finish_reason: tool_calls.

Root cause¶

Nemotron-3-Nano-30B uses a custom tool-calling format (not Llama-style JSON, not Hermes-style, not OpenAI-style). It requires:

--tool-call-parser qwen3_coder — the tool call output format
--reasoning-parser nano_v3 — the reasoning/thinking parser
--reasoning-parser-plugin nano_v3_reasoning_parser.py — a custom plugin shipped with the model weights on HuggingFace

Without all three, the model generates reasoning text that DESCRIBES calling a tool but never produces the structured output the parser expects.

Fix¶

vllm serve /weights/Nemotron-3-Nano-30B-A3B-NVFP4 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin nano_v3_reasoning_parser.py \
    --reasoning-parser nano_v3

The working directory must be set to the model weights directory (where nano_v3_reasoning_parser.py lives) or provide the full path. In Docker: -w /weights/Nemotron-3-Nano-30B-A3B-NVFP4

Also: max_tokens must be sufficient¶

Nemotron generates a <think> reasoning block before the tool call. With max_tokens=200, the reasoning block consumes the budget before the tool call is emitted. Set max_tokens >= 512 for tool-calling prompts, >= 4096 for complex audit tasks.

How we found this¶

Tried: hermes, llama3_json, pythonic, llama — all failed. Searched the vLLM recipes and HuggingFace discussions. The official vLLM recipe at github.com/vllm-project/recipes specifies qwen3_coder with the nano_v3 reasoning parser.

Environment¶

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
vLLM 0.19.0
Pipeline parallel size 2