Run a vLLM Server on HF Jobs in One Command
Summary
Hugging Face Jobs now allows you to spin up a private OpenAI-compatible LLM endpoint with a single command using vLLM, without provisioning servers or Kubernetes.
Cached at: 06/25/26, 11:10 PM
Source: https://huggingface.co/blog/vllm-jobs
- Prerequisites
- Launch the server
- Query it from anywhere
- Clean up
- Going further: bigger models
- Going further: Chat with it in a UI
- Going further: SSH into the running server
- Going further: Use it as a coding-agent backend with Pi
- HF Jobs or Inference Endpoints?
- Further reading
You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second. Once it’s up, you can query it from your laptop, a notebook, or anywhere else.
It’s the quickest way to stand up a model for tests, evals, or batch generation. (If you’re after a managed, production-ready service instead, that’s what Inference Endpoints are for; more on when to pick which at the end.)
Here’s the whole thing end to end.
Prerequisites
- A payment method or a positive prepaid credit balance (Jobs is billed per‑minute by hardware usage).
- huggingface_hub >= 1.20.0: pip install -U "huggingface_hub>=1.20.0".
- Logged in locally: hf auth login.
Launch the server
hf jobs run is docker run for HF infrastructure. We use the official vllm/vllm-openai image, ask for a GPU with --flavor, and expose vLLM’s port with --expose:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
    vllm/vllm-openai:latest \
    vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
--expose 8000 routes the container’s port through HF’s public jobs proxy (see the Serve Models guide for the full reference). The command prints the URL your server is reachable at:
✓ Job started
id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332
Hint: Exposed ports are reachable at (requires an HF token with read access to the job):
https://6a381ca1953ed90bfb947332--8000.hf.jobs
6a381ca1953ed90bfb947332 is your job ID. Keep track of it; we’ll need it. We’ll use <job_id> as a placeholder for it in the rest of the post.
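If you launch jobs from a script, the job ID and exposed URL can be pulled out of the CLI output instead of being copied by hand. The sketch below assumes the output looks like the sample above; that layout is not a documented interface and may change. parse_exposed_url is a hypothetical helper, not part of huggingface_hub, and the assumption that job IDs are lowercase hex comes only from the one ID shown in this post.

```python
import re

# Sample output copied from the run above. The exact layout is an assumption;
# the CLI may change it between releases.
SAMPLE_OUTPUT = """\
✓ Job started
id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332
Hint: Exposed ports are reachable at (requires an HF token with read access to the job):
https://6a381ca1953ed90bfb947332--8000.hf.jobs
"""

# Matches the exposed-port URL pattern https://<job_id>--<port>.hf.jobs
# (assumes the job ID is lowercase hex, as in the sample).
EXPOSED_URL = re.compile(r"https://(?P<job_id>[0-9a-f]+)--(?P<port>\d+)\.hf\.jobs")


def parse_exposed_url(cli_output: str) -> tuple[str, int, str]:
    """Return (job_id, port, base_url) for the first exposed URL in the output."""
    match = EXPOSED_URL.search(cli_output)
    if match is None:
        raise ValueError("no exposed URL found in the CLI output")
    return match["job_id"], int(match["port"]), match.group(0)


job_id, port, base_url = parse_exposed_url(SAMPLE_OUTPUT)
print(job_id, port, base_url)
```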
Give it a couple of minutes to download weights and boot. When the logs show Application startup complete, you’re live.
Query it from anywhere
vLLM speaks the OpenAI API, and every request just needs your HF token as a bearer token. The quickest way to hit it is curl:
curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
    -H "Authorization: Bearer $(hf auth token)" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-4B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "chat_template_kwargs": {"enable_thinking": false}
    }'
which returns the usual OpenAI-style JSON, with choices[0].message.content holding "Hello! How can I assist you today? 😊".
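For scripts that call the endpoint without the OpenAI client, the reply can be read straight from the JSON body. The sketch below uses an abridged, illustrative response (real responses carry more fields, such as an id and token usage); first_reply is a hypothetical helper name.

```python
import json

# Abridged, illustrative response body: real responses include more fields.
raw_body = json.dumps({
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I assist you today? 😊"},
            "finish_reason": "stop",
        }
    ]
})


def first_reply(body: str) -> str:
    """Extract choices[0].message.content from an OpenAI-style chat completion body."""
    data = json.loads(body)
    return data["choices"][0]["message"]["content"]


print(first_reply(raw_body))
```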
Or, from Python, point the OpenAI client at the exposed URL and pass the token as the API key:
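from huggingface_hub import get_token
from openai import OpenAI

# The exposed job URL plays the role of the OpenAI base URL;
# the locally stored HF token is sent as the bearer token.
client = OpenAI(
    base_url="https://<job_id>--8000.hf.jobs/v1",
    api_key=get_token(),
)
resp = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "Hello!"}],
    # chat_template_kwargs is not a standard OpenAI field, so it goes through extra_body.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)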
Hello! How can I assist you today? 😊
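For batch generation, one option is to fan a list of prompts out to the endpoint from a small thread pool. This is a sketch rather than something from the original walkthrough: generate_all and make_completer are hypothetical helpers, and the default of 8 workers is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def generate_all(prompts: Iterable[str], complete: Callable[[str], str], max_workers: int = 8) -> list[str]:
    """Run `complete` over all prompts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(complete, prompts))


def make_completer(base_url: str, api_key: str, model: str) -> Callable[[str], str]:
    """Build a prompt -> reply function backed by the OpenAI client (imported lazily)."""
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key=api_key)

    def complete(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content or ""

    return complete


# Usage against a running job (needs network access and a valid token):
#   from huggingface_hub import get_token
#   complete = make_completer("https://<job_id>--8000.hf.jobs/v1", get_token(), "Qwen/Qwen3-4B")
#   replies = generate_all(["Hello!", "Name three prime numbers."], complete)
```

Keeping the completion function injectable means the fan-out logic can be tried with a stub before pointing it at a billed job.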
Quick health check before you start: curl https://<job_id>--8000.hf.jobs/v1/models -H "Authorization: Bearer $(hf auth token)" should list the model.
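The same health check can be turned into a small wait loop for scripts that launch a job and then need to block until it is serving. The sketch below uses only the standard library; models_url and wait_until_ready are hypothetical helpers, and the 10-minute timeout and 10-second poll interval are arbitrary defaults, not recommendations from the post.

```python
import time
import urllib.error
import urllib.request


def models_url(job_id: str, port: int = 8000) -> str:
    """Build the /v1/models URL from a job ID, following the URL pattern shown above."""
    return f"https://{job_id}--{port}.hf.jobs/v1/models"


def wait_until_ready(url: str, token: str, timeout_s: float = 600.0, poll_s: float = 10.0) -> bool:
    """Poll `url` until it answers HTTP 200; return False if `timeout_s` elapses first."""
    deadline = time.monotonic() + timeout_s
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, TimeoutError, ConnectionError):
            pass  # not reachable or not ready yet; keep polling
        time.sleep(poll_s)
    return False


# Usage (needs network access and a valid token):
#   from huggingface_hub import get_token
#   ready = wait_until_ready(models_url("<job_id>"), get_token())
```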
🔐 The endpoint is gated, not public. Every request must carry an HF token with read access to the job’s namespace. A plain browser visit will be rejected. In effect, the jobs proxy is your API gate: access is scoped to you (and your org). That’s fine for private use, but treat the URL accordingly: don’t share it expecting it to be open, and don’t paste your token into untrusted places. If you need finer-grained or public access, put a proper gateway in front instead, or see HF Jobs or Inference Endpoints? below.
Clean up
Jobs are billed per second, so stop the server when you’re done:
hf jobs cancel <job_id>
The --timeout you set is a safety net (it’ll auto-stop), but cancelling explicitly is cheaper. An a10g-large runs at $1.50/hour; check hf jobs hardware for the full price list and pick the smallest flavor that fits your model.
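To make the difference concrete, the arithmetic can be sketched as follows. It assumes cost scales linearly with runtime at the $1.50/hour rate quoted above, with no minimum charge or rounding; the actual billing rules may differ, so treat the numbers as estimates.

```python
def job_cost_usd(runtime_seconds: float, hourly_rate_usd: float) -> float:
    """Estimated cost, assuming linear per-second billing with no minimum charge."""
    return runtime_seconds * hourly_rate_usd / 3600.0


A10G_LARGE_RATE = 1.50  # USD per hour, the a10g-large rate quoted in this post

print(f"20 minutes, then cancel: ${job_cost_usd(20 * 60, A10G_LARGE_RATE):.2f}")
print(f"full 2h timeout:         ${job_cost_usd(2 * 3600, A10G_LARGE_RATE):.2f}")
```

Under these assumptions, cancelling after 20 minutes costs about $0.50, against $3.00 if the job runs until the 2h timeout.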
Going further: bigger models
The same command scales to much larger models: pick a beefier --flavor and tell vLLM to shard the model across the GPUs with --tensor-parallel-size. For example, the 122B Qwen3.5 mixture-of-experts model on 2× H200:
hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
    vllm/vllm-openai:latest \
    vllm serve Qwen/Qwen3.5-122B-A10B \
    --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
    --max-model-len 32768 --max-num-seqs 256
--tensor-parallel-size should match the number of GPUs in the flavor (h200x2 → 2, h200x8 → 8). Run hf jobs hardware to see what’s available, and give bigger models a longer --timeout, since they take longer to download and load. For large models, H200 flavors are usually the best value.
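If you generate launch commands programmatically, the GPU count can be derived from the flavor name so the two never drift apart. The sketch below assumes a naming convention inferred from the examples in this post: a trailing x<N> means N GPUs, and no suffix means a single GPU. That convention is a guess (it would also report 1 for a CPU-only flavor), so confirm against hf jobs hardware. gpus_in_flavor and serve_args are hypothetical helpers.

```python
import re


def gpus_in_flavor(flavor: str) -> int:
    """Guess the GPU count from a flavor name such as 'h200x2'.

    Assumption: a trailing 'x<N>' means N GPUs, and no suffix means one GPU.
    """
    match = re.search(r"x(\d+)$", flavor)
    return int(match.group(1)) if match else 1


def serve_args(model: str, flavor: str) -> list[str]:
    """Build `vllm serve` arguments with a --tensor-parallel-size matching the flavor."""
    return [
        "vllm", "serve", model,
        "--host", "0.0.0.0", "--port", "8000",
        "--tensor-parallel-size", str(gpus_in_flavor(flavor)),
    ]


print(" ".join(serve_args("Qwen/Qwen3.5-122B-A10B", "h200x2")))
```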
The --max-model-len 32768 --max-num-seqs 256 flags are specific to this model: Qwen3.5-122B is a hybrid Mamba/attention architecture with a 256K-token default context, which doesn’t leave enough memory for vLLM’s default batch settings. Capping the context length and concurrent-sequence count keeps it within the GPUs’ memory. If a model fails to start with an out-of-memory or cache-block error, dialing these two down is the first thing to try. Everything else (the exposed URL, the OpenAI client, the token auth) stays exactly the same.
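One simple way to "dial down" systematically is to halve both values on each failed start. The sketch below only generates the candidate flag pairs; the halving step and the floors (4096 tokens, 16 sequences) are arbitrary choices for illustration, not values recommended by vLLM or by this post, and fallback_settings is a hypothetical helper.

```python
from typing import Iterator


def fallback_settings(
    max_model_len: int = 32768,
    max_num_seqs: int = 256,
    min_model_len: int = 4096,
    min_num_seqs: int = 16,
) -> Iterator[tuple[int, int]]:
    """Yield (max_model_len, max_num_seqs) pairs, halving both until either floor is crossed."""
    while max_model_len >= min_model_len and max_num_seqs >= min_num_seqs:
        yield max_model_len, max_num_seqs
        max_model_len //= 2
        max_num_seqs //= 2


# Print the flags to try, from the most to the least demanding.
for model_len, num_seqs in fallback_settings():
    print(f"--max-model-len {model_len} --max-num-seqs {num_seqs}")
```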
Going further: Chat with it in a UI
Prefer a chat window over curl? A few lines of Gradio point at the same endpoint. Add --reasoning-parser deepseek_r1 to the vllm serve command so Qwen3’s thinking comes back as a separate field (not necessary, but helpful), then run this code locally (you’ll just need the job ID):
import gradio as gr
from gradio import ChatMessage
from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(base_url="https://<job_id>--8000.hf.jobs/v1", api_key=get_token())


def chat(message, history):
    # Rebuild the conversation, skipping earlier "thinking" panels (the messages that carry metadata).
    messages = [{"role": m["role"], "content": m["content"]} for m in history if not m.get("metadata")]
    messages.append({"role": "user", "content": message})
    stream = client.chat.completions.create(model="Qwen/Qwen3-4B", messages=messages, stream=True)
    thinking, answer = "", ""
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no delta
        delta = chunk.choices[0].delta
        # "reasoning" is not a standard OpenAI field, so it lands in model_extra and may be absent or None.
        thinking += (delta.model_extra or {}).get("reasoning") or ""
        answer += delta.content or ""
        out = []
        if thinking.strip():
            status = "done" if answer.strip() else "pending"
            out.append(ChatMessage(role="assistant", content=thinking, metadata={"title": "💭 Thinking", "status": status}))
        if answer.strip():
            out.append(ChatMessage(role="assistant", content=answer))
        # Yield after every chunk so the UI updates as tokens stream in.
        yield out


gr.ChatInterface(chat).launch()
Run it, open http://127.0.0.1:7860, and chat: reasoning streams into the collapsible panel, with the answer below.
Going further: SSH into the running server
Need to debug a startup failure, watch GPU memory, or tail logs interactively? You can open a shell straight into the running job. Launch it with --ssh and make sure your public key is registered at huggingface.co/settings/keys:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h --ssh \
    vllm/vllm-openai:latest \
    vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
then connect with the job ID:
hf jobs ssh <job_id>
You’re now inside the container, where you can run nvidia-smi, inspect the process, or poke at the model directly, which makes debugging and monitoring much easier than reading logs from the outside. SSH support requires huggingface_hub >= 1.20.0.
Going further: Use it as a coding-agent backend with Pi
The same endpoint can back a terminal coding agent. Pi is a provider-agnostic agent harness. Point it at the job and you get a Read/Write/Edit/Bash agent running on your own self-hosted model.
One thing to set up first: agents drive the model through tool calls, and vLLM only accepts those if the server is launched with tool calling enabled. So relaunch with --enable-auto-tool-choice and a --tool-call-parser matching the model family (hermes for Qwen3). Agents also benefit from a stronger model, so this is a good place to bring in the bigger one:
hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
    vllm/vllm-openai:latest \
    vllm serve Qwen/Qwen3.5-122B-A10B \
    --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
    --max-model-len 32768 --max-num-seqs 256 \
    --reasoning-parser deepseek_r1 \
    --enable-auto-tool-choice --tool-call-parser hermes
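To see what "driving the model through tool calls" means at the API level, here is a sketch of the request an agent sends and how it reads the model's answer, using the OpenAI function-calling format. The read_file tool, the helper names, and the sample response are all made up for illustration; the response fragment was not captured from a real run.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a text file.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "File to read."}},
            "required": ["path"],
        },
    },
}


def build_request(model: str, user_message: str) -> dict:
    """Build a /v1/chat/completions body that offers the model one tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [READ_FILE_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool name, parsed arguments) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]


# Illustrative response fragment, not captured from a real run.
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "read_file", "arguments": '{"path": "README.md"}'},
            }],
        },
        "finish_reason": "tool_calls",
    }]
}

print(extract_tool_calls(sample_response))
```

The agent harness runs the named tool itself and sends the result back as a follow-up message; the server only has to emit well-formed tool calls, which is what the two extra flags enable.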
Then add the job as a custom provider in ~/.pi/agent/models.json:
{
  "providers": {
    "hf-jobs": {
      "baseUrl": "https://<job_id>--8000.hf.jobs/v1",
      "api": "openai-completions",
      "apiKey": "!hf auth token",
      "models": [
        { "id": "Qwen/Qwen3.5-122B-A10B" }
      ]
    }
  }
}
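Because the base URL embeds the job ID, the file has to be updated for every new job. A small generator can keep that mechanical. The sketch below reproduces the structure shown above and nothing more; build_pi_config is a hypothetical helper, and the field names are copied from this post rather than from Pi's documentation.

```python
import json


def build_pi_config(job_id: str, model_id: str, port: int = 8000) -> dict:
    """Build the provider entry shown above for a given job ID and model."""
    return {
        "providers": {
            "hf-jobs": {
                "baseUrl": f"https://{job_id}--{port}.hf.jobs/v1",
                "api": "openai-completions",
                "apiKey": "!hf auth token",
                "models": [{"id": model_id}],
            }
        }
    }


config = build_pi_config("6a381ca1953ed90bfb947332", "Qwen/Qwen3.5-122B-A10B")
print(json.dumps(config, indent=2))

# To write it out (this overwrites any existing file at that path):
#   from pathlib import Path
#   path = Path.home() / ".pi" / "agent" / "models.json"
#   path.parent.mkdir(parents=True, exist_ok=True)
#   path.write_text(json.dumps(config, indent=2) + "\n")
```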
Then launch the agent against it:
pi
The model you spun up a couple of commands ago, now driving an interactive coding agent in your terminal.
HF Jobs or Inference Endpoints?
HF Jobs isn’t the only way to serve a model on Hugging Face. Inference Endpoints are our managed product for the same job, and which one fits depends on what you’re after.
Reach for HF Jobs when you want maximum flexibility and control: it’s just docker run on HF infrastructure, so you pick the image, the exact vllm serve flags, and the hardware, and you pay per second for as long as the job runs. That makes it a great fit for experiments, one-off evals, batch generation, or kicking the tires on a model before committing to anything.
Reach for Inference Endpoints when you want something more production-ready. They add the operational niceties a long-lived service needs: finer-grained access control (an endpoint can be public, protected, or private), and scale-to-zero, so you’re not billed during periods of inactivity. If you’re standing up a durable endpoint rather than running a job, that’s the tool to grab.
Further reading
This post sticks to vLLM, but the same expose-a-port pattern works with any OpenAI-compatible server. To serve GGUFs with llama.cpp or run SGLang instead, see the Serve Models on Jobs guide, which walks through those backends.