The fastest way to get started is to choose one of two routes: use a hosted API or run the model weights locally. If you want minimal setup, start with Z.ai’s API quick start: create an API key, pick the GLM-5 model name, and make a simple HTTP request (or use an official SDK) to send a prompt and receive a response. This lets you validate prompt patterns and output formats before you invest in GPU provisioning, model downloads, and serving. If you prefer local control, download the model weights from the official model hosting page and run inference with a compatible runtime (commonly Transformers-based tooling or an optimized serving engine).
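To make the hosted route concrete, here is a minimal sketch of that first request in Python. It assumes an OpenAI-style chat-completions payload and uses a placeholder endpoint URL and the model id `glm-5`; the exact URL, model name, and response schema should be confirmed against Z.ai's API quick start.

```python
import os

import requests

# Placeholder endpoint and model id; confirm both against Z.ai's API quick start.
API_URL = "https://api.z.ai/api/paas/v4/chat/completions"
API_KEY = os.environ["ZAI_API_KEY"]  # the key created in the Z.ai console

payload = {
    "model": "glm-5",  # assumed model name; use the id listed in the docs
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
    "max_tokens": 128,
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
# OpenAI-style response shape assumed here; adjust if Z.ai's schema differs.
print(resp.json()["choices"][0]["message"]["content"])
```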
A practical “hello world” path for developers looks like this: (1) call the API with a small prompt and verify you can stream or receive completions, (2) lock down basic safety and formatting (system-message rules, JSON schema constraints, maximum tokens), and (3) add logging so you can trace prompts, model settings, and outputs for debugging. For example, if you’re building an internal assistant, you might start with a single endpoint, /chat, that forwards messages to GLM-5 and returns the model output. Then iterate by adding guardrails: reject prompts that exceed size limits, validate the response JSON with a real parser, and include “do not guess” instructions for unknown questions. On the local route, do the same but add performance checks: latency per token, peak VRAM usage, and throughput under concurrent load.
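A sketch of that /chat service, with the size-limit and JSON guardrails plus basic logging wired in, might look like the following. It uses FastAPI for the server and reuses the same placeholder endpoint and `glm-5` model id as above; the `call_glm` helper, system-message rules, and limits are illustrative assumptions rather than a prescribed setup.

```python
import json
import logging
import os

import requests
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Placeholder endpoint and model id; confirm against Z.ai's API docs.
API_URL = "https://api.z.ai/api/paas/v4/chat/completions"
MAX_PROMPT_CHARS = 8_000
SYSTEM_RULES = (
    'Respond with a single JSON object: {"answer": "..."}. '
    'If you do not know the answer, return {"answer": null}; do not guess.'
)

app = FastAPI()
log = logging.getLogger("chat")


def call_glm(messages: list[dict]) -> str:
    """Minimal, non-streaming call; swap in the official SDK if you prefer."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['ZAI_API_KEY']}"},
        json={"model": "glm-5", "messages": messages, "max_tokens": 512},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


class ChatRequest(BaseModel):
    message: str


@app.post("/chat")
def chat(req: ChatRequest):
    # Guardrail 1: reject oversized prompts before they reach the model.
    if len(req.message) > MAX_PROMPT_CHARS:
        raise HTTPException(status_code=413, detail="prompt too long")

    messages = [
        {"role": "system", "content": SYSTEM_RULES},
        {"role": "user", "content": req.message},
    ]
    raw = call_glm(messages)

    # Trace the prompt and output so behaviour can be debugged later.
    log.info("chat turn: prompt=%r output=%r", req.message, raw)

    # Guardrail 2: enforce the JSON contract with a real parser, not string checks.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        raise HTTPException(status_code=502, detail="model returned non-JSON output")
```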
Once basic prompting works, the next “starter” milestone should be retrieval—because most developer products need accurate answers over your own docs. Create a small corpus (say 200–2,000 documents), generate embeddings, and store them in Milvus or managed Zilliz Cloud. Add metadata fields so you can filter by product/version and avoid mixing contexts. Then implement a simple RAG loop: user question → embed → top-k retrieve → prompt GLM-5 with retrieved chunks → produce answer. This is usually where early prototypes become useful: you can answer questions about APIs, internal runbooks, or onboarding docs without putting that data into model weights. After that, you can expand to tool calling (functions) and multi-step workflows, but retrieval first will give you the biggest quality jump for the least complexity.
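A compact version of that loop, assuming pymilvus with Milvus Lite (a local file) and a small sentence-transformers model for embeddings, could look like this. The collection name, metadata fields, filter expression, and the reuse of the `call_glm` helper from the earlier sketch are all illustrative; in production you would point MilvusClient at a Milvus server or Zilliz Cloud URI and choose an embedding model deliberately.

```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

from chat_app import call_glm  # helper from the /chat sketch above (hypothetical module path)

# Small local embedding model for illustration (384-dim vectors); for Zilliz Cloud,
# pass your cluster URI and token to MilvusClient instead of a local file.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = MilvusClient("rag_demo.db")  # Milvus Lite; use a server/Zilliz URI in production

COLLECTION = "docs"
if not client.has_collection(COLLECTION):
    client.create_collection(collection_name=COLLECTION, dimension=384)

# Index a tiny corpus; extra keys (text, product, version) are stored as metadata
# so retrieval can be filtered and contexts kept separate.
docs = [
    {"id": 1, "product": "assistant", "version": "v1",
     "text": "The /chat endpoint accepts a JSON body with a 'message' field."},
    {"id": 2, "product": "platform", "version": "v2",
     "text": "Runbook: restart the ingest worker after rotating its credentials."},
]
client.insert(
    collection_name=COLLECTION,
    data=[{"vector": encoder.encode(d["text"]).tolist(), **d} for d in docs],
)


def answer(question: str, top_k: int = 3) -> str:
    """Question -> embed -> top-k retrieve -> prompt GLM-5 with the retrieved chunks."""
    hits = client.search(
        collection_name=COLLECTION,
        data=[encoder.encode(question).tolist()],
        limit=top_k,
        filter='product == "assistant"',  # metadata filter to avoid mixing contexts
        output_fields=["text"],
    )[0]
    context = "\n\n".join(hit["entity"]["text"] for hit in hits)
    messages = [
        {"role": "system", "content": "Answer only from the provided context. "
                                      "If the context does not contain the answer, say you do not know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    return call_glm(messages)


print(answer("What does the /chat endpoint expect?"))
```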
