GLM-5.2 FAST on Fireworks

Real, measured numbers — GSM8K at climbing concurrency and a 2K → 32K context sweep. No vendor stats, just the stopwatch.

Speed report · Fireworks FAST is the fastest model we've tested
Throughput: 769.8 tok/s aggregate
total output ÷ batch wall time
Combined tokens/sec across 16 parallel requests: the system's total output under load. 1.7× the nearest provider.

TTFT: 0.95 s (time to first token)
send → first token
Wait from sending the prompt to the first token coming back: the prefill cost before any text appears.

Decode: 1,089 tok/s generation
output ÷ (latency − TTFT)
How fast a single request streams tokens once generation starts (prefill excluded, 2K context). 3× the next-fastest model.

Prefill: 31,885 tok/s ingest
prompt tokens ÷ latency
How fast it reads the prompt: a full 32K-token context ingested in about a second.
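
For readers who want to reproduce the four headline figures from their own timings, the formulas on the cards can be sketched as below. This is a minimal illustration, not the harness used for this report. The `RequestTiming` record and the function names are hypothetical, and the prefill function assumes that "latency" on the Prefill card means time to first token, which the card does not state explicitly.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTiming:
    """Hypothetical per-request record; field names are assumptions."""
    prompt_tokens: int
    output_tokens: int
    ttft_s: float     # send -> first token, in seconds
    latency_s: float  # send -> last token, in seconds


def decode_tok_s(r: RequestTiming) -> float:
    """Decode speed: output / (latency - TTFT), so prefill is excluded."""
    window = r.latency_s - r.ttft_s
    if window <= 0:
        raise ValueError("latency must exceed TTFT")
    return r.output_tokens / window


def prefill_tok_s(r: RequestTiming) -> float:
    """Prefill speed: prompt tokens / latency.

    Assumption: 'latency' here is taken to be TTFT.
    """
    return r.prompt_tokens / r.ttft_s


def aggregate_tok_s(batch: Sequence[RequestTiming], batch_wall_s: float) -> float:
    """Aggregate throughput: total output tokens / batch wall time."""
    return sum(r.output_tokens for r in batch) / batch_wall_s
```

Note that aggregate throughput divides by the wall time of the whole batch, not by the sum of per-request latencies, which is why it can be lower than the single-request decode speed even when requests run in parallel.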

Throughput vs concurrency

aggregate tok/s — does it keep scaling, or flatten?

total output ÷ batch wall time
System throughput under load: climbs until the server saturates, then flattens.

Why it matters: the steeper this climbs, the more users you can serve at once without each request slowing down.

Note: this assumes constant available capacity. In practice, other users share the same capacity, and the deployment may be autoscaling.
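
A concurrency sweep of this kind can be sketched as follows. This is an illustrative outline under stated assumptions, not the script behind the chart: `fake_request` is a hypothetical stand-in for a real API call, and the concurrency levels are placeholders (the report only states that the headline figure used 16 parallel requests).

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Sequence


async def fake_request(prompt: str) -> int:
    """Hypothetical stand-in for a real request; returns output token count."""
    await asyncio.sleep(0.01)  # simulated latency, not a real measurement
    return 100


async def measure_batch(
    send: Callable[[str], Awaitable[int]], prompts: Sequence[str]
) -> float:
    """Fire all prompts at once; return total output / batch wall time."""
    start = time.perf_counter()
    counts = await asyncio.gather(*(send(p) for p in prompts))
    wall = time.perf_counter() - start
    return sum(counts) / wall


async def sweep(
    send: Callable[[str], Awaitable[int]],
    levels: Sequence[int] = (1, 2, 4, 8, 16),  # assumed levels
) -> Dict[int, float]:
    """Measure aggregate tok/s at each concurrency level, one batch per level."""
    results: Dict[int, float] = {}
    for n in levels:
        results[n] = await measure_batch(send, ["placeholder prompt"] * n)
    return results
```

Calling `asyncio.run(sweep(fake_request))` returns a mapping from concurrency level to aggregate tok/s. Against a real endpoint, the level at which the values stop rising is the saturation point the chart is looking for.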

Decode speed vs context length

generation tok/s, prefill excluded — 2K → 32K tokens

output ÷ (latency − TTFT)
Pure generation speed: flat means each request decodes at the same rate regardless of context length.

Why it matters: the flatter this stays, the less your long prompts slow generation down — big context stays fast.
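
One simple way to put a number on "flat" is the ratio of decode speed at the longest context to decode speed at the shortest. The helper below is a hypothetical sketch of that idea. The function name is an assumption, and the figures in the usage note are made up for illustration, not measurements from this report.

```python
from typing import Mapping


def decode_retention(decode_by_context: Mapping[int, float]) -> float:
    """Ratio of decode tok/s at the longest context to the shortest.

    Keys are context lengths in tokens, values are decode tok/s.
    1.0 means perfectly flat; lower means long prompts slow generation.
    """
    if not decode_by_context:
        raise ValueError("need at least one measurement")
    shortest = min(decode_by_context)
    longest = max(decode_by_context)
    return decode_by_context[longest] / decode_by_context[shortest]
```

For example, with illustrative values of 1000.0 tok/s at 2,048 tokens and 900.0 tok/s at 32,768 tokens, the function returns a retention of 0.9, meaning generation kept 90% of its short-context speed.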