MEVZU N° TAG / VOL. 147
#serving
Wiki
PagedAttention
A technique that manages the KV cache like virtual-memory pages, allocating fixed-size blocks on demand to nearly eliminate memory waste and fragmentation.
- EN: PagedAttention
- TR: PagedAttention
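A minimal sketch of the PagedAttention idea, assuming a fixed block size (illustrative only, not vLLM's actual implementation): each sequence keeps a block table mapping logical token positions to fixed-size physical blocks, so memory is allocated on demand and freed blocks go straight back to a shared pool.

```python
# Toy PagedAttention-style KV block management (illustrative sketch).
BLOCK_SIZE = 16  # tokens per physical block (assumed)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free list."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()  # raises IndexError when memory is exhausted

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class SequenceKV:
    """Per-sequence block table: logical positions -> physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is allocated only when the last one is full, so at
        # most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def free_all(self) -> None:
        # Finished sequences return their blocks to the pool immediately.
        for b in self.block_table:
            self.allocator.release(b)
        self.block_table.clear()
```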
Cold Start
The slow first response when a model or service has been idle and must initialise on demand.
- EN: Cold Start
- TR: Soğuk Başlatma
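A common mitigation, sketched below with a stand-in model class (all names hypothetical): load once behind a lazy singleton and fire a warm-up call at startup, so the first real request doesn't pay the initialisation cost.

```python
import time

class FakeModel:
    """Stand-in for a real model; constructing it is the expensive part."""
    def __init__(self):
        time.sleep(2.0)  # simulate weight loading / GPU init

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

_model = None

def get_model() -> FakeModel:
    """Lazy singleton: only the first caller pays the cold-start cost."""
    global _model
    if _model is None:
        t0 = time.perf_counter()
        _model = FakeModel()
        print(f"cold start took {time.perf_counter() - t0:.1f}s")
    return _model

def warm_up() -> None:
    """Call at process start so the first real request hits a warm model."""
    get_model().generate("ping")
```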
vLLM
An open-source inference framework that delivers high-throughput LLM serving via PagedAttention.
- EN: vLLM
- TR: vLLM
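A short usage sketch based on vLLM's documented offline-inference API; the model name is just an example.

```python
from vllm import LLM, SamplingParams

# Load a model once; vLLM manages the KV cache with PagedAttention
# and batches concurrent requests automatically.
llm = LLM(model="facebook/opt-125m")  # example model; swap in your own

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is paging?", "Explain batching."], params)

for out in outputs:
    print(out.outputs[0].text)
```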
llama.cpp
Georgi Gerganov's open-source C++ project that made running LLMs locally a practical reality.
- EN: llama.cpp
- TR: llama.cpp
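llama.cpp itself is a C/C++ binary, but to keep all examples in one language, here is a sketch via the separate llama-cpp-python bindings; the model path is a placeholder.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a quantized GGUF model from disk and run it fully locally,
# no server or GPU required.
llm = Llama(model_path="./models/llama-3-8b-q4.gguf")  # placeholder path

out = llm("Q: What is llama.cpp? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```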
NVIDIA Triton
NVIDIA's open-source inference server designed to serve multiple frameworks and hardware backends.
- EN: NVIDIA Triton
- TR: NVIDIA Triton
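A client-side sketch using the tritonclient Python package; the model name and tensor names ("my_model", "INPUT0", "OUTPUT0") are placeholders that would come from the model's server-side configuration.

```python
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a running Triton server (default HTTP port is 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request tensor; names and shapes must match the model's
# config.pbtxt on the server.
data = np.zeros((1, 4), dtype=np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer("my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```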
Streaming Output
Sending the model's response token-by-token in real time rather than waiting for the complete answer.
- EN: Streaming Output
- TR: Akış Çıktısı
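The core pattern, reduced to a toy: the producer yields tokens as they are decoded, and the consumer renders them immediately instead of waiting for a complete string.

```python
import time

def generate_tokens(prompt: str):
    """Stand-in for a model's decode loop: yields one token at a time."""
    for token in ["Stream", "ing ", "feels ", "much ", "faster."]:
        time.sleep(0.2)  # simulate per-token decode latency
        yield token

# The caller prints tokens as they arrive rather than at the end.
for tok in generate_tokens("hello"):
    print(tok, end="", flush=True)
print()
```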
Continuous Batching
A dynamic serving technique where new requests can join an in-flight batch and finished ones leave immediately.
- EN: Continuous Batching
- TR: Sürekli Yığınlama
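A toy scheduler loop showing the idea (deliberately far simpler than a real engine): every step decodes one token for each active sequence, finished sequences are evicted at once, and queued requests immediately fill the freed slots.

```python
import random
from collections import deque

MAX_BATCH = 4

waiting = deque(f"req{i}" for i in range(8))  # queued requests
active = {}  # request id -> tokens generated so far

def step() -> None:
    """One decode iteration over the whole in-flight batch."""
    # Admit new requests into any free slots; unlike static batching,
    # nobody waits for the current batch to fully drain.
    while waiting and len(active) < MAX_BATCH:
        active[waiting.popleft()] = 0

    # Decode one token per active sequence; evict finished ones at once.
    for req in list(active):
        active[req] += 1
        if random.random() < 0.2:  # stand-in for hitting an EOS token
            print(f"{req} done after {active[req]} tokens")
            del active[req]

while waiting or active:
    step()
```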
Ollama
A tool that makes downloading and running LLMs on your own machine as simple as a single command.
- EN: Ollama
- TR: Ollama
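Once the daemon is running and a model has been pulled (e.g. with `ollama pull`), its local REST API can be called from any language; a minimal Python sketch, assuming Ollama's default port and an example model name.

```python
import requests

# Assumes `ollama serve` is running locally and the model has been
# pulled; 11434 is Ollama's default port.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```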
KV Cache
The cache that stores previously computed key/value vectors so the model doesn't recompute them every step.
- EN: KV Cache
- TR: KV Önbelleği
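A toy illustration of the mechanics with NumPy: each decode step computes key/value vectors for the new token only and appends them to the cache, then attention reads the whole cached history.

```python
import numpy as np

d = 8  # head dimension (toy size)
rng = np.random.default_rng(0)

# The cache grows by one row per generated token.
k_cache = np.empty((0, d), dtype=np.float32)
v_cache = np.empty((0, d), dtype=np.float32)

for step in range(5):
    # Compute K/V for the *new* token only; earlier steps are never redone.
    k_new, v_new = rng.standard_normal((2, 1, d), dtype=np.float32)
    k_cache = np.vstack([k_cache, k_new])
    v_cache = np.vstack([v_cache, v_new])

    # Attention for the new token's query over the whole cached history.
    q = rng.standard_normal((d,), dtype=np.float32)
    scores = k_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache
    print(f"step {step}: cache length = {len(k_cache)}")
```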