Comparing llama-cpp and vllm in model serving
In Log Detective, we’re struggling with scalability right now. We are running an LLM serving service in the background using llama-cpp. Since users will interact with it, we need to make sure they get a solid experience and don’t have to wait minutes for an answer, or, even worse, see nasty errors.
What’s going to happen when 5, 15, or 10,000 people try the Log Detective service at the same time?
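
One way to make that question concrete is a tiny load test: fire N requests at the service at once and record how many succeed and how long they take. Below is a minimal sketch, assuming an OpenAI-compatible `/v1/chat/completions` endpoint on `localhost:8000`; the URL, model name, and prompt are placeholders rather than our actual deployment.

```python
# Minimal concurrent load-test sketch.
# Assumptions: the serving endpoint is OpenAI-compatible and reachable at
# localhost:8000; the model name and prompt below are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/chat/completions"  # hypothetical endpoint
PAYLOAD = {
    "model": "placeholder-model",
    "messages": [{"role": "user", "content": "Why did this build fail?"}],
    "max_tokens": 128,
}


def one_request(_: int) -> tuple[bool, float]:
    """Send a single chat completion request and measure its latency."""
    start = time.perf_counter()
    try:
        resp = requests.post(URL, json=PAYLOAD, timeout=120)
        ok = resp.status_code == 200
    except requests.RequestException:
        ok = False
    return ok, time.perf_counter() - start


def load_test(concurrency: int) -> None:
    """Fire `concurrency` requests at once, report success rate and latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(concurrency)))
    latencies = [t for ok, t in results if ok]
    if latencies:
        avg = sum(latencies) / len(latencies)
        print(f"concurrency={concurrency}: {len(latencies)}/{concurrency} ok, "
              f"avg latency {avg:.1f}s")
    else:
        print(f"concurrency={concurrency}: all requests failed")


if __name__ == "__main__":
    for n in (5, 15, 50):
        load_test(n)
```

Even a crude script like this shows whether requests queue up, time out, or error out as concurrency grows, which is exactly the behavior we want to compare between the two backends.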
Let’s start the research.
…