Lessons learned from running the Log Detective service

07 February 2025

Log Detective service is live for more than two weeks now. Running an LLM inference server in production is a challenge.

We started with llama-cpp-python’s server initialy but switched over to llama-cpp server because of its parallel execution feature. I still need to benchmark it to see how much speedup we are getting.

This blog post highlights a few common challenges you might face when operating an inference server.

…

Keep reading

Comparing llama-cpp and vllm in model serving

01 November 2024

In Log Detective, we’re struggling with scalability right now. We are running an LLM serving service in the background using llama-cpp. Since users will interact with it, we need to make sure they’ll get a solid experience and won’t need to wait minutes to get an answer. Or even worse, see nasty errors.

What’s going to happen when 5, 15 or 10000 people try Log Detective service at the same time?

Let’s start the research.

…

Keep reading

Generating first set of data for LogDetective using InstructLab

19 September 2024

In the last blog (Using InstructLab in Log Detective), we went through the installation and set up process for InstructLab. The post finished with knowledge preparation. We’ll continue with that and hopefully end this one with data generated by InstructLab.

Fennel flower in our garden

…

Keep reading

Using InstructLab in Log Detective

06 September 2024

We are going to continue in the Log Detective series:

This time we’ll start exploring using InstructLab in the Log Detective infrastructure.

In this first post, we’ll obtain InstructLab and start the exploration. We will use the official RHEL AI container image that got recently released: https://www.redhat.com/en/about/press-releases/red-hat-enterprise-linux-ai-now-generally-available-enterprise-ai-innovation-production

Eggplant flower in our garden

…

Keep reading

Running logdetective service in containers with CUDA on EC2

15 August 2024

This is a follow up to my previous post “Running logdetective on an EC2 VM with CUDA”. Though this time, we’ll run the service and do our first inference!

From the previous post, we already have:

All steps to create a Containerfile
The EC2 VM with Tesla T4
Podman set up

…

Keep reading

Running logdetective on an EC2 VM with CUDA

30 July 2024

This is a followup to my previous blog about running logdetective on RHOAI with CUDA.

Here we’re starting with a fresh EC2 VM that has a nvidia GPU.

We have two challenges ahead of us:

Storage: CUDA takes a lot of space so we need to think ahead where we’ll store gigabytes of these binaries.
GCC: Right now CUDA support gcc from F39, while we have F40 as our host system.

We’ll run a F39 container rootless with the graphroot stored on an external volume to address both issues.

…

Keep reading

Running logdetective on Red Hat OpenShift AI with CUDA

09 July 2024

Let’s run Logdetective in Red Hat OpenShift AI using a Jupyter notebook with llama-cpp-python and CUDA.

Microsoft Designer: Futuristic detective who inspects shiny crystals, comics style.

…

Keep reading