Comparing llama-cpp and vllm in model serving

In Log Detective, we’re struggling with scalability right now. We are running an LLM serving service in the background using llama-cpp. Since users will interact with it, we need to make sure they get a solid experience and don’t have to wait minutes for an answer, or, even worse, see nasty errors.

What’s going to happen when 5, 15, or 10,000 people try the Log Detective service at the same time?

Let’s start the research.
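
As a starting point for that research, here is a minimal load-probe sketch: both llama-cpp-python’s server and vLLM can expose an OpenAI-compatible /v1/chat/completions endpoint, so we can fire a handful of concurrent requests at whichever backend is running and watch the latencies. The URL, model name, and prompt below are placeholders, not our production setup.

    # Minimal concurrency probe against an OpenAI-compatible chat endpoint.
    # Works against llama-cpp-python's server as well as vLLM's server.
    import asyncio
    import time

    import httpx

    BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder URL
    PAYLOAD = {
        "model": "default",  # the model name depends on the server configuration
        "messages": [{"role": "user", "content": "Explain this build failure: ..."}],
        "max_tokens": 128,
    }

    async def one_request(client: httpx.AsyncClient) -> float:
        """Send one request and return its wall-clock latency in seconds."""
        start = time.perf_counter()
        response = await client.post(BASE_URL, json=PAYLOAD, timeout=300.0)
        response.raise_for_status()
        return time.perf_counter() - start

    async def main(concurrency: int = 15) -> None:
        async with httpx.AsyncClient() as client:
            latencies = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
        latencies.sort()
        print(f"n={concurrency} median={latencies[len(latencies) // 2]:.1f}s max={latencies[-1]:.1f}s")

    if __name__ == "__main__":
        asyncio.run(main())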

Autumn in southern Moravia


Generating the first set of data for LogDetective using InstructLab

In the last blog post (Using InstructLab in Log Detective), we went through the installation and setup process for InstructLab. That post finished with knowledge preparation. We’ll continue from there and hopefully end this one with data generated by InstructLab.

Fennel flower in our garden


Using InstructLab in Log Detective

We are going to continue the Log Detective series:

  1. Introducing Log Detective
  2. Running logdetective on Red Hat OpenShift AI with CUDA
  3. Running logdetective on an EC2 VM with CUDA
  4. Running logdetective service in containers with CUDA on EC2

This time we’ll start exploring how to use InstructLab in the Log Detective infrastructure.

In this first post, we’ll obtain InstructLab and start the exploration. We will use the official RHEL AI container image that was recently released: https://www.redhat.com/en/about/press-releases/red-hat-enterprise-linux-ai-now-generally-available-enterprise-ai-innovation-production

Eggplant flower in our garden


Running logdetective service in containers with CUDA on EC2

This is a follow-up to my previous post “Running logdetective on an EC2 VM with CUDA”. This time, though, we’ll run the service and do our first inference!

From the previous post, we already have:

  1. All steps to create a Containerfile
  2. The EC2 VM with Tesla T4
  3. Podman set up
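
To make that first inference concrete, here is a rough sketch of how we might poke the service over HTTP once the container is up. The port, the /analyze endpoint path, and the payload shape are assumptions for illustration only, not the documented logdetective service API.

    # Hypothetical smoke test for the logdetective service running in the container.
    # The port, the "/analyze" path, and the payload are illustrative assumptions;
    # check the actual logdetective service code for the real contract.
    import requests

    SERVICE_URL = "http://localhost:8080/analyze"  # assumed published port

    response = requests.post(
        SERVICE_URL,
        json={"url": "https://example.org/build.log"},  # public URL of a build log
        timeout=600,  # the first inference can take a while until the model warms up
    )
    response.raise_for_status()
    print(response.json())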


Running logdetective on an EC2 VM with CUDA

This is a follow-up to my previous blog post about running logdetective on RHOAI with CUDA.

Here we’re starting with a fresh EC2 VM that has an NVIDIA GPU.

We have two challenges ahead of us:

  1. Storage: CUDA takes a lot of space, so we need to think ahead about where we’ll store gigabytes of these binaries.

  2. GCC: Right now CUDA supports GCC from F39, while we have F40 as our host system.

To address both issues, we’ll run an F39 container rootless, with the graphroot stored on an external volume.


Running logdetective on Red Hat OpenShift AI with CUDA

Let’s run logdetective on Red Hat OpenShift AI using a Jupyter notebook with llama-cpp-python and CUDA.
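
The core of the notebook boils down to loading a GGUF model with llama-cpp-python and offloading its layers to the GPU. Here is a minimal sketch of that idea, not the notebook itself; the model path is a placeholder and a CUDA-enabled llama-cpp-python build is assumed.

    # Minimal llama-cpp-python + CUDA sketch. Requires llama-cpp-python built
    # with GPU support (e.g. CMAKE_ARGS="-DGGML_CUDA=on" in recent releases).
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,  # offload all layers to the GPU
        n_ctx=4096,       # context window
    )

    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Why did this RPM build fail? ..."}],
        max_tokens=256,
    )
    print(result["choices"][0]["message"]["content"])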

Microsoft Designer: Futuristic detective who inspects shiny crystals, comics style.


Love and Hate

I made a significant discovery recently about my life.

I can love and hate something. At the same time. There are many such things. This dynamic affects my whole life significantly. I really do mean those words, “love” and “hate”. I love talking to people. But I can also hate it immensely. The imbalance can drive me crazy, and I’m so glad I could finally put a name on this situation. I love you, and hate you, at the same time. The balance between the two changes every day. Like a sunset or a sunrise. Light and darkness live inside me.

Lightning Sunset
