Posts

Optimize LLMs for vLLM deployment

I first came across quantization techniques through Hugging Face and Unsloth, and I have used their quantized models in Ollama and llama.cpp. I had always wondered how to do the same for vLLM. Today I learned to use llmcompressor to optimize models for vLLM.
# Silence library warnings for a cleaner notebook output
import warnings
warnings.filterwarnings("ignore")
import os, gc, math, pathlib
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Avoid tokenizer fork warnings
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
MODEL_DIR = "Qwen3-0.6B"
OUTPUT_DIR = "Qwen3-0.6B-W4A16"
print(f"Base model: {MODEL_DIR}")
print(f"Quantized model: {OUTPUT_DIR}")
from llmcompressor.
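The `W4A16` suffix in the output directory name conventionally means 4-bit weights with 16-bit activations. As a rough illustration of the weight half of that scheme, here is a minimal pure-Python sketch of symmetric 4-bit quantization for one group of weights. The helper names `quantize_group` and `dequantize_group` and the sample values are hypothetical and are not part of llmcompressor; the library's actual algorithm may differ.

```python
# Illustrative sketch only: symmetric integer quantization of one weight group.
# quantize_group / dequantize_group are hypothetical helpers, not llmcompressor APIs.

def quantize_group(weights, bits=4):
    """Map float weights to signed integers that fit in `bits` bits."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4 bits
    qmin = -(2 ** (bits - 1))       # -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    # One scale per group; the largest magnitude maps to qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_group(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Made-up weights for demonstration.
weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.05]
quantized, scale = quantize_group(weights)
restored = dequantize_group(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight is stored as a small integer plus a shared scale, so the rounding error per weight stays within half a quantization step.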

How to Run a Pod with a Fixed UID Outside the Default OpenShift UID Range

OpenShift enhances container security by assigning a random, non-root User ID (UID) to workloads by default. This helps isolate workloads running in different namespaces and prevents containers from running with predictable user IDs. While this security model works well for cloud-native applications, some third-party or legacy container images expect to run with a specific UID. A common example is the nginxinc/nginx-unprivileged image, which expects to run as UID 101.
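To see why a fixed UID such as 101 clashes with the default behaviour, it helps to check it against a namespace's allocated range. OpenShift records that range on the namespace in the `openshift.io/sa.scc.uid-range` annotation, written as `<start>/<size>`. The sketch below parses such a value; the helper names and the sample annotation value are hypothetical.

```python
# Illustrative sketch: check whether a UID falls inside a namespace's UID range.
# The annotation value below is a made-up example in the "<start>/<size>" format.

def parse_uid_range(annotation):
    """Return (first_uid, last_uid) for a '<start>/<size>' range string."""
    start_text, size_text = annotation.split("/")
    start, size = int(start_text), int(size_text)
    return start, start + size - 1


def uid_in_range(uid, annotation):
    """True if uid lies inside the range described by the annotation."""
    first, last = parse_uid_range(annotation)
    return first <= uid <= last


example_annotation = "1000680000/10000"  # hypothetical namespace value
print(parse_uid_range(example_annotation))
print(uid_in_range(101, example_annotation))
```

With a range like this one, UID 101 falls outside it, which is the situation the post addresses.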

one GEPA report of DSPy

apt install wireplumber libspa-0.2-bluetooth
systemctl --user --now disable pipewire-media-session
systemctl --user --now enable wireplumber

From Conventional Commits to LLM-Generated Release Notes

Introduction
For several years, I adopted Conventional Commits across my software projects. The premise was straightforward: write commit messages in a structured, machine-readable format, then leverage tooling to generate changelogs and release notes automatically. For example:
feat: add user login
fix: resolve payment retry issue
docs: update API usage guide
This approach served me well. Tools like Conventional Changelog could parse commit history and produce structured release notes with minimal manual effort.
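The parsing step that such tooling performs can be sketched in a few lines of Python. This is a simplified illustration, not how Conventional Changelog is implemented: the regular expression covers only the basic `type(scope)!: description` header, and the `SECTIONS` mapping and `build_release_notes` function are hypothetical names chosen for this example.

```python
import re
from collections import defaultdict

# Header shape: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<desc>.+)$"
)

# Hypothetical mapping from commit type to release-note section title.
SECTIONS = {"feat": "Features", "fix": "Bug Fixes", "docs": "Documentation"}


def build_release_notes(messages):
    """Group commit subjects by type and render plain-text release notes."""
    grouped = defaultdict(list)
    for message in messages:
        subject = message.strip().split("\n", 1)[0]
        match = HEADER.match(subject)
        if match and match.group("type") in SECTIONS:
            grouped[match.group("type")].append(match.group("desc"))

    blocks = []
    for commit_type, title in SECTIONS.items():
        if grouped[commit_type]:
            lines = [title] + [f"- {desc}" for desc in grouped[commit_type]]
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


commits = [
    "feat: add user login",
    "fix: resolve payment retry issue",
    "docs: update API usage guide",
    "tidy up whitespace",  # not a Conventional Commit, so it is skipped
]
notes = build_release_notes(commits)
print(notes)
```

Commits that do not match the header pattern are dropped silently here; a real tool would more likely report them.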

How I Keep Learning Without Forgetting Everything

Sometimes people look at my profile and wonder: “How can you keep learning so many things? What is your secret?” The truth is, there is no magic. My approach is simple: I write things down, practise important skills repeatedly, and choose technologies carefully after real hands-on exploration.
1. I Write Down What I Learn
Whenever I learn something useful, I try to capture it. Sometimes I write it as a blog post.

one GEPA report of DSPy

What is GEPA?
GEPA stands for Genetic-Pareto, a DSPy optimizer that automatically improves the prompts/instructions of a multi-module LLM program through evolutionary search. It iteratively mutates module instructions, evaluates the changes, and keeps the best-performing candidates on a Pareto front.
What’s Happening in This Run
This file captures a GEPA optimization run on a financial news extraction system that classifies M&A (merger/acquisition) articles and extracts structured data from them.
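The idea of keeping candidates on a Pareto front can be illustrated with a small filter over per-example scores. This is a generic non-dominated filter written for illustration; it is not DSPy's code, and GEPA's real selection logic may differ in detail. The candidate names and scores are made up.

```python
# Illustrative sketch: keep only candidates that no other candidate dominates.
# Each candidate has one score per validation example (higher is better).

def dominates(a, b):
    """True if score list a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the names of candidates that are not dominated by any other."""
    return [
        name
        for name, own in scores.items()
        if not any(
            dominates(other, own)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-example scores for three instruction variants.
scores = {
    "instruction_v1": [0.9, 0.2, 0.5],
    "instruction_v2": [0.4, 0.8, 0.5],
    "instruction_v3": [0.3, 0.1, 0.4],
}
front = pareto_front(scores)
print(front)
```

Here the first two variants each win on a different example, so both survive, while the third is worse than the first on every example and is dropped.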

Running Qwen3.6 MTP GGUF on AMD AI MAX 395 with llama.cpp ROCm

Background
MTP support was recently merged into llama.cpp through the following pull request: "MTP support merged into llama.cpp". After the merge, I wanted to test MTP models on my mini-PC powered by the AMD AI MAX 395. I tried several approaches, including manually building llama.cpp and using Unsloth GGUF models directly. However, despite multiple attempts, I could not get a stable working setup. I also searched through GitHub issues and asked several AI assistants, including ChatGPT, Gemini, and DeepSeek.

Fault-Oblivious Stateful Workflows: Durable Execution Matters More Than Orchestration

Introduction
Last year, I spent some time studying Oracle Banking Microservices Architecture (OBMA), together with enterprise schedulers and orchestration platforms such as Control-M. Part of the work involved understanding how to convert traditional Control-M jobs into Airflow DAGs. During this process, I started to observe an important architectural distinction: not all workflows are the same. While studying OBMA, I noticed that Netflix Conductor was used as the workflow engine inside the architecture.

Working Around ROCm PyTorch Replacement Issues with uv and ComfyUI

Introduction
When working with AMD GPUs and ROCm-based AI workloads, one common issue appears when using uv for Python dependency management. The problem becomes especially visible when setting up projects like ComfyUI on Linux with ROCm-enabled PyTorch builds. Even when ROCm-specific wheels have been installed manually, running uv add or synchronizing dependencies may silently replace the ROCm-enabled packages with standard PyPI versions that only support CUDA. This leads to broken GPU acceleration and unexpected runtime failures.
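One quick way to notice such a silent replacement is to look at the local version label of the installed PyTorch build, since ROCm wheels typically carry a suffix such as `+rocm6.1` while CUDA builds use `+cu121` or similar. The sketch below classifies a version string on that basis; the function name `classify_torch_build` and the sample version strings are hypothetical examples.

```python
# Illustrative sketch: classify a PyTorch version string by its local label.
# The sample version strings are hypothetical examples of common label shapes.

def classify_torch_build(version):
    """Return 'rocm', 'cuda', 'cpu', or 'unlabeled' for a version string."""
    _, _, local = version.partition("+")
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu"):
        return "cuda"
    if local == "cpu":
        return "cpu"
    return "unlabeled"  # no local label, for example plain package metadata


for sample in ["2.4.0+rocm6.1", "2.4.0+cu121", "2.4.0+cpu", "2.4.0"]:
    print(sample, "->", classify_torch_build(sample))
```

In a real environment the string could come from `torch.__version__` after each dependency sync, so a change from `rocm` to anything else is caught early.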

Reducing Architecture Drift in Spec-Driven Development with coding agents and LLMs

Introduction
Spec-driven development is becoming increasingly popular in the era of AI-assisted software engineering. Instead of starting directly from implementation, teams define specifications, domain rules, contracts, and architectural intentions first, allowing Large Language Models (LLMs) and automation tools to generate significant parts of the system. This approach can dramatically improve development speed, documentation quality, and alignment between business and engineering. However, one important challenge emerges quickly: architecture drift.