GenAI System Design
Practical notes on building and scaling LLM systems in production.
This blog covers the practical side of building and scaling GenAI systems in production: inference optimization, memory management, retrieval pipelines, serving infrastructure, and more. The focus is on how these systems actually work at scale, the trade-offs involved, and the decisions engineers face when putting them into production.
Recent Posts
- Case Study: Designing a Document Intelligence Platform, From ML to GenAI to Hybrid
  A senior engineer's walkthrough of the same two-capability document intelligence system built twice: first with traditional ML (BM25, collaborative filtering, learning-to-rank), then evolved with GenAI (dense retrieval, RAG, semantic ranking), and finally composed as a hybrid.
- Case Study: Designing a Multi-Tenant LoRA Fine-Tuning and Serving Platform
  A production deep dive into per-tenant adapter training pipelines, GPU memory management for shared base models with swappable LoRA adapters, heterogeneous batching, and adapter-aware routing at scale.
- Case Study: Building a Domain-Specific Foundation Model for Healthcare
  A production walkthrough of custom tokenizer design, transformer architecture decisions, distributed training across 256 GPUs, and the compute math behind pre-training a 7B medical language model from scratch.
- Case Study: Post-Training a Foundation Model for Reasoning
  A production walkthrough of supervised fine-tuning, reward modeling, RLHF vs. DPO alignment, and how reinforcement learning teaches language models to reason through multi-step problems.
- Case Study: Automating Mortgage Processing with LLM Agents in a Hybrid Cloud Bank
  A production deep dive into hybrid cloud agent architecture, durable workflow orchestration, multi-country document processing, and why most of a mortgage pipeline should NOT be an agent.
Stay in the loop
Connect on LinkedIn for new posts and GenAI system design discussions.