# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commands

**Linting:**

```bash
pre-commit run --all-files
```

Style: PEP8, max line length 120, double quotes, LF endings. C++ source under `src/` uses clang-format.

**Tests:**

```bash
pytest tests/test_lmdeploy                 # all unit tests
pytest tests/test_lmdeploy/test_model.py   # specific file
pytest tests/test_lmdeploy/test_lite/      # quantization tests
pytest tests/test_lmdeploy/test_vl/        # vision-language tests
```

**Debug logging:**

```bash
LMDEPLOY_LOG_LEVEL=DEBUG python ...
```

**Build (TurboMind C++ extension):**

- Controlled via `setup.py` + CMake. Relevant env vars: `LMDEPLOY_TARGET_DEVICE` (default `cuda`), `DISABLE_TURBOMIND`, `CMAKE_BUILD_TYPE`, `CUDACXX`.
- Requirements are split by device: `requirements/runtime_cuda.txt`, `runtime_ascend.txt`, etc.

## Architecture

### Two Backends, One Pipeline

`lmdeploy/pipeline.py` is the main user-facing entry point (`pipeline()` in `api.py`). It instantiates either the **PyTorch engine** (`lmdeploy/pytorch/`) or the **TurboMind engine** (`lmdeploy/turbomind/`) based on config.

### PyTorch Backend

**Model patching** is the core mechanism: HuggingFace models are loaded normally, then their layers are dynamically replaced with optimized LMDeploy implementations.

- `lmdeploy/pytorch/models/module_map.py` — registry mapping HF class names → LMDeploy replacement classes. Device-specific overrides live in `DEVICE_SPECIAL_MODULE_MAP`.
- `lmdeploy/pytorch/models/patch.py` — applies the substitutions at runtime via `_get_rewrite_qualname()` / `_class_from_qualname()`.
- `lmdeploy/pytorch/models/` — 40+ per-model files (e.g., `llama.py`, `qwen.py`, `deepseek_v2.py`). Each reimplements attention, MLP, and embeddings using custom kernels.
- `lmdeploy/pytorch/nn/` — reusable optimized modules: `linear/` (AWQ, W8A8, blocked-FP8, LoRA variants), `attention.py`, `norm.py`, `rotary_embedding.py`, `moe/`.
- `lmdeploy/pytorch/kernels/` — Triton/CUDA kernels (e.g., `w8a8_triton_kernels.py`).
- `lmdeploy/pytorch/backends/` — kernel/operator dispatchers per quantization type (FP8, AWQ, CUDA).

**Engine execution flow (key files):**

- `engine.py` — main PyTorch engine.
- `paging/scheduler.py` — sequences → batches; prefill/decode, block eviction, prefix caching (`BlockTrie`).
- `engine/engine_loop.py` — async inference loop.
- (See `pytorch/engine/` and `pytorch/paging/` for full execution detail.)

**Configuration dataclasses** (`lmdeploy/pytorch/config.py`): `ModelConfig`, `CacheConfig`, `SchedulerConfig`, `BackendConfig`, `DistConfig`, `MiscConfig`.

### TurboMind Backend

- Python wrapper: `lmdeploy/turbomind/turbomind.py` (~800 lines). Bridges into `lmdeploy/lib/_turbomind` (a pybind11 extension built from `src/turbomind/`).
- Tensor interop via `torch.from_dlpack()` / `_tm.from_dlpack()`.
- Config and model conversion: `lmdeploy/turbomind/deploy/config.py`, `supported_models.py`.
- Parallel config helpers: `update_parallel_config()`, `complete_parallel_config()` in `messages.py`.
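For orientation across both backends, a minimal sketch of how the engines above are selected through the public `pipeline()` API. The model id is illustrative, and `PytorchEngineConfig` is the PyTorch-engine counterpart of the `TurbomindEngineConfig` listed under Other Key Files:

```python
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

# backend_config selects the engine: PytorchEngineConfig routes to
# lmdeploy/pytorch/, TurbomindEngineConfig to lmdeploy/turbomind/.
pipe = pipeline(
    "internlm/internlm2_5-7b-chat",  # illustrative model id
    backend_config=PytorchEngineConfig(tp=1),
)
responses = pipe(["Hello, introduce yourself."],
                 gen_config=GenerationConfig(max_new_tokens=64))
print(responses[0].text)
```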
### Lite / Quantization

Entrypoints live in `lmdeploy/lite/apis/`: `calibrate.py` (main), `auto_awq.py`, `gptq.py`, `smooth_quant.py`.

**Flow:** load HF model → `CalibrationContext` collects activation statistics → scale computation (`lmdeploy/lite/quantization/`) → write quantized weights.

- `lite/quantization/awq.py` — AWQ (`NORM_FCS_MAP`, `FC_FCS_MAP` define per-model layer structure).
- `lite/quantization/weight/quantizer.py` — weight quantizer.
- `lite/quantization/activation/observer.py` — activation statistics.
- `lite/modeling/` — model-specific GPTQ implementations (e.g., `internlm2_gptq.py`).
- `lite/utils/cal_qparams.py` — quantization parameter calculation utilities.

Layer/norm/head mappings per model family are defined directly in `calibrate.py` and `awq.py`.

### Vision-Language Models

- `lmdeploy/vl/model/` — VLM preprocessing (InternVL, Qwen-VL, LLaVA, CogVLM, etc.).
- `lmdeploy/vl/media/` — image/video loaders and base classes.
- `lmdeploy/pytorch/multimodal/` — multimodal input handling for the PyTorch engine.
- Reference VLM implementation: `lmdeploy/vl/model/qwen3.py`.

### Other Key Files

- `lmdeploy/messages.py` — core types: `GenerationConfig`, `EngineConfig`, `TurbomindEngineConfig`, `SchedulerSequence`, `MessageStatus`.
- `lmdeploy/model.py` — chat templates; critical for correct conversation formatting.
- `lmdeploy/archs.py` — architecture registry mapping model arch names to runtime patches.
- `lmdeploy/tokenizer.py` — HuggingFace/SentencePiece tokenizer wrapper.
- `lmdeploy/serve/openai/` — OpenAI-compatible API server.

## Adding a New PyTorch Model

Use the `/support-new-model` skill for a complete step-by-step guide.
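As a hedged illustration of the patching registry this process revolves around (the constant names and entries below are assumptions; consult `lmdeploy/pytorch/models/module_map.py` for the real ones):

```python
# Hypothetical sketch of the HF-class-name -> rewrite-qualname registry
# described under "PyTorch Backend"; exact names/entries are assumptions.
LMDEPLOY_PYTORCH_MODELS = "lmdeploy.pytorch.models"

MODULE_MAP = {
    # HF architecture name -> qualified name of the LMDeploy rewrite class
    "LlamaForCausalLM": f"{LMDEPLOY_PYTORCH_MODELS}.llama.LlamaForCausalLM",
    "Qwen2ForCausalLM": f"{LMDEPLOY_PYTORCH_MODELS}.qwen2.Qwen2ForCausalLM",
}

# At load time, patch.py resolves each qualname
# (_get_rewrite_qualname() / _class_from_qualname()) and swaps the class in.
```

In rough terms, supporting a new architecture amounts to writing a rewrite module under `lmdeploy/pytorch/models/` and registering it in this map; the skill walks through the full checklist.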