pytorch/torchtitan

TorchTitan Development Guide

View on GitHub ↗Yours? Claim it ↗

§ 01 — Stats

Stars5.3k

Forks811

Prior1442

Quality—

Score—

Tasks—

§ 02 — Use

Drop into your project.

A CLAUDE.md is just a markdown file at the root of your repo. Copy the content below into your own project's CLAUDE.md to give your agent the same context.

One-line install · current directory

$npx versuz@latest install pytorch-torchtitan --kind=claude-md

Or curl directly

$curl -o CLAUDE.md https://raw.githubusercontent.com/pytorch/torchtitan/HEAD/CLAUDE.md

Project typepython-data

Tokens

Embed badge

Show

Style

[![Versuz · pytorch/torchtitan](https://versuz.dev/badge/claude-md/pytorch-torchtitan)](https://versuz.dev/claude-md/pytorch-torchtitan)

Show CLAUDE.md content (~1.6k tokens)

# TorchTitan Development Guide

## Build & Test

```bash
# Install dev dependencies
pip install -r requirements.txt -r requirements-dev.txt

# Lint and format (required before any PR)
pre-commit run --all-files

# Run unit tests
pytest tests/ -x
```

### Run GPU integration tests (requires GPUs)
Integration tests override default config for Llama 3 debug model.
See tests/integration_tests/ for `OverrideDefinitions`.

### Validating Numerics
Non-computation changes (e.g. activation checkpointing, refactoring) must produce
**identical loss** before vs. after with `--debug.seed=42` and `--debug.deterministic`.
Computation changes require loss convergence on representative datasets (e.g. C4).

With the same parallelisms, GPU settings, and the debug options, two runs should produce
bit-wise identical loss and grad_norm. Note that stdout only prints the most
significant five digits, which may not be enough. Follow `scripts/loss_compare.py` to
enable profiling and check loss and grad_norm from the TensorBoard results.

You should NEVER use `--debug.deterministic_warn_only`.

## Core Principles

1. **PyTorch-native training techniques.** Core torchtitan's training infrastructure
   and parallelism code must not depend on non-PyTorch libraries. Techniques with
   moderate-to-large complexity belong in their proper upstream repo (pytorch/pytorch
   for parallelisms, pytorch/data for data loaders, etc.).

2. **Investigate root cause before patching.** Don't land band-aid fixes. Understand
   *why* something fails before proposing a solution. If a change seems to help but
   you can't explain why, dig deeper.

3. **Reuse over duplication.** Before writing new code, check if existing implementations
   already handle the case. Unify similar code paths across models rather than creating
   per-model wrappers. If upstream (torchao, PyTorch) already provides functionality,
   use it.

4. **Don't leak experiments into core.** The `torchtitan/experiments/` folder exists for
   a reason. Don't modify core torchtitan code to accommodate experiment-specific needs
   (e.g. don't add `if experiment_x:` branches to core files). Deprecated files should
   be removed, not updated.

5. **Protect battle-tested code paths.** Be cautious changing converged behavior. Flag
   potential silent breakage of existing user code or checkpoints. When in doubt, ask.

6. **Audit all callsites.** When changing shared code (common model components, config
   fields, distributed utilities), check and update every callsite. This includes all
   model variants: llama3, llama4, qwen3, deepseek_v3, gpt_oss, flux, etc.

## Code Style

### Naming
- Names must be **accurate, descriptive, and reflect actual scope**. Don't use
  "toy/test/temp" in production names — put that context in docstrings instead.
- Follow upstream conventions: match torchao and PyTorch naming where applicable.
  E.g. if torchao calls it `Float8Linear`, use `Float8Linear` not `Float8Config`.
- Use `num_` prefix for counts (e.g. `num_expert_groups` not `n_expert_groups`)
  when not directly matching an upstream API.
- **`axis` names a specific mesh axis; `dim` is for tensors and mesh shape.**
  In any name we own — variables, parameters, attributes, helpers, comments,
  docstrings, error messages — use ``axis``/``axes`` when referring to a
  specific ``DeviceMesh`` axis (TP axis, ``dp_shard`` axis, the list of axes
  a spec references). Use ``dim``/``dimensional`` for the mesh's *shape*
  ("1D mesh", "multi-dimensional SPMD mesh") and for tensor dimensions; bare
  ``dim`` on its own should refer to a tensor dimension. The exception is
  when calling into PyTorch upstream APIs (``DeviceMesh.mesh_dim_names``,
  ``DataParallelMeshDims``, etc.): match the upstream spelling at the call
  site, then assign into a locally named ``mesh_axis_names`` if the value
  flows through our code.

### Code Placement
- Put code in the **most general applicable location**:
  - Model-agnostic parallelism helpers → `torchtitan/distributed/`
  - Shared model components (attention, MoE, embeddings) → `torchtitan/models/common/`
  - Model-specific code → the specific model folder
- Don't put model-agnostic functionality in model-specific files just because
  that's where you first needed it.

### Assertions and Error Handling
- **`ValueError`** for user-facing errors (bad config, invalid input).
- **`assert`** only for internal invariants that indicate programmer error.
- Always validate mesh axes, tensor placements, and config values explicitly
  in distributed code — don't assume a 1D mesh or specific placements.
- When a code path silently skips user configuration, **emit a warning**.

### Parameters and Config
- Important parameters first; less important ones later.
- Prefer keyword-only arguments after the first positional arg.
- No `None` defaults for required config fields.
- `dataclasses.replace()` is a shallow copy: nested dataclasses and list/dict
  fields are shared by reference. Be explicit when deep copies are needed.

### Comments and Documentation
- Add comments only for genuinely non-obvious things: dimension semantics,
  parallelism gradient placements, why a workaround exists.
- Use TODO comments for known limitations with a brief explanation.
- Put descriptions in docstrings, not in names.

## PR Expectations

1. **Lint first.** Run `pre-commit run --all-files` and fix all issues before
   requesting review. CI linting failures waste everyone's time.
2. **Show numerical proof.** Include loss comparison for any non-trivial change.
3. **Explain "why" not just "what"** in the PR description.
4. **Add tests.** New features need CPU unit tests at minimum; GPU integration
   tests when involving parallelism. Verify CI actually runs the intended test
   config (check `--model.name` and other flags).
5. **Keep model code minimal.** After model changes, ensure original checkpoints
   still load correctly. Document reasons for model changes.