AddLM — a multiplication-free language model

Tags: AI, LLM, Quantization, Transformers, Research
An open experiment in training transformer language models without real-number matrix multiplication — ternary weights plus tropical (additive) attention. At 25 M parameters, it beats the matched float baseline on tinyshakespeare.
Author: Kader Mohideen

Published: May 7, 2026

Summary

An open experiment in training transformer language models without real-number matrix multiplication. AddLM combines two ideas:

  • Ternary weights {−1, 0, +1} — BitNet-1.58 style, straight-through estimator at train time.
  • Tropical attention — scores computed as −L1(q, k) instead of q · k, with top-k routing instead of dense softmax.

At inference, every linear projection becomes a signed sum and every attention score is a sum of absolute differences. No real-number multiplications anywhere in the forward pass.
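
To make the ternary idea concrete, here is a minimal PyTorch sketch of a BitNet-1.58-style linear layer with a straight-through estimator. The absmean scale and rounding rule are assumptions about what the repo's TernaryLinear does, not a copy of src/addlm.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Linear layer whose weights are quantized to {-1, 0, +1} in the forward pass.

    Float latent weights are kept for the optimizer; the quantization step is
    bypassed in the backward pass with a straight-through estimator (STE).
    """

    def __init__(self, in_features, out_features, bias=False):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-8)             # per-tensor absmean scale
        w_ternary = torch.round(w / scale).clamp(-1, 1)    # values in {-1, 0, +1}
        # STE: the forward pass sees ternary weights, the backward pass
        # treats the quantization as the identity.
        w_q = w + (w_ternary * scale - w).detach()
        return F.linear(x, w_q, self.bias)

At inference the scalar scale can be folded out of the projection, so each output element is just a signed sum of inputs.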

Key result

Model size Float baseline (val loss) AddLM (val loss) Gap
0.82 M params 1.88 2.11 +13.0%
4.8 M params 1.56 1.63 +4.4%
25 M params 1.68 1.60 −4.3%

At 25 M parameters AddLM beats the matched float transformer on validation loss. The ternary weight constraint acts as a regularizer that prevents the overfitting the float model suffers at this scale.

Why it matters

The original transformer spends ~99% of its compute on multiplications inside Q K^T, the output projections, and the FFN. On modern hardware these multiplications dominate energy use. If language models can be trained with the AddLM algebra, then on dedicated silicon (FPGAs, ASICs, or future low-bit accelerators) inference becomes:

  • ~30× cheaper per operation — signed adders vs floating-point multipliers
  • ~16× smaller weight storage — ternary weights pack into 2 bits each (≈1.58 bits of information) vs 32-bit floats; see the packing sketch below
  • No multiplier circuitry needed at all for the main matmul-heavy paths

This extends the BitNet b1.58 direction by also removing multiplication from attention scoring.
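
The storage figure is simple arithmetic: four ternary weights pack into one byte, which is 16× smaller than float32 (the 1.58-bit entropy limit would be closer to 20×). A toy packing routine, hypothetical and not shipped in the repo, makes it explicit:

import numpy as np

def pack_ternary(w_ternary: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, +1} into 2 bits each, four weights per byte."""
    codes = (w_ternary + 1).astype(np.uint8)        # {-1, 0, +1} -> {0, 1, 2}
    codes = np.pad(codes, (0, -len(codes) % 4))     # pad to a multiple of 4
    codes = codes.reshape(-1, 4)
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

w = np.random.randint(-1, 2, size=25_000_000)       # 25 M ternary weights
packed = pack_ternary(w)
print(f"{packed.nbytes / 1e6:.2f} MB packed vs {w.size * 4 / 1e6:.0f} MB as float32")
# -> 6.25 MB packed vs 100 MB as float32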

Architecture at a glance

Component Original transformer AddLM
Linear projections (Q, K, V, output, FFN) nn.Linear (float) TernaryLinear
Attention score Q K^T / √d −L1(Q, K)
Attention norm softmax over all positions softmax over top-k (k = 16)
FFN activation GELU ReLU
Embeddings & final head float float (kept as-is)
LayerNorm float float (kept as-is)

The forward pass uses only +, −, max, and sign-flips.
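
A minimal sketch of the scoring idea, single head and no causal mask; the repo's TropicalAttention also handles heads, masking, and scaling, so this is an illustration rather than its exact code:

import torch
import torch.nn.functional as F

def tropical_attention(q, k, v, top_k=16):
    """Attention scored with -||q - k||_1 instead of q . k, routed to the top-k keys.

    q, k, v: (batch, seq, dim).
    """
    # Pairwise scores via sums of absolute differences: (batch, seq_q, seq_k).
    scores = -(q.unsqueeze(2) - k.unsqueeze(1)).abs().sum(dim=-1)
    # Keep only the top-k keys per query and softmax over those positions.
    top_k = min(top_k, scores.size(-1))
    vals, idx = scores.topk(top_k, dim=-1)              # (batch, seq_q, top_k)
    weights = F.softmax(vals, dim=-1)
    # Gather the selected values and take the weighted sum.
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, v.size(-1))
    v_sel = v.unsqueeze(1).expand(-1, q.size(1), -1, -1).gather(2, idx)
    return (weights.unsqueeze(-1) * v_sel).sum(dim=2)

With top_k equal to the sequence length this collapses back to dense softmax attention over an additive score.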

Ablation — which change costs what

At 4.8 M params, 2 500 steps on tinyshakespeare:

Variant Val loss Cost vs Float
Float baseline 1.595 (reference)
Tropical attention only 1.626 +0.031
Ternary weights only 1.726 +0.131
Full AddLM 1.776 +0.181

Removing multiplication from attention scoring is nearly free (+0.031). Most of AddLM’s training-loss cost comes from the ternary weight constraint, not the attention rewrite — useful signal for accelerator design.

What’s in the repo

addlm/
├── src/addlm.py                  # standalone library: TernaryLinear, TropicalAttention, HybridGPT
├── notebooks/
│   ├── 01_float_baseline.ipynb         # original transformer reference
│   ├── 02_addlm_vs_float_small.ipynb   # 0.8 M head-to-head
│   ├── 03_addlm_scaled.ipynb           # 4.8 M, 5 k steps
│   └── 04_experiments.ipynb            # 25 M run + ablations + top-k sweep
└── results/                       # per-experiment writeups

Four Colab-runnable notebooks reproducing every number in the paper. No setup required — Runtime → Run all.

Caveats

  • No real wall-clock speedup yet. PyTorch still runs everything through dense float matmul kernels; the weights are merely constrained to ternary values. Capturing the speedup needs a custom CUDA / Triton kernel.
  • One dataset. Char-level tinyshakespeare. Regularization advantage may not transfer identically to larger corpora or BPE tokenization.
  • Embeddings and final head are still float. Easy follow-up.
  • Short context (T = 128). The regime where top-k routing should pay off most isn’t exercised here.

Open questions

  1. 50 M – 200 M parameters. Does the AddLM lead widen at scale?
  2. A real ternary Triton kernel. Predicted speedup: 10–30× on inference.
  3. Long-context tropical attention. At T = 4 096+, top-8 routing should drastically beat dense softmax.
  4. Better tropical scores. The true tropical inner product max_i (q_i + k_i) is even cheaper than −L1 — does it train? (See the sketch after this list.)
  5. Different corpora. Wikipedia, code, conversation.
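
For question 4, the two additive scores side by side (a toy comparison, not code from the repo):

import torch

q, k = torch.randn(64), torch.randn(64)

score_l1      = -(q - k).abs().sum()   # AddLM's current score: subtractions, abs, one sum
score_maxplus = (q + k).max()          # true tropical inner product: one add per dim, then a max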

Tech stack

  • Framework: PyTorch 2.0+
  • Hardware: A100 (Colab), single GPU
  • Dataset: Karpathy’s tinyshakespeare (1.1 M chars, char-level)
  • Inspiration: Microsoft’s BitNet b1.58