from karpathy's microgpt · open source

KIRI — Atoms that loop

Karpathy proved a GPT fits in 202 lines of pure Python. No PyTorch. No numpy.

KIRI takes that atom and changes the language. Instead of English, it speaks infrastructure state, work patterns, task sequences. Same architecture. Different vocabulary. Suddenly it's a pattern detector, anomaly finder, and autonomous decision engine — running on a Mac Mini, forever, at zero cost.

Output feeds input. The loop compounds. The system teaches itself.

4,192 params (microgpt) · 27,840 params (pulse atom) · 0 dependencies · pure Python

microgpt.py — The Complete Algorithm

Posted by Andrej Karpathy on Feb 11, 2026 (code, rendered). A GPT that trains and generates text in 202 lines — 161 of them actual code. Zero imports beyond os, math, and random.

202      Total Lines
4,192    Parameters
0        Dependencies
27       Vocab (chars)

The 5 Parts

┌──────────────────────────────────────────────────────┐
│  1. AUTOGRAD ENGINE (Value class)                     │
│     Every scalar tracks its own gradient.             │
│     This IS backpropagation.                          │
├──────────────────────────────────────────────────────┤
│  2. TOKENIZER (character-level)                       │
│     26 lowercase letters + BOS = 27 tokens            │
│     BOS used as BOTH start and end token              │
├──────────────────────────────────────────────────────┤
│  3. MODEL (decoder-only transformer)                  │
│     Embedding → RMSNorm → Attention → MLP → Logits   │
│     RMSNorm (not LayerNorm), ReLU (not GeLU)          │
│     Separate lm_head (no weight tying)                │
├──────────────────────────────────────────────────────┤
│  4. TRAINING (Adam optimizer)                         │
│     β1=0.85, β2=0.99, linear LR decay                │
│     Single loss.backward() on averaged sequence loss  │
├──────────────────────────────────────────────────────┤
│  5. INFERENCE (temperature sampling)                  │
│     temperature=0.5, generates 20 names               │
│     Samples proportionally from distribution          │
└──────────────────────────────────────────────────────┘

Part 1: Autograd Engine

The Value class wraps a single number. Every operation (+, ×, exp, log, relu, pow) returns a new Value that remembers its parents and the local derivative. Calling .backward() walks the graph in reverse topological order, accumulating gradients via chain rule.

class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data          # the scalar value
        self.grad = 0             # d(loss)/d(self), filled by backward()
        self._children = children
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological sort → reverse walk → chain rule
        topo, visited = [], set()
        def build_topo(v):         # DFS post-order
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1              # d(loss)/d(loss) = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad  # THE chain rule

This is the entire backpropagation algorithm. Every neural network framework — PyTorch, JAX, TensorFlow — does exactly this, but on tensors instead of scalars. Understanding these 40 lines = understanding how all deep learning trains.
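To make the chain-rule walk concrete, here is a minimal, self-contained version of the same Value idea with a worked gradient. This is an illustrative sketch, not the verbatim microgpt source:

```python
# Minimal scalar autograd in the style described above (illustrative sketch).
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):           # DFS post-order
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad   # chain rule

# loss = a*b + a  →  d(loss)/da = b + 1 = 4,  d(loss)/db = a = 2
a, b = Value(2.0), Value(3.0)
loss = a * b + a
loss.backward()
```

Note that `a` appears twice in the graph, so its gradient accumulates from both paths — exactly why `child.grad +=` is an addition, not an assignment.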

Part 2: Tokenizer

uchars = sorted(set(''.join(docs)))   # 26 unique chars from names
BOS = len(uchars)                     # token 26 = beginning/end
vocab_size = len(uchars) + 1          # 26 chars + BOS = 27 total

BOS serves as both start AND end. No separate EOS token. The model learns "after the last character of a name, BOS comes next" — BOS signals both "start generating" and "stop generating."
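A quick sketch of the round trip, assuming the docs are lowercase names (the `encode`/`decode` helper names are mine, not from the file):

```python
docs = ["emma", "olivia"]                    # sample training docs
uchars = sorted(set(''.join(docs)))          # unique chars
BOS = len(uchars)                            # BOS id = one past the last char
vocab_size = len(uchars) + 1

def encode(doc):
    # BOS on both ends: start signal and stop signal
    return [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

def decode(tokens):
    return ''.join(uchars[t] for t in tokens if t != BOS)

tokens = encode("emma")
```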

Part 3: Model Architecture

  Token: 'e' (id=4)     Position: 0
       │                      │
       ▼                      ▼
  wte[4] [16-dim]        wpe[0] [16-dim]     ← Lookup (not multiply)
       └──────┬───────────────┘
              │ x = tok_emb + pos_emb
              ▼
         RMSNorm(x)                           ← Pre-norm before first layer
              │
  ┌───────────┼───────────────────────┐
  │     TRANSFORMER BLOCK (×1)        │
  │           │                       │
  │      RMSNorm → Q, K, V           │  Q,K,V,O: each 16×16 matrix
  │           │                       │
  │      4-head attention             │  head_dim = 16/4 = 4
  │      (Q·K^T / √4 → softmax → V)  │  with KV cache
  │           │                       │
  │      + residual                   │
  │           │                       │
  │      RMSNorm → MLP               │  fc1: 16→64 (4× expand)
  │      ReLU activation              │  ReLU, not squared ReLU
  │           │                       │  fc2: 64→16 (compress)
  │      + residual                   │
  └───────────┼───────────────────────┘
              │
         lm_head [27×16]              ← Separate matrix (no weight tying)
              │
         logits [27-dim] → softmax → P(next token)

Parameter Count (validated)

Component           Shape               Params   Purpose
wte                 27 × 16             432      Token embeddings
wpe                 16 × 16             256      Position embeddings
lm_head             27 × 16             432      Output projection (separate, not tied to wte)
attn (wq,wk,wv,wo)  4 × (16×16)         1,024    Query, Key, Value, Output projections
mlp (fc1,fc2)       (64×16) + (16×64)   2,048    Feed-forward network (4× expansion)
TOTAL                                   4,192
Formula:
total = vocab × n_embd          // wte: 432
      + block_size × n_embd     // wpe: 256
      + vocab × n_embd          // lm_head: 432
      + n_layer × 12 × n_embd²  // attention + MLP: 3,072
      = 4,192
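The formula can be checked directly, using the hyperparameters from the text (vocab 27, n_embd 16, block_size 16, one layer):

```python
vocab, n_embd, block_size, n_layer = 27, 16, 16, 1

wte     = vocab * n_embd               # 432: token embeddings
wpe     = block_size * n_embd          # 256: position embeddings
lm_head = vocab * n_embd               # 432: separate output head
# per layer: wq+wk+wv+wo = 4·n_embd², fc1+fc2 = (4+4)·n_embd² → 12·n_embd²
blocks  = n_layer * 12 * n_embd**2     # 3,072: attention + MLP

total = wte + wpe + lm_head + blocks   # 4,192
```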

Part 4 & 5: Training & Inference

# Training: for each name, predict next character at every position
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]  # BOS on both ends
losses = []
for pos_id in range(n):                         # n = len(tokens) - 1 targets
    logits = gpt(tokens[pos_id], pos_id, keys, values)
    probs = softmax(logits)
    losses.append(-probs[tokens[pos_id + 1]].log())  # cross-entropy
loss = (1/n) * sum(losses)
loss.backward()   # single call — traces through ENTIRE computation graph

# Adam: β1=0.85 β2=0.99 (not the usual 0.9/0.999)
# Linear LR decay to 0 over training

# Inference: temperature-controlled sampling
probs = softmax([l / 0.5 for l in logits])  # temperature=0.5 → sharper
token_id = random.choices(range(vocab_size), weights=probs)[0]
# Generates 20 samples. Names like "marin", "jorah", "kayla"
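The temperature trick is worth seeing in isolation. A minimal, self-contained sketch (numeric logits stand in for the model's output):

```python
import math, random

def softmax(logits):
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample(logits, temperature=0.5):
    # dividing logits by temperature < 1 sharpens the distribution,
    # so sampling stays random but favors likely tokens more strongly
    probs = softmax([l / temperature for l in logits])
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]
sharp = softmax([l / 0.5 for l in logits])   # temperature 0.5
flat  = softmax([l / 2.0 for l in logits])   # temperature 2.0
```

The top token gets more probability mass at temperature 0.5 than at 2.0; at temperature → 0 sampling degenerates into argmax.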

The Atom — microgpt Speaks State

A microgpt doesn't know it's speaking English. It predicts the next token in a sequence. Change the vocabulary and it predicts system states instead of characters.

  ENGLISH ATOM (Karpathy's)              STATE ATOM (KIRI)
  ─────────────────────────              ─────────────────
  vocab: a b c ... z BOS                 vocab: C0..C9 M0..M9 D0..D9 S0..S4 L0..L4 N0 N1 BOS
  27 tokens                              43 tokens
  trains on: "emma" "olivia"             trains on: "C5 M5 D4 S1 L1 N1" "C9 M9 D4 S4 L4 N1"
  predicts: next character               predicts: next metric value
  4,192 params                           27,840 params

  SAME autograd. SAME attention. SAME training loop.
  Different vocabulary. Different purpose.

State Language

Continuous metrics are quantized into buckets. CPU 0-100% becomes tokens C0 through C9 (10% each). This keeps the vocabulary small, which keeps the model tiny.

# Pulse atom schema — monitors a Mac Mini
schema = {
    'C': (0, 100, 10),   # CPU %: 10 buckets → C0 C1 ... C9
    'M': (0, 100, 10),   # Memory %: 10 buckets
    'D': (0, 100, 10),   # Disk %: 10 buckets
    'S': (0, 100, 5),    # Swap %: 5 buckets
    'L': (0, 20, 5),     # Load average: 5 buckets
    'N': (0, 1, 2),      # Network: down/up
}
# Total: 42 metric tokens + BOS = 43 vocab
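Bucketing a raw reading into a state token is a one-liner over the schema. The `quantize` helper below is my own name for the idea, not necessarily the repo's:

```python
# Hypothetical helper: map a raw metric reading to a state token
# using a (low, high, n_buckets) schema entry like the one above.
schema = {
    'C': (0, 100, 10),   # CPU %: 10 buckets → C0..C9
    'S': (0, 100, 5),    # Swap %: 5 buckets
    'N': (0, 1, 2),      # Network: down/up
}

def quantize(prefix, value):
    lo, hi, n = schema[prefix]
    value = max(lo, min(value, hi))                        # clamp into range
    bucket = min(int((value - lo) / (hi - lo) * n), n - 1) # top edge → last bucket
    return f"{prefix}{bucket}"
```

So CPU at 47% becomes C4 — the same token as 41% or 49%, which is exactly the granularity trade-off discussed under Honest Limits.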

Anomaly = Surprise

After training, the model has learned "what usually follows what." When a new observation arrives, compute the average negative log-probability across all tokens. High score = the model is surprised = anomaly.

  Normal observation (work hours, moderate load):
  C5 M5 D4 S1 L1 N1
  Model: "Yeah, seen this pattern thousands of times."
  Average score: 0.72 (low surprise)

  Anomalous observation (3am, everything maxed):
  C9 M9 D4 S4 L4 N1
  Model: "C9?! M9?! S4?! Never seen these together."
  Average score: 7.98 (high surprise)
  Per-token: C9=9.38, M9=12.11, S4=9.19 (near-zero probability)

The model doesn't need rules. No "if CPU > 90% then alert." It learns what's normal FROM YOUR DATA and flags anything that doesn't fit. It adapts as your patterns change — just retrain.
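The scoring itself is just an average negative log-probability. A sketch, with made-up per-token probabilities standing in for real model outputs:

```python
import math

def surprise(token_probs):
    # token_probs: the probability the trained model assigned to each
    # observed token. High average -log(p) = surprised model = anomaly.
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

normal    = [0.5, 0.6, 0.4, 0.5, 0.6, 0.5]     # plausible everyday tokens
anomalous = [1e-4, 1e-5, 0.4, 1e-4, 0.3, 0.5]  # several near-zero probs
```

Per-token scores (each individual `-log(p)`) identify WHICH metric is unusual; the average gives the one-number anomaly score.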

Pulse Atom Param Count (validated)

Component                Shape          Params
wte                      43 × 32        1,376
wpe                      16 × 32        512
lm_head                  43 × 32        1,376
2 layers × (attn + mlp)  2 × 12 × 32²   24,576
TOTAL                                   27,840

Composition — Atoms that Loop

An atom alone detects patterns. Two atoms piped together make decisions. The output of one feeds the input of the next. Loop it and the system teaches itself.

The Pipe (Linear)

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │ COLLECT  │────→│  ATOM    │────→│  DECIDE  │────→│   ACT    │
  │ (metrics)│     │ (predict │     │ (score → │     │ (alert   │
  │          │     │  + score)│     │  action) │     │  or log) │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘
  
  Data flows left to right. Each stage is a Python function.
  This alone = monitoring + alerting. Already useful.
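In code the pipe is plain function composition — each box is a function, and the stage bodies below are stand-ins for the real collectors and models:

```python
# Each stage is a plain function; the pipe is just composition.
def collect():
    return "C5 M5 D4 S1 L1 N1"             # stand-in metrics snapshot

def atom(observation):
    # stand-in for the trained model's surprise score
    score = 0.72 if observation.startswith("C5") else 7.98
    return observation, score

def decide(scored):
    observation, score = scored
    return "alert" if score > 3.0 else "log"

def act(action):
    return f"action: {action}"             # would send Telegram / write log

def pipe(*stages):
    def run():
        x = stages[0]()
        for stage in stages[1:]:
            x = stage(x)
        return x
    return run

monitor = pipe(collect, atom, decide, act)
```

Swapping stages (a different collector, a different atom) never touches the others — which is what makes the molecule composition below cheap.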

The Loop (Circular)

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │ COLLECT  │────→│  ATOM    │────→│  DECIDE  │────→│   ACT    │
  └──────────┘     └──────────┘     └──────────┘     └────┬─────┘
       ▲                                                  │
       │              FEEDBACK LOOP                       │
       └──────────────────────────────────────────────────┘
                   result → next input

  Now it learns from its own actions.
  You dismissed an alert → feeds back as training data.
  Next time → auto-suppresses that pattern.
  THE SYSTEM IMPROVES BY RUNNING.
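A minimal sketch of that feedback path, with the suppression logic simplified to a set lookup (the real system retrains the atom instead):

```python
# Sketch: dismissed alerts feed back as suppression training data.
suppressed = set()        # patterns the user has dismissed before
training_buffer = []      # observations queued for the next retrain

def decide(observation, score):
    if observation in suppressed:
        return "ok"                        # learned: this is a false alarm
    return "alert" if score > 3.0 else "ok"

def feedback(observation, user_response):
    training_buffer.append((observation, user_response))
    if user_response == "dismiss":
        suppressed.add(observation)        # auto-suppress next time

obs = "C9 M9 D4 S4 L4 N1"
first = decide(obs, 7.98)        # high surprise → alert
feedback(obs, "dismiss")         # user says: false alarm
second = decide(obs, 7.98)       # same pattern, now suppressed
```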

The Molecule (Multiple Atoms)

Pulse

Infra health. Mac Mini stats + MikroTik. "Is this system state normal?"

Rhythm

Work patterns. Keyboard/mouse idle time, git activity, focus blocks. "Is this a normal work day?"

Drift

Task state. Tasks added/completed, project switches. "Is scope creeping?"

Nerve

Meta-model. Trained on OTHER atoms' outputs + your responses. "What action should I take?"

Nerve is where RL emerges naturally. Your approve/dismiss responses are reward signals. Nerve learns which actions lead to approvals (reward +1) vs dismissals (reward -1). That's policy learning — same architecture, just trained on action-result pairs instead of state sequences.
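The reward mapping is the simple part; a sketch of turning a user response into a labeled training pair (field names are mine, for illustration):

```python
# Sketch: approve/dismiss responses become reward-labeled training pairs
# that Nerve can train on (state, action, reward).
def to_training_pair(atom_scores, action, user_response):
    reward = +1 if user_response == "approve" else -1
    return {"state": atom_scores, "action": action, "reward": reward}

pair = to_training_pair(
    {"pulse": 7.98, "rhythm": 0.9, "drift": 0.5},  # other atoms' scores
    "alert",                                       # what Nerve did
    "approve",                                     # what you said
)
```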

The Compound Effect

Time      What Happens
Week 1    Independent atoms, independent alerts. 3 pings for 1 situation. Annoying but data collecting.
Week 3    Nerve connects patterns. "When Rhythm=no-activity AND Drift=scope-creep → suppress Rhythm, surface Drift." 1 smart alert instead of 3.
Month 2   Cross-domain insights. "Morning habits skipped → afternoon output drops 60%. Nudge at 7am." Predictions based on YOUR data.
Month 6   Partial automation. Auto-firewall rules, auto-invoice reminders. Approve/reject trains the boundary of when to auto-act vs ask.

Self-Modification

# The action vocabulary includes meta-actions:
actions = {
    'ok':       do_nothing,
    'alert':    send_telegram,
    'suppress': mark_false_alarm,
    'retrain':  retrain_atom,      # system retrains itself
    'spawn':    create_new_atom,    # system creates new atoms
}
# Nerve generates "retrain:pulse:7d" → Pulse retrains on recent data
# → predictions improve → Nerve's decisions improve → compounds
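Dispatching such an action string is a split plus a table lookup. A sketch with stand-in handlers (the real ones would actually retrain and send):

```python
# Sketch: parse a Nerve action string like "retrain:pulse:7d" and dispatch.
# Handler bodies are stand-ins for the real functions.
def retrain_atom(atom, window):
    return f"retraining {atom} on last {window} of data"

def send_telegram():
    return "alert sent"

actions = {
    'retrain': retrain_atom,
    'alert':   send_telegram,
}

def dispatch(action_string):
    verb, *args = action_string.split(':')   # "retrain:pulse:7d" → verb + args
    return actions[verb](*args)

result = dispatch("retrain:pulse:7d")
```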

Real Results — Not Theory

These are actual outputs from training on a Mac Mini (24GB, Apple Silicon).

Training Run

Pulse Atom — 500 Steps on Real + Synthetic Data

loaded 2128 observations from 8 files
atom: 27,840 params | vocab 43

step    1/500 | loss 3.9828    ← random weights, knows nothing
step   50/500 | loss 0.3095    ← learning fast
step  250/500 | loss 0.7080    ← fluctuation normal (batch size 1)
step  500/500 | loss 0.5924    ← converged

saved weights → pulse_weights.json

Anomaly Detection

Per-Token Anomaly Scoring

Normal (work hours, moderate load):
  C5 M5 D4 S1 L1 N1
  Average surprise: 0.72
  Per token: all within expected range

Anomalous (3am, everything maxed):
  C9 M9 D4 S4 L4 N1
  Average surprise: 7.98
  Per token: C9=9.38 M9=12.11 S4=9.19 ← "never seen this"

Anomalous is 11× more surprising than normal.
The model identifies WHICH metrics are unusual and by how much.

Data Collection

Live Mac Mini Collection

$ python3 -m kiri.atoms.pulse.collect --interval 1 --duration 600

collecting 600 observations over 600s (every 1s)
  1/600 | C=15% M=74% D=41%     ← real Mac Mini stats
  100/600 | C=15% M=74% D=41%   ← collected via os/subprocess
  200/600 | C=14% M=73% D=42%   ← zero dependencies
saved 276 observations across 1 files

The Numbers

Metric                     Value
Model size                 27,840 params (<1MB on disk)
Training time (500 steps)  ~8min pure Python, 17.6s PyTorch/MPS (27× faster)
Inference time             ~100ms per observation
Data collection            1/sec (fast blast) to 1/5min (steady state)
Dependencies               0. Python 3 stdlib only.
API costs                  KES 0. Forever.
RAM usage                  <50MB (Mac Mini has 24GB)
Codebase                   ~500 lines total across all modules

Honest Limits

What this can and cannot do. No hand-waving.

Can Do

Learn repeating patterns in structured sequences. Detect when new observations don't fit learned patterns. Get better with more data. Run forever on zero resources. Compose atoms via pipes for multi-domain awareness.

Cannot Do

Remember across sequences (16-token window only). Understand causation (knows "unusual" not "why"). See slow trends (disk filling over weeks). Multivariate reasoning (learns sequential patterns, not true correlations). Handle natural language. Replace a real LLM for complex reasoning.

Specific Constraints

Constraint: Context window of 16 tokens
  Impact: Can't see patterns spanning hours/days.
  Workaround: Encode longer windows as summary tokens, or use a bigger block_size (costs more params).

Constraint: 10% bucket granularity
  Impact: CPU 41% and 49% are the same token (C4).
  Workaround: More buckets = more vocab = more params. The trade-off is configurable.

Constraint: Pure Python speed
  Impact: Training is ~27× slower than PyTorch/MPS.
  Workaround: Use AtomTorch for fast training (17.6s vs ~8min). Pure Python works everywhere with zero dependencies.

Constraint: No causation
  Impact: Flags an anomaly, can't explain it.
  Workaround: The Pipe + your response IS the explanation loop. Over time, Nerve learns cause→effect from YOUR feedback.

Constraint: Sequential token processing
  Impact: Can't truly correlate CPU↔Memory simultaneously.
  Workaround: Learns "C5 usually followed by M5" as a sequence pattern. Works in practice, not in theory.

The architecture compensates for individual atom weakness. One atom is a pattern matcher with short memory. Four atoms piped through Nerve with feedback = a system that learns cause-effect across domains over weeks. The loop is smarter than any single model.

What's Next

Done ✓

Phase 0: Understood microgpt. Every line, every gradient.

Phase 1: Extracted core modules. Pulse atom collecting real Mac Mini data. MikroTik REST API collector. Training works. Anomaly detection works (0.72 normal vs 7.98 anomalous = 11x differentiation).

Phase 2: Rhythm atom. Keyboard/mouse idle time via ioreg HIDIdleTime. Learns work patterns, flags 3am Sunday activity.

Phase 3: Drift atom. Manual CLI task logging. Detects scope creep (4.06 vs 0.47 = 8.6x). 8 tasks added / 0 completed / 5 switches = anomaly.

Phase 4: Nerve meta-model. Trained on other atoms' scores + user feedback. Predicts: ok, alert, suppress, retrain. Action vocabulary with feedback loop.

Phase 5: PyTorch/MPS acceleration. AtomTorch drop-in replacement. 500 steps in 17.6s (27x faster). Same anomaly detection quality.

Phase 6: Full daemon. Scheduled collection, all 4 atoms scoring, Nerve decisions, Telegram alerts.

Next

Production Hardening

Run the daemon on a Mac Mini for weeks. Collect real data across all atoms. Retrain on accumulated observations. Let Nerve learn from real approve/dismiss feedback.

Ideas for Later

Network security atom (MikroTik firewall logs). Financial patterns (M-Pesa/bank transactions). Phone integration (activity patterns). Sleep/energy inference from idle time data. Focus scoring from mouse movement patterns. Auto-retraining triggered by Nerve.

Open Source

KIRI is open source. The core idea — microgpt trained on state tokens for anomaly detection — belongs to everyone. The architecture (atoms, pipes, loops, self-retraining) is general enough that anyone with a computer can build their own composable intelligence system.
KIRI — Intelligent Runtime Inspector
Built on Karpathy's microgpt · Open Source · 2026