Karpathy proved a GPT fits in 202 lines of pure Python. No PyTorch. No numpy.
KIRI takes that atom and changes the language. Instead of English, it speaks infrastructure state, work patterns, task sequences. Same architecture. Different vocabulary. Suddenly it's a pattern detector, anomaly finder, and autonomous decision engine — running on a Mac Mini, forever, at zero cost.
Output feeds input. The loop compounds. The system teaches itself.
Posted by Andrej Karpathy on Feb 11, 2026 (code, rendered). A GPT that trains and generates text in 202 lines (161 of actual code). Zero imports beyond os, math, random.
┌──────────────────────────────────────────────────────┐
│ 1. AUTOGRAD ENGINE (Value class)                     │
│    Every scalar tracks its own gradient.             │
│    This IS backpropagation.                          │
├──────────────────────────────────────────────────────┤
│ 2. TOKENIZER (character-level)                       │
│    26 lowercase letters + BOS = 27 tokens            │
│    BOS used as BOTH start and end token              │
├──────────────────────────────────────────────────────┤
│ 3. MODEL (decoder-only transformer)                  │
│    Embedding → RMSNorm → Attention → MLP → Logits    │
│    RMSNorm (not LayerNorm), ReLU (not GeLU)          │
│    Separate lm_head (no weight tying)                │
├──────────────────────────────────────────────────────┤
│ 4. TRAINING (Adam optimizer)                         │
│    β1=0.85, β2=0.99, linear LR decay                 │
│    Single loss.backward() on averaged sequence loss  │
├──────────────────────────────────────────────────────┤
│ 5. INFERENCE (temperature sampling)                  │
│    temperature=0.5, generates 20 names               │
│    Samples proportionally from distribution          │
└──────────────────────────────────────────────────────┘
The Value class wraps a single number. Every operation (+, ×, exp, log, relu, pow) returns a new Value that remembers its parents and the local derivative. Calling .backward() walks the graph in reverse topological order, accumulating gradients via chain rule.
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # the scalar value
        self.grad = 0                    # d(loss)/d(self), filled by backward()
        self._children = children
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological sort → reverse walk → chain rule
        topo, visited = [], set()
        def build_topo(v):               # DFS post-order
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1                    # d(loss)/d(loss) = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad   # THE chain rule
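A quick sanity check of the sketch above, using only the two ops shown and gradients worked by hand:

```python
# f = a*b + a, so df/da = b + 1 = 4 and df/db = a = 2
a = Value(2.0)
b = Value(3.0)
f = a * b + a
f.backward()
print(f.data)   # 8.0
print(a.grad)   # 4.0
print(b.grad)   # 2.0
```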
uchars = sorted(set(''.join(docs))) # 26 unique chars from names
BOS = len(uchars) # token 26 = beginning/end
vocab_size = len(uchars) + 1 # 26 chars + BOS = 27 total
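A short round-trip sketch under that scheme; the `encode`/`decode` helpers are illustrative names, not from the source:

```python
def encode(doc):
    # BOS on both ends, exactly as in the training loop below
    return [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

def decode(tokens):
    return ''.join(uchars[t] for t in tokens if t != BOS)

assert decode(encode("emma")) == "emma"
# "emma" → [26, 4, 12, 12, 0, 26] when uchars is 'a'..'z'
```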
Token: 'e' (id=4) Position: 0
│ │
▼ ▼
wte[4] [16-dim] wpe[0] [16-dim] ← Lookup (not multiply)
└──────┬───────────────┘
│ x = tok_emb + pos_emb
▼
RMSNorm(x) ← Pre-norm before first layer
│
┌───────────┼───────────────────────┐
│ TRANSFORMER BLOCK (×1) │
│ │ │
│ RMSNorm → Q, K, V │ Q,K,V,O: each 16×16 matrix
│ │ │
│ 4-head attention │ head_dim = 16/4 = 4
│ (Q·K^T / √4 → softmax → V) │ with KV cache
│ │ │
│ + residual │
│ │ │
│ RMSNorm → MLP │ fc1: 16→64 (4× expand)
│ ReLU activation │ ReLU, not squared ReLU
│ │ │ fc2: 64→16 (compress)
│ + residual │
└───────────┼───────────────────────┘
│
lm_head [27×16] ← Separate matrix (no weight tying)
│
logits [27-dim] → softmax → P(next token)
| Component | Shape | Params | Purpose |
|---|---|---|---|
| wte | 27 × 16 | 432 | Token embeddings |
| wpe | 16 × 16 | 256 | Position embeddings |
| lm_head | 27 × 16 | 432 | Output projection (separate, not tied to wte) |
| attn (wq,wk,wv,wo) | 4 × (16×16) | 1,024 | Query, Key, Value, Output projections |
| mlp (fc1,fc2) | (64×16) + (16×64) | 2,048 | Feed-forward network (4× expansion) |
| TOTAL | | 4,192 | |
Formula: total = vocab × n_embd           // wte: 432
               + block_size × n_embd      // wpe: 256
               + vocab × n_embd           // lm_head: 432
               + n_layer × 12 × n_embd²   // attention + MLP: 3,072
               = 4,192
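The same arithmetic as a throwaway script, covering both this atom and the KIRI atom described later (a sketch for checking the numbers, not code from either project):

```python
def gpt_params(vocab, n_embd, block_size, n_layer):
    # wte + wpe + lm_head + per-layer weights (4 attention mats + 2 MLP mats)
    return (vocab * n_embd                # wte
            + block_size * n_embd         # wpe
            + vocab * n_embd              # lm_head (separate, not tied)
            + n_layer * 12 * n_embd**2)   # attn (4·n_embd²) + MLP (8·n_embd²)

print(gpt_params(vocab=27, n_embd=16, block_size=16, n_layer=1))   # 4192
print(gpt_params(vocab=43, n_embd=32, block_size=16, n_layer=2))   # 27840
```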
# Training: for each name, predict next character at every position
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]     # BOS on both ends
losses = []
for pos_id in range(n):
    logits = gpt(tokens[pos_id], pos_id, keys, values)
    probs = softmax(logits)
    losses.append(-probs[tokens[pos_id + 1]].log())           # cross-entropy
loss = (1/n) * sum(losses)
loss.backward()   # single call — traces through ENTIRE computation graph

# Adam: β1=0.85, β2=0.99 (not the usual 0.9/0.999)
# Linear LR decay to 0 over training

# Inference: temperature-controlled sampling
probs = softmax([l / 0.5 for l in logits])                    # temperature=0.5 → sharper
token_id = random.choices(range(vocab_size), weights=probs)[0]
# Generates 20 samples. Names like "marin", "jorah", "kayla"
A microgpt doesn't know it's speaking English. It predicts the next token in a sequence. Change the vocabulary and it predicts system states instead of characters.
ENGLISH ATOM (Karpathy's)          STATE ATOM (KIRI)
─────────────────────────          ─────────────────
vocab: a b c ... z BOS             vocab: C0..C9 M0..M9 D0..D9
                                          S0..S4 L0..L4 N0 N1 BOS
27 tokens                          43 tokens

trains on: "emma" "olivia"         trains on: "C5 M5 D4 S1 L1 N1"
                                              "C9 M9 D4 S4 L4 N1"
predicts: next character           predicts: next metric value

4,192 params                       27,840 params

SAME autograd. SAME attention. SAME training loop.
Different vocabulary. Different purpose.
Continuous metrics are quantized into buckets: CPU 0-100% becomes tokens C0 through C9 (10% per bucket). Keeping the vocabulary small keeps the model tiny.
# Pulse atom schema — monitors a Mac Mini
schema = {
    'C': (0, 100, 10),  # CPU %: 10 buckets → C0 C1 ... C9
    'M': (0, 100, 10),  # Memory %: 10 buckets
    'D': (0, 100, 10),  # Disk %: 10 buckets
    'S': (0, 100, 5),   # Swap %: 5 buckets
    'L': (0, 20, 5),    # Load average: 5 buckets
    'N': (0, 1, 2),     # Network: down/up
}
# Total: 42 metric tokens + BOS = 43 vocab
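A minimal sketch of how a raw reading could be mapped onto these buckets; the `quantize` helper and its clamping are assumptions, not KIRI's code:

```python
def quantize(prefix, value, schema):
    # Map a raw reading into one of the schema's buckets, e.g. CPU 47% → 'C4'.
    lo, hi, n_buckets = schema[prefix]
    span = (hi - lo) / n_buckets
    bucket = min(int((value - lo) / span), n_buckets - 1)   # clamp the top edge
    return f"{prefix}{bucket}"

obs = {'C': 47.0, 'M': 52.3, 'D': 41.0, 'S': 12.0, 'L': 3.1, 'N': 1}
print(' '.join(quantize(k, v, schema) for k, v in obs.items()))
# → "C4 M5 D4 S0 L0 N1"
```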
After training, the model has learned "what usually follows what." When a new observation arrives, compute the average negative log-probability across all tokens. High score = the model is surprised = anomaly.
Normal observation (work hours, moderate load):
  C5 M5 D4 S1 L1 N1
  Model: "Yeah, seen this pattern thousands of times."
  Average score: 0.72 (low surprise)

Anomalous observation (3am, everything maxed):
  C9 M9 D4 S4 L4 N1
  Model: "C9?! M9?! S4?! Never seen these together."
  Average score: 6.15 (high surprise)
  Per-token: C9=9.38, M9=12.11, S4=9.19 (near-zero probability)
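A sketch of that scoring rule, the average negative log-probability of each observed token. The `gpt` and `softmax` calls and the KV-cache handling follow the training snippet earlier and are assumptions about the real code:

```python
import math

def surprise(tokens):
    keys, values = [], []                       # fresh KV cache per sequence (shape assumed)
    per_token = []
    for pos_id in range(len(tokens) - 1):
        logits = gpt(tokens[pos_id], pos_id, keys, values)
        probs = softmax(logits)
        p = probs[tokens[pos_id + 1]].data      # probability of what actually came next
        per_token.append(-math.log(p + 1e-12))  # high value = model is surprised
    return sum(per_token) / len(per_token), per_token
```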
| Component | Shape | Params |
|---|---|---|
| wte | 43 × 32 | 1,376 |
| wpe | 16 × 32 | 512 |
| lm_head | 43 × 32 | 1,376 |
| 2 layers × (attn + mlp) | 2 × 12 × 32² | 24,576 |
| TOTAL | | 27,840 |
An atom alone detects patterns. Two atoms piped together make decisions. The output of one feeds the input of the next. Loop it and the system teaches itself.
┌─────────┐      ┌──────────┐      ┌──────────┐      ┌──────────┐
│ COLLECT │─────→│   ATOM   │─────→│  DECIDE  │─────→│   ACT    │
│(metrics)│      │ (predict │      │  (score  │      │  (alert  │
│         │      │ + score) │      │ → action)│      │  or log) │
└─────────┘      └──────────┘      └──────────┘      └──────────┘

Data flows left to right. Each stage is a Python function.
This alone = monitoring + alerting. Already useful.
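A minimal sketch of that linear pipe as plain functions. The names (`collect`, `atom_score`, `decide`, `act`), the threshold, and the `vocab` list are illustrative, not KIRI's API; `surprise` is the scoring sketch from earlier:

```python
# Each stage is just a function; the pipe is ordinary composition.
vocab = [f"{p}{i}" for p, (_, _, n) in schema.items() for i in range(n)]  # 42 metric tokens
BOS = len(vocab)                                  # BOS id for the state atom

def collect():
    return "C5 M5 D4 S1 L1 N1"                    # one tokenized observation

def atom_score(observation):
    tokens = [vocab.index(t) for t in observation.split()]
    avg, _ = surprise([BOS] + tokens)             # reuse the scoring sketch above
    return avg

def decide(score, threshold=3.0):                 # threshold is a made-up number
    return "alert" if score > threshold else "ok"

def act(action, observation):
    if action == "alert":
        print(f"ALERT: unusual state {observation}")

obs = collect()
act(decide(atom_score(obs)), obs)
```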
┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ COLLECT │────→│ ATOM │────→│ DECIDE │────→│ ACT │
└─────────┘ └──────────┘ └──────────┘ └────┬─────┘
▲ │
│ FEEDBACK LOOP │
└──────────────────────────────────────────────────┘
result → next input
Now it learns from its own actions.
You dismissed an alert → feeds back as training data.
Next time → auto-suppresses that pattern.
THE SYSTEM IMPROVES BY RUNNING.
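One hedged sketch of what "feeds back as training data" might mean in practice: append the user's response to a log that the next retraining run reads. The file name and record format here are assumptions:

```python
import json, time

def record_feedback(observation, action, user_response,
                    path="nerve_feedback.jsonl"):   # path is an assumption
    # Dismissed alerts become labeled examples for the next retraining run.
    record = {
        "ts": time.time(),
        "observation": observation,     # e.g. "C5 M5 D4 S1 L1 N1"
        "action": action,               # what the system did: "alert"
        "response": user_response,      # what you did: "dismiss" or "approve"
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

record_feedback("C5 M5 D4 S1 L1 N1", "alert", "dismiss")
```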
Pulse: infra health. Mac Mini stats + MikroTik. "Is this system state normal?"
Rhythm: work patterns. Keyboard/mouse idle time, git activity, focus blocks. "Is this a normal work day?"
Drift: task state. Tasks added/completed, project switches. "Is scope creeping?"
Nerve: meta-model. Trained on the OTHER atoms' outputs + your responses. "What action should I take?"
| Time | What Happens |
|---|---|
| Week 1 | Independent atoms, independent alerts. 3 pings for 1 situation. Annoying, but it's collecting data. |
| Week 3 | Nerve connects patterns. "When Rhythm=no-activity AND Drift=scope-creep → suppress Rhythm, surface Drift." 1 smart alert instead of 3. |
| Month 2 | Cross-domain insights. "Morning habits skipped → afternoon output drops 60%. Nudge at 7am." Predictions based on YOUR data. |
| Month 6 | Partial automation. Auto-firewall rules, auto-invoice reminders. Approve/reject trains the boundary of when to auto-act vs ask. |
# The action vocabulary includes meta-actions:
actions = {
    'ok': do_nothing,
    'alert': send_telegram,
    'suppress': mark_false_alarm,
    'retrain': retrain_atom,      # system retrains itself
    'spawn': create_new_atom,     # system creates new atoms
}
# Nerve generates "retrain:pulse:7d" → Pulse retrains on recent data
# → predictions improve → Nerve's decisions improve → compounds
These are actual outputs from training on a Mac Mini (24GB, Apple Silicon).
loaded 2128 observations from 8 files
atom: 27,840 params | vocab 43
step   1/500 | loss 3.9828   ← random weights, knows nothing
step  50/500 | loss 0.3095   ← learning fast
step 250/500 | loss 0.7080   ← fluctuation normal (batch size 1)
step 500/500 | loss 0.5924   ← converged
saved weights → pulse_weights.json
Normal (work hours, moderate load):   C5 M5 D4 S1 L1 N1
  Average surprise: 0.72
  Per token: all within expected range

Anomalous (3am, everything maxed):    C9 M9 D4 S4 L4 N1
  Average surprise: 7.98
  Per token: C9=9.38  M9=12.11  S4=9.19   ← "never seen this"

Anomalous is 11× more surprising than normal. The model identifies WHICH metrics are unusual and by how much.
$ python3 -m kiri.atoms.pulse.collect --interval 1 --duration 600
collecting 600 observations over 600s (every 1s)
  1/600 | C=15% M=74% D=41%   ← real Mac Mini stats
100/600 | C=15% M=74% D=41%   ← collected via os/subprocess
200/600 | C=14% M=73% D=42%   ← zero dependencies
saved 276 observations across 1 files
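For flavor, a stdlib-only sketch of the kind of collection this implies on macOS. This is not KIRI's collector; the `ps`-based CPU estimate and the rounding are assumptions:

```python
import os, shutil, subprocess

def sample():
    # Load average and disk usage straight from the stdlib.
    load1, _, _ = os.getloadavg()
    disk = shutil.disk_usage('/')
    disk_pct = 100 * disk.used / disk.total

    # Rough CPU%: sum per-process %cpu from `ps` (BSD/macOS syntax).
    # Sum over processes, so this can exceed 100% on multi-core machines.
    out = subprocess.run(['ps', '-A', '-o', '%cpu='],
                         capture_output=True, text=True).stdout
    cpu_pct = sum(float(x) for x in out.split())

    return {'C': cpu_pct, 'D': disk_pct, 'L': load1}

print(sample())   # e.g. {'C': 14.8, 'D': 41.2, 'L': 1.7}
```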
| Metric | Value |
|---|---|
| Model size | 27,840 params (<1MB on disk) |
| Training time (500 steps) | ~8min pure Python, 17.6s PyTorch/MPS (27x faster) |
| Inference time | ~100ms per observation |
| Data collection | 1/sec (fast blast) to 1/5min (steady state) |
| Dependencies | 0. Python 3 stdlib only. |
| API costs | KES 0. Forever. |
| RAM usage | <50MB (Mac Mini has 24GB) |
| Codebase | ~500 lines total across all modules |
What this can and cannot do. No hand-waving.
Can: learn repeating patterns in structured sequences; detect when new observations don't fit learned patterns; get better with more data; run forever on zero resources; compose atoms via pipes for multi-domain awareness.
Cannot: remember across sequences (16-token window only); understand causation (it knows "unusual", not "why"); see slow trends (disk filling over weeks); do multivariate reasoning (it learns sequential patterns, not true correlations); handle natural language; replace a real LLM for complex reasoning.
| Constraint | Impact | Workaround |
|---|---|---|
| Context window: 16 tokens | Can't see patterns spanning hours/days | Encode longer windows as summary tokens (see the sketch after this table), or use a bigger block_size (costs more params). |
| 10% bucket granularity | CPU 41% and 49% are the same token (C4) | More buckets = more vocab = more params. Trade-off is configurable. |
| Pure Python speed | Training is ~27× slower than PyTorch/MPS | Use AtomTorch for fast training (17.6s vs ~8min). Pure Python works everywhere with zero dependencies. |
| No causation | Flags anomaly, can't explain it | The Pipe + your response IS the explanation loop. Over time, Nerve learns cause→effect from YOUR feedback. |
| Sequential token processing | Can't truly correlate CPU↔Memory simultaneously | Learns "C5 usually followed by M5" as sequence pattern. Works in practice, not in theory. |
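To make the first workaround concrete, one entirely illustrative way summary tokens could look: collapse an hour of per-minute CPU buckets into a single token so a 16-token window spans hours instead of seconds. The token naming and aggregation rule are invented for this sketch:

```python
from collections import Counter

def hourly_summary(cpu_buckets):
    # e.g. per-minute CPU buckets [4, 5, 5, 7, 9, 5, ...] → 'HC5'
    # ("hourly CPU spent most of its time in bucket 5")
    most_common, _ = Counter(cpu_buckets).most_common(1)[0]
    return f"HC{most_common}"

print(hourly_summary([4, 5, 5, 7, 9, 5]))   # → "HC5"
```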
Phase 0: Understood microgpt. Every line, every gradient.
Phase 1: Extracted core modules. Pulse atom collecting real Mac Mini data. MikroTik REST API collector. Training works. Anomaly detection works (0.72 normal vs 7.98 anomalous = 11x differentiation).
Phase 2: Rhythm atom. Keyboard/mouse idle time via ioreg HIDIdleTime. Learns work patterns, flags 3am Sunday activity.
Phase 3: Drift atom. Manual CLI task logging. Detects scope creep (4.06 vs 0.47 = 8.6x). 8 tasks added / 0 completed / 5 switches = anomaly.
Phase 4: Nerve meta-model. Trained on other atoms' scores + user feedback. Predicts: ok, alert, suppress, retrain. Action vocabulary with feedback loop.
Phase 5: PyTorch/MPS acceleration. AtomTorch drop-in replacement. 500 steps in 17.6s (27x faster). Same anomaly detection quality.
Phase 6: Full daemon. Scheduled collection, all 4 atoms scoring, Nerve decisions, Telegram alerts.
Run the daemon on a Mac Mini for weeks. Collect real data across all atoms. Retrain on accumulated observations. Let Nerve learn from real approve/dismiss feedback.
Network security atom (MikroTik firewall logs). Financial patterns (M-Pesa/bank transactions). Phone integration (activity patterns). Sleep/energy inference from idle time data. Focus scoring from mouse movement patterns. Auto-retraining triggered by Nerve.