Run KIRI, understand the code line by line, watch it score live metrics, check if the model is actually learning, and see where this architecture goes next.
KIRI's core has zero external dependencies — Python 3 stdlib only. PyTorch is optional, for GPU acceleration.
```
~/Binnode/kiri/
  core/
    value.py       # scalar autograd engine (backpropagation)
    language.py    # quantizes continuous values into token buckets
    atom.py        # decoder-only transformer + Adam optimizer
    atom_torch.py  # PyTorch port (MPS accelerated, 27x faster)
    pipe.py        # linear composition engine
  atoms/
    pulse/         # infrastructure metrics (CPU, mem, disk, load)
    rhythm/        # work patterns (keyboard/mouse idle time)
    drift/         # task patterns (tasks added/completed/switched)
    nerve/         # decision engine (aggregates atom scores)
  data/            # collected JSONL files (one per day)
  server.py        # HTTP API server (port 7745)
  kiri.py          # daemon: collect + score + decide + act
  config.py        # global configuration
```
Before training, you need data. Generate a week of synthetic observations to test with.
```bash
# From ~/Binnode (parent of kiri/):
python3 -m kiri.atoms.pulse.collect --dry-run   # 7 days synthetic infra data
python3 -m kiri.atoms.rhythm.collect --dry-run  # 7 days synthetic work data
python3 -m kiri.atoms.drift.collect --dry-run   # 7 days synthetic task data
python3 -m kiri.atoms.nerve.collect --dry-run   # 7 days synthetic nerve data

# Files land in kiri/data/pulse_2026-02-07.jsonl, etc.
# Each line is one JSON observation: {"C":52,"M":55,"D":40,...}
```
Each atom trains on its own data files. 500 steps takes ~8 minutes in pure Python, ~18 seconds with PyTorch/MPS.
```bash
python3 -m kiri.atoms.pulse.train  --data 'kiri/data/pulse_*.jsonl'  --steps 500 --verbose
python3 -m kiri.atoms.rhythm.train --data 'kiri/data/rhythm_*.jsonl' --steps 500 --verbose
python3 -m kiri.atoms.drift.train  --data 'kiri/data/drift_*.jsonl'  --steps 500 --verbose
python3 -m kiri.atoms.nerve.train  --data 'kiri/data/nerve_*.jsonl'  --steps 500 --verbose

# Weights saved to atoms/pulse/weights/pulse_weights.json, etc.
# --verbose runs anomaly comparison after training
```
Collect real metrics from your machine. Single shot or continuous.
```bash
# Single observation (instant):
python3 -m kiri.atoms.pulse.collect

# Continuous — every 5s for 1 hour:
python3 -m kiri.atoms.pulse.collect --interval 5 --duration 3600

# Blast mode — every 1s for 10 minutes:
python3 -m kiri.atoms.pulse.collect --interval 1 --duration 600
```
The API server serves status, history, training, and collection endpoints.
```bash
# Start server on port 7745:
python3 -m kiri.server --port 7745

# With background collection (every 30s):
python3 -m kiri.server --collect --interval 30

# API endpoints:
# GET  /api/status                      → live collect + score
# GET  /api/history?n=100               → scored historical observations
# POST /api/train?atom=pulse&steps=300  → streaming NDJSON
# GET  /api/collect                     → single collection cycle
```
The full daemon collects from all sources, scores through each atom, and lets Nerve decide what to do.
```bash
python3 -m kiri.kiri
# Runs forever: collect → score → decide → act → loop
# Decisions: ok (log), alert (Telegram), suppress, retrain
```
```bash
python3 -m kiri.atoms.drift.collect --added 3 --completed 1 --switched 2
# Logs: 3 tasks added, 1 completed, 2 project switches
# Drift detects scope creep patterns from these numbers
```
| Operation | Behavior | Safe to repeat? |
|---|---|---|
| collect | Appends to daily JSONL file | Yes — never overwrites |
| train | Overwrites weight files | Yes — weights are replaced atomically |
| score | Read-only forward pass | Yes — pure computation, no side effects |
| dry-run | Overwrites synthetic data files | Yes — same seed produces same data |
```bash
# Create a virtualenv with PyTorch:
python3 -m venv kiri-env
source kiri-env/bin/activate
pip install torch
```

```python
# AtomTorch is a drop-in replacement for Atom:
from kiri.core import AtomTorch  # uses MPS on Apple Silicon

# 500 steps: 17.6s (PyTorch/MPS) vs ~8 min (pure Python) = 27x faster
```
Seven sections walk through the entire architecture. Each section shows code on top, plain English explanation below, and a Canvas visualization where it helps. Use Prev/Next to walk through.
Karpathy's microgpt trains on baby names: 26 lowercase letters plus a BOS token give a vocabulary of 27. KIRI replaces this with state tokens: CPU buckets, memory buckets, load average, etc.
```python
docs = ["emma", "olivia", "ava"]
uchars = sorted(set(''.join(docs)))
# ['a','e','i','l','m','o','v']
BOS = len(uchars)  # 7
vocab_size = 8     # 7 chars + BOS
```

```python
schema = {
    'C': (0, 100, 10),  # CPU:  C0..C9
    'M': (0, 100, 10),  # Mem:  M0..M9
    'D': (0, 100, 10),  # Disk: D0..D9
    'S': (0, 100,  5),  # Swap: S0..S4
    'L': (0,  20,  5),  # Load: L0..L4
    'N': (0,   1,  2),  # Net:  N0 N1
}
# 42 tokens + BOS = 43 vocab
```

The quantization trick: CPU 52% → bucket 5 → token "C5". Memory 73% → bucket 7 → "M7". This keeps the vocabulary small (43 tokens instead of infinite continuous values), so the model stays tiny.
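The bucketing can be sketched in a few lines. This is an illustrative helper, not the actual Language implementation — the function name and exact clamping behavior are assumptions:

```python
def quantize(name, value, schema):
    """Map a continuous metric to its token bucket (illustrative sketch)."""
    lo, hi, n_buckets = schema[name]
    # Clamp into [lo, hi], then split the range into n_buckets equal bins
    clamped = min(max(value, lo), hi)
    bucket = min(int((clamped - lo) / (hi - lo) * n_buckets), n_buckets - 1)
    return f"{name}{bucket}"

schema = {'C': (0, 100, 10), 'M': (0, 100, 10)}
print(quantize('C', 52, schema))  # C5
print(quantize('M', 73, schema))  # M7
```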
Every neural network needs backpropagation. The Value class implements it in ~40 lines. Each Value wraps a scalar number and tracks how it was created, so gradients can flow backwards through any computation.
```python
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                  # the scalar value (a float)
        self.grad = 0                     # d(loss)/d(self) — filled by backward()
        self._children = children         # the Values this one was computed from
        self._local_grads = local_grads   # d(self)/d(child) for each child
```
Every operation creates a new Value that remembers its parents:
```python
def __add__(self, other):
    # d(a+b)/da = 1, d(a+b)/db = 1
    return Value(self.data + other.data, (self, other), (1, 1))

def __mul__(self, other):
    # d(a*b)/da = b, d(a*b)/db = a
    return Value(self.data * other.data, (self, other), (other.data, self.data))

def log(self):
    # d(log(x))/dx = 1/x
    return Value(math.log(self.data), (self,), (1/self.data,))

def relu(self):
    # d(relu(x))/dx = 1 if x > 0 else 0
    return Value(max(0, self.data), (self,), (float(self.data > 0),))
```

backward() walks the computation graph in reverse, applying the chain rule:
```python
def backward(self):
    topo, visited = [], set()
    def build_topo(v):  # DFS post-order traversal
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1  # d(loss)/d(loss) = 1
    for v in reversed(topo):
        for child, lg in zip(v._children, v._local_grads):
            child.grad += lg * v.grad  # THE chain rule
```

The model starts as random numbers. Each weight matrix is initialized with small Gaussian noise (std=0.08). These numbers will be adjusted by training until the model can predict the next token.
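To see the chain rule in action, here is the class assembled end-to-end, condensed to just `__add__` and `__mul__`, differentiating a tiny expression:

```python
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')
    def __init__(self, data, children=(), local_grads=()):
        self.data, self.grad = data, 0
        self._children, self._local_grads = children, local_grads
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))
    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a, b = Value(3.0), Value(4.0)
y = a * b + a          # y = a*b + a  →  dy/da = b + 1 = 5, dy/db = a = 3
y.backward()
print(a.grad, b.grad)  # 5.0 3.0
```

Note that `a` appears twice in the expression, so its gradient accumulates across both paths — exactly why `backward()` uses `+=`.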
```python
class Atom:
    def __init__(self, lang, n_embd=32, n_head=4, n_layer=2, block_size=16):
        # Random matrix: nout rows, nin columns, each Value(gauss(0, 0.08))
        mat = lambda nout, nin, std=0.08: [
            [Value(random.gauss(0, std)) for _ in range(nin)]
            for _ in range(nout)
        ]
        self.sd = {
            'wte': mat(43, 32),      # token embeddings: 43 vocab × 32 dims
            'wpe': mat(16, 32),      # position embeddings: 16 positions × 32 dims
            'lm_head': mat(43, 32),  # output projection: 43 vocab × 32 dims
        }
        # Per layer: 4 attention matrices + 2 MLP matrices
        for i in range(2):  # 2 layers
            self.sd[f'l{i}.wq'] = mat(32, 32)   # query projection
            self.sd[f'l{i}.wk'] = mat(32, 32)   # key projection
            self.sd[f'l{i}.wv'] = mat(32, 32)   # value projection
            self.sd[f'l{i}.wo'] = mat(32, 32)   # output projection
            self.sd[f'l{i}.f1'] = mat(128, 32)  # MLP expand (4×)
            self.sd[f'l{i}.f2'] = mat(32, 128)  # MLP compress
```
| Component | Shape | Params | Purpose |
|---|---|---|---|
| wte | 43 × 32 | 1,376 | Token embeddings |
| wpe | 16 × 32 | 512 | Position embeddings |
| lm_head | 43 × 32 | 1,376 | Output projection |
| attention | 2 layers × 4 × (32×32) | 8,192 | Q, K, V, O per layer |
| MLP | 2 layers × (128×32 + 32×128) | 16,384 | Feed-forward network |
| Total | | 27,840 | |
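The total can be checked with a few lines of arithmetic (a quick sanity check, not part of the codebase):

```python
n_embd, n_layer, vocab, block_size = 32, 2, 43, 16

total = (
    vocab * n_embd                    # wte:     43 × 32 = 1,376
    + block_size * n_embd             # wpe:     16 × 32 =   512
    + vocab * n_embd                  # lm_head: 43 × 32 = 1,376
    + n_layer * 4 * n_embd**2         # wq,wk,wv,wo: 2 × 4 × 1,024 = 8,192
    + n_layer * 2 * (4 * n_embd**2)   # f1,f2: 2 × (128×32 + 32×128) = 16,384
)
print(total)  # 27840
```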
Given a token, predict the probability distribution over the next token. This is the core computation that both training and scoring use.
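The forward pass leans on three small helpers: `_rmsnorm`, `_linear`, and `_softmax`. Here are minimal sketches of what they compute, written for plain floats — the real versions operate on `Value` objects so gradients flow, but the math is the same:

```python
import math

def rmsnorm(x, eps=1e-5):
    # RMSNorm: scale x by 1 / sqrt(mean(x^2))
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi * (ms + eps) ** -0.5 for xi in x]

def linear(x, w):
    # Matrix-vector product: w has nout rows of nin weights each
    return [sum(row[j] * x[j] for j in range(len(x))) for row in w]

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax([0.0, 0.0]))               # [0.5, 0.5]
print(linear([1.0, 2.0], [[3.0, 4.0]]))  # [11.0]
```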
```python
def forward(self, token_id, pos_id, keys, values):
    sd, hd, nh = self.sd, self.head_dim, self.n_head
    # 1. Look up token embedding + position embedding
    x = [t + p for t, p in zip(sd['wte'][token_id], sd['wpe'][pos_id])]
    x = self._rmsnorm(x)  # normalize: x / sqrt(mean(x^2))
    # 2. For each transformer layer:
    for li in range(self.n_layer):
        xr = x  # save for residual connection
        # Multi-head attention
        x = self._rmsnorm(x)
        q = self._linear(x, sd[f'l{li}.wq'])  # query: "what am I looking for?"
        k = self._linear(x, sd[f'l{li}.wk'])  # key: "what do I contain?"
        v = self._linear(x, sd[f'l{li}.wv'])  # value: "what info do I carry?"
        # KV cache: remember keys/values from all positions
        keys[li].append(k)
        values[li].append(v)
        # 4-head attention: each head sees 8 dims (32/4)
        xa = []
        for h in range(nh):
            # Slice this head's dims out of q and the cached k/v
            qh = q[h*hd:(h+1)*hd]
            kh = [kt[h*hd:(h+1)*hd] for kt in keys[li]]
            vh = [vt[h*hd:(h+1)*hd] for vt in values[li]]
            # Attention scores: Q · K^T / sqrt(head_dim)
            al = [sum(qh[j] * kh[t][j] for j in range(hd)) / hd**0.5
                  for t in range(len(kh))]
            aw = self._softmax(al)  # attention weights
            # Weighted sum of values
            ho = [sum(aw[t] * vh[t][j] for t in range(len(vh)))
                  for j in range(hd)]
            xa.extend(ho)
        x = self._linear(xa, sd[f'l{li}.wo'])
        x = [a + b for a, b in zip(x, xr)]  # + residual
        # MLP: expand 4x, ReLU, compress back
        xr = x
        x = self._rmsnorm(x)
        x = self._linear(x, sd[f'l{li}.f1'])  # 32 → 128
        x = [xi.relu() for xi in x]           # non-linearity
        x = self._linear(x, sd[f'l{li}.f2'])  # 128 → 32
        x = [a + b for a, b in zip(x, xr)]    # + residual
    # 3. Project to vocabulary logits
    return self._linear(x, sd['lm_head'])  # 32 → 43 logits
```

For each token in a sequence, predict the next one. The loss is the average negative log-probability of the correct answers. Backpropagate. Update weights with Adam.
```python
def train_step(self, token_sequence, lr=0.01):
    n = min(self.block_size, len(token_sequence) - 1)
    keys = [[] for _ in range(self.n_layer)]
    vals = [[] for _ in range(self.n_layer)]
    losses = []
    # For each position, predict the next token
    for pos in range(n):
        logits = self.forward(token_sequence[pos], pos, keys, vals)
        probs = self._softmax(logits)
        target = token_sequence[pos + 1]
        losses.append(-probs[target].log())  # cross-entropy
    # Average loss across all positions
    loss = (1 / n) * sum(losses)
    loss.backward()  # traces through ENTIRE computation graph
    # Adam optimizer with linear LR decay
    self.step_count += 1
    lr_t = lr * (1 - self.step_count / 10000)  # decay
    lr_t = max(lr_t, lr * 0.1)                 # floor at 10%
    b1, b2, eps = 0.85, 0.99, 1e-8
    for i, p in enumerate(self.params):
        # Momentum + adaptive learning rate
        self.m[i] = b1 * self.m[i] + (1 - b1) * p.grad
        self.v[i] = b2 * self.v[i] + (1 - b2) * p.grad**2
        mh = self.m[i] / (1 - b1**self.step_count)  # bias correction
        vh = self.v[i] / (1 - b2**self.step_count)
        p.data -= lr_t * mh / (vh**0.5 + eps)
        p.grad = 0  # reset for next step
```

After training, the model has learned "what usually follows what." To detect anomalies, feed in a new observation and measure how surprised the model is.
```python
def sequence_anomaly_score(atom, token_sequence):
    n = min(atom.block_size, len(token_sequence) - 1)
    keys = [[] for _ in range(atom.n_layer)]
    vals = [[] for _ in range(atom.n_layer)]
    per_token = []
    for pos in range(n):
        logits = atom.forward(token_sequence[pos], pos, keys, vals)
        probs = atom._softmax(logits)
        target = token_sequence[pos + 1]
        target_prob = probs[target].data
        score = -math.log(max(target_prob, 1e-10))  # surprise!
        # token_name: the human-readable label for target, e.g. "C9"
        per_token.append((token_name, score, target_prob))
    avg = sum(s for _, s, _ in per_token) / len(per_token)
    return avg, per_token
```

```
C5 M5 D4 S1 L1 N1   Model: "Seen this 1000 times"   Score: 0.72 (low surprise)
C9 M9 D4 S4 L4 N1   Model: "C9?! M9?! Never!"       Score: 7.98 (high surprise)
                    Per-token: C9=9.38  M9=12.11
```
Karpathy's microgpt and KIRI's atoms are the exact same architecture. The only difference is what the tokens mean.
| | microgpt (Karpathy) | KIRI Pulse Atom |
|---|---|---|
| Vocabulary | 26 letters + BOS = 27 | 42 state tokens + BOS = 43 |
| Training data | "emma" "olivia" "ava" | "C5 M5 D4 S1 L1 N1" |
| Prediction | Next character | Next metric value |
| Use case | Generate baby names | Detect anomalies |
| Params | 4,192 | 27,840 |
| Autograd | Same Value class | Same Value class |
| Attention | Same multi-head | Same multi-head |
| Training | Same Adam + cross-entropy | Same Adam + cross-entropy |
This is what makes atoms composable. Every atom speaks the same "language of sequences" internally. The Pipe connects them: output of one atom's prediction feeds into the next atom's input. Nerve sits on top and learns which combinations of atom scores should trigger which actions.
Connects to the KIRI server at localhost:7745. Shows real-time metrics, anomaly scores, and history. Retrain models directly from this page.
```bash
cd ~/Binnode
python3 -m kiri.server --collect --interval 30
```
Three signals tell you whether training worked.
Random weights start at loss ~3.76 (that's -log(1/43) — uniform distribution over 43 tokens). If training is working, loss should drop below 1.0 within 200-300 steps.
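You can verify that starting point directly:

```python
import math

# Expected loss when all 43 tokens are equally likely:
print(-math.log(1 / 43))  # ≈ 3.761
```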
```
step   1/500 | loss 3.98   ← random, knows nothing
step  50/500 | loss 0.31   ← learning fast
step 250/500 | loss 0.71   ← fluctuation normal (batch size 1)
step 500/500 | loss 0.59   ← converged
```
If loss stays above 3.0 after 100 steps → something is wrong (bad data, wrong schema, corrupt weights).
If loss stays between 1.0-2.0 → learning but needs more steps or more data.
If loss drops below 0.5 → well trained, or possibly overfitting (check signal 2).
The model must give low scores to normal patterns and high scores to unusual ones. Run the anomaly comparison:
```bash
python3 -m kiri.atoms.pulse.train --data 'kiri/data/pulse_*.jsonl' --steps 500 --verbose
```

```
# Expected output:
normal (moderate load):
  state: <BOS> C5 M5 D4 S1 L1 N1
  avg score: 0.72
anomalous (maxed out):
  state: <BOS> C9 M9 D4 S4 L4 N1
  avg score: 7.98
model correctly finds anomaly more surprising (7.98 > 0.72)
```
Good: anomaly score is 5×+ higher than normal.
Bad: scores are similar, or normal is higher than anomaly. Needs more data or more training steps.
Check which specific tokens have high surprise. They should be the ones that are actually unusual.
```
# Per-token breakdown from anomalous observation:
C9  score=9.38   prob=0.000  ← CPU 95%: very unusual
M9  score=12.11  prob=0.000  ← Memory 90%: very unusual
D4  score=0.15   prob=0.858  ← Disk 40%: normal
S4  score=9.19   prob=0.000  ← Swap 80%: very unusual
L4  score=6.90   prob=0.001  ← Load 18: very unusual
N1  score=0.14   prob=0.870  ← Network up: normal
```
C9, M9, S4, L4 have high scores because the model rarely saw them during training. D4 and N1 are fine — disk at 40% and network up are common. The model tells you exactly what's wrong.
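The per-token scores are just negative log-probabilities, so you can check them against the printed probs:

```python
import math

# score = -ln(prob): consistent with the per-token breakdown above
print(round(-math.log(0.858), 2))  # 0.15 — D4's score
print(round(-math.log(0.870), 2))  # 0.14 — N1's score
print(round(-math.log(1e-10), 2))  # 23.03 — the clamp ceiling when prob rounds to 0
```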
| Problem | Symptom | Fix |
|---|---|---|
| Underfitting | All scores are high (2-4), even for normal data | More training steps (1000+) or more data |
| Overfitting | Training loss near 0 but anomaly detection poor | More data variety, fewer steps, or larger model |
| Inverted scores | Normal scores higher than anomalous | Data is too uniform; model learned wrong patterns |
| Flat scores | Everything scores ~2.0 regardless of input | Check schema matches data; tokens may be misconfigured |
Each weight file is a JSON dictionary of 2D arrays. The matrices encode everything the model learned:
```
# peek inside pulse_weights.json:
{
  "config": {"n_embd": 32, "n_head": 4, "n_layer": 2, "block_size": 16},
  "step_count": 500,
  "weights": {
    "wte":   [[0.042, -0.11, ...], ...],  # 43×32: what each token "means"
    "wpe":   [[0.08, 0.03, ...], ...],    # 16×32: what each position "means"
    "l0.wq": [[...], ...],                # 32×32: what to look for
    ...
  },
  "adam_m": [...],  # momentum state (27,840 values)
  "adam_v": [...]   # variance state (27,840 values)
}
```
wte rows that are similar = tokens that behave similarly. If wte[C5] and wte[C6] are close, the model treats CPU 50% and 60% as interchangeable (which is correct). If wte[C9] is far from everything else, the model considers CPU 95% as unusual.
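A quick way to probe this is cosine similarity between `wte` rows. This is a hedged sketch: loading the real file and mapping token names like "C5" to row indices depends on the Language class's token ordering, so the toy vectors below stand in for actual rows:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding rows."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# With real weights you would first load them, e.g. (paths/indices assumed):
# wte = json.load(open('atoms/pulse/weights/pulse_weights.json'))['weights']['wte']
# cosine(wte[idx_C5], wte[idx_C6])  # near 1.0 if C5 and C6 behave alike

# Toy demonstration:
print(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 1.0 — identical directions
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 — unrelated
```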
```bash
# Count training data:
wc -l kiri/data/pulse_*.jsonl

# Peek at raw observations:
head -3 kiri/data/pulse_2026-02-14.jsonl

# Check weight file size:
ls -lh atoms/pulse/weights/pulse_weights.json

# Quick score check via API:
curl -s localhost:7745/api/status | python3 -m json.tool

# History check:
curl -s 'localhost:7745/api/history?n=5&atom=pulse' | python3 -m json.tool
```
What becomes possible when tiny language models can be trained on any sequential data, deployed anywhere, and looped into self-improving systems.
Rhythm already tracks keyboard/mouse idle time. Patterns in idle data correlate with sleep quality and energy levels. A 3-hour idle block starting at 11pm followed by activity at 6am = 7 hours sleep. The model learns YOUR patterns and flags deviations — no wearable needed.
Drift tracks tasks added, completed, and project switches. The ratio tells the story: 8 added / 0 completed / 5 switches = high drift score. Over time, the model learns your normal task rhythm and flags days when you're overcommitting. Early warning for burnout.
Mouse movement patterns during work blocks. High activity with few switches = deep focus. Erratic movement with frequent switches = scattered. An atom trained on movement sequences could score focus quality in real time.
When Nerve has a month of cross-atom data: "When morning Rhythm is off-pattern AND Drift shows high task switching, afternoon Pulse anomalies are 3× more likely." The model can predict infrastructure stress from behavioral data, hours in advance.
The AtomTorch module already runs on Apple Silicon's MPS GPU. Training at 27K params takes 17.6 seconds. Scale up:
| Params | n_embd | n_layer | Est. Training (500 steps) | What It Enables |
|---|---|---|---|---|
| 27,840 | 32 | 2 | 17.6s | Current: 6-token state sequences |
| ~100K | 64 | 3 | ~1 min | Longer sequences, finer buckets |
| ~500K | 128 | 4 | ~5 min | Cross-domain correlation within one atom |
| ~1M | 256 | 4 | ~15 min | Complex temporal patterns, hour-scale context |
| ~10M | 512 | 6 | ~2 hours | Day-scale patterns, multi-system awareness |
Same architecture. Same Value class, same attention, same training loop. Just bigger matrices. The code doesn't change — only the hyperparameters.
| Timeline | What Nerve Learns |
|---|---|
| Week 1 | Independent atom alerts. 3 pings for 1 situation. Annoying but data is collecting. Nerve has no training data yet. |
| Month 1 | Nerve has ~200 approve/dismiss decisions. Learns: "Pulse anomaly during Rhythm-active = real alert. Pulse anomaly during Rhythm-idle = probably a background job, suppress." False alarm rate drops 60%. |
| Month 3 | Cross-domain patterns emerge. "Monday morning Drift-spike predicts Wednesday Pulse-anomaly." Nerve starts making predictions, not just reactions. You get warnings before problems manifest. |
Anyone with a computer and Python 3 can train an atom on their own data. No cloud. No API keys. No cost. The model runs locally, learns locally, and stays local. Some ideas:
Power consumption patterns. Water usage anomaly detection. Temperature/humidity tracking. Appliance health monitoring. Garden soil moisture sequences.
Motor vibration patterns. Tool usage sequences. Production line timing anomalies. Quality metrics trending. Equipment bearing wear prediction.
Soil sensor sequences. Pump current and vibration. Irrigation flow rates. Weather pattern correlation. Crop growth stage tracking.
Water pressure and flow monitoring. Solar inverter output patterns. Network uptime tracking. Shared resource utilization. Environmental monitoring stations.