Run KIRI, understand the code line by line, watch it score live metrics, check if the model is actually learning, and see where this architecture goes next.
KIRI's core has zero external dependencies — Python 3 stdlib only. PyTorch is optional, for GPU acceleration.
```
~/Binnode/kiri/
  core/
    value.py       # scalar autograd engine (backpropagation)
    language.py    # quantizes continuous values into token buckets
    atom.py        # decoder-only transformer + Adam optimizer
    atom_torch.py  # PyTorch port (MPS accelerated, 27x faster)
    pipe.py        # linear composition engine
  atoms/
    pulse/         # infrastructure metrics (CPU, mem, disk, load)
    rhythm/        # work patterns (keyboard/mouse idle time)
    drift/         # task patterns (tasks added/completed/switched)
    nerve/         # decision engine (aggregates atom scores)
  data/            # collected JSONL files (one per day)
  server.py        # HTTP API server (port 7745)
  kiri.py          # daemon: collect + score + decide + act
  config.py        # global configuration
```
Before training, you need data. Generate a week of synthetic observations to test with.
```bash
# From ~/Binnode (parent of kiri/):
python3 -m kiri.atoms.pulse.collect --dry-run   # 7 days synthetic infra data
python3 -m kiri.atoms.rhythm.collect --dry-run  # 7 days synthetic work data
python3 -m kiri.atoms.drift.collect --dry-run   # 7 days synthetic task data
python3 -m kiri.atoms.nerve.collect --dry-run   # 7 days synthetic nerve data

# Files land in kiri/data/pulse_2026-02-07.jsonl, etc.
# Each line is one JSON observation: {"C":52,"M":55,"D":40,...}
```
Each atom trains on its own data files. 500 steps takes ~8 minutes in pure Python, ~18 seconds with PyTorch/MPS.
```bash
python3 -m kiri.atoms.pulse.train  --data 'kiri/data/pulse_*.jsonl'  --steps 500 --verbose
python3 -m kiri.atoms.rhythm.train --data 'kiri/data/rhythm_*.jsonl' --steps 500 --verbose
python3 -m kiri.atoms.drift.train  --data 'kiri/data/drift_*.jsonl'  --steps 500 --verbose
python3 -m kiri.atoms.nerve.train  --data 'kiri/data/nerve_*.jsonl'  --steps 500 --verbose

# Weights saved to atoms/pulse/weights/pulse_weights.json, etc.
# --verbose runs anomaly comparison after training
```
Collect real metrics from your machine. Single shot or continuous.
```bash
# Single observation (instant):
python3 -m kiri.atoms.pulse.collect

# Continuous — every 5s for 1 hour:
python3 -m kiri.atoms.pulse.collect --interval 5 --duration 3600

# Blast mode — every 1s for 10 minutes:
python3 -m kiri.atoms.pulse.collect --interval 1 --duration 600
```
The API server serves status, history, training, and collection endpoints.
```bash
# Start server on port 7745:
python3 -m kiri.server --port 7745

# With background collection (every 30s):
python3 -m kiri.server --collect --interval 30

# API endpoints:
# GET  /api/status                      → live collect + score
# GET  /api/history?n=100               → scored historical observations
# POST /api/train?atom=pulse&steps=300  → streaming NDJSON
# GET  /api/collect                     → single collection cycle
```
The full daemon collects from all sources, scores through each atom, and lets Nerve decide what to do.
```bash
python3 -m kiri.kiri
# Runs forever: collect → score → decide → act → loop
# Decisions: ok (log), alert (Telegram), suppress, retrain
```
```bash
python3 -m kiri.atoms.drift.collect --added 3 --completed 1 --switched 2
# Logs: 3 tasks added, 1 completed, 2 project switches
# Drift detects scope creep patterns from these numbers
```
| Operation | Behavior | Safe to repeat? |
|---|---|---|
| collect | Appends to daily JSONL file | Yes — never overwrites |
| train | Overwrites weight files | Yes — weights are replaced atomically |
| score | Read-only forward pass | Yes — pure computation, no side effects |
| dry-run | Overwrites synthetic data files | Yes — same seed produces same data |
```bash
# Create a virtualenv with PyTorch:
python3 -m venv kiri-env
source kiri-env/bin/activate
pip install torch
```

```python
# AtomTorch is a drop-in replacement for Atom:
from kiri.core import AtomTorch  # uses MPS on Apple Silicon

# 500 steps: 17.6s (PyTorch/MPS) vs ~8 min (pure Python) = 27x faster
```
Seven sections walk through the entire architecture. Each section shows code on top, plain English explanation below, and a Canvas visualization where it helps. Use Prev/Next to walk through.
Karpathy's microgpt trains on baby names: 26 lowercase letters plus a BOS token give a vocabulary of 27. KIRI replaces this with state tokens: CPU buckets, memory buckets, load average, etc.
```python
docs = ["emma", "olivia", "ava"]
uchars = sorted(set(''.join(docs)))
# ['a','e','i','l','m','o','v']
BOS = len(uchars)  # 7
vocab_size = 8     # 7 chars + BOS
```

```python
schema = {
    'C': (0, 100, 10),  # CPU:  C0..C9
    'M': (0, 100, 10),  # Mem:  M0..M9
    'D': (0, 100, 10),  # Disk: D0..D9
    'S': (0, 100,  5),  # Swap: S0..S4
    'L': (0,  20,  5),  # Load: L0..L4
    'N': (0,   1,  2),  # Net:  N0 N1
}
# 42 tokens + BOS = 43 vocab
```

The quantization trick: CPU 52% → bucket 5 → token "C5". Memory 73% → bucket 7 → "M7". This keeps the vocabulary small (43 tokens instead of infinite continuous values), so the model stays tiny.
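The bucketing can be sketched in a few lines. This is an illustrative helper, not the actual Language implementation — the function name and exact clamping behavior are assumptions:

```python
def quantize(name, value, schema):
    """Map a continuous metric to its token bucket (illustrative sketch)."""
    lo, hi, n_buckets = schema[name]
    # Clamp into [lo, hi], then split the range into n_buckets equal bins
    clamped = min(max(value, lo), hi)
    bucket = min(int((clamped - lo) / (hi - lo) * n_buckets), n_buckets - 1)
    return f"{name}{bucket}"

schema = {'C': (0, 100, 10), 'M': (0, 100, 10)}
print(quantize('C', 52, schema))  # C5
print(quantize('M', 73, schema))  # M7
```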
Every neural network needs backpropagation. The Value class implements it in ~40 lines. Each Value wraps a scalar number and tracks how it was created, so gradients can flow backwards through any computation.
```python
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                  # the scalar value (a float)
        self.grad = 0                     # d(loss)/d(self) — filled by backward()
        self._children = children         # the Values this one was computed from
        self._local_grads = local_grads   # d(self)/d(child) for each child
```
Every operation creates a new Value that remembers its parents:
```python
def __add__(self, other):
    # d(a+b)/da = 1, d(a+b)/db = 1
    return Value(self.data + other.data, (self, other), (1, 1))

def __mul__(self, other):
    # d(a*b)/da = b, d(a*b)/db = a
    return Value(self.data * other.data, (self, other), (other.data, self.data))

def log(self):
    # d(log(x))/dx = 1/x
    return Value(math.log(self.data), (self,), (1/self.data,))

def relu(self):
    # d(relu(x))/dx = 1 if x > 0 else 0
    return Value(max(0, self.data), (self,), (float(self.data > 0),))
```

backward() walks the computation graph in reverse, applying the chain rule:
```python
def backward(self):
    topo, visited = [], set()
    def build_topo(v):  # DFS post-order traversal
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1  # d(loss)/d(loss) = 1
    for v in reversed(topo):
        for child, lg in zip(v._children, v._local_grads):
            child.grad += lg * v.grad  # THE chain rule
```

The model starts as random numbers. Each weight matrix is initialized with small Gaussian noise (std=0.08). These numbers will be adjusted by training until the model can predict the next token.
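To see the chain rule in action, here is the class assembled end-to-end, condensed to just `__add__` and `__mul__`, differentiating a tiny expression:

```python
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')
    def __init__(self, data, children=(), local_grads=()):
        self.data, self.grad = data, 0
        self._children, self._local_grads = children, local_grads
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))
    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a, b = Value(3.0), Value(4.0)
y = a * b + a          # y = a*b + a  →  dy/da = b + 1 = 5, dy/db = a = 3
y.backward()
print(a.grad, b.grad)  # 5.0 3.0
```

Note that `a` appears twice in the expression, so its gradient accumulates across both paths — exactly why `backward()` uses `+=`.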
```python
class Atom:
    def __init__(self, lang, n_embd=32, n_head=4, n_layer=2, block_size=16):
        # Random matrix: nout rows, nin columns, each Value(gauss(0, 0.08))
        mat = lambda nout, nin, std=0.08: [
            [Value(random.gauss(0, std)) for _ in range(nin)]
            for _ in range(nout)
        ]
        self.sd = {
            'wte': mat(43, 32),      # token embeddings: 43 vocab × 32 dims
            'wpe': mat(16, 32),      # position embeddings: 16 positions × 32 dims
            'lm_head': mat(43, 32),  # output projection: 43 vocab × 32 dims
        }
        # Per layer: 4 attention matrices + 2 MLP matrices
        for i in range(2):  # 2 layers
            self.sd[f'l{i}.wq'] = mat(32, 32)   # query projection
            self.sd[f'l{i}.wk'] = mat(32, 32)   # key projection
            self.sd[f'l{i}.wv'] = mat(32, 32)   # value projection
            self.sd[f'l{i}.wo'] = mat(32, 32)   # output projection
            self.sd[f'l{i}.f1'] = mat(128, 32)  # MLP expand (4×)
            self.sd[f'l{i}.f2'] = mat(32, 128)  # MLP compress
```
| Component | Shape | Params | Purpose |
|---|---|---|---|
| wte | 43 × 32 | 1,376 | Token embeddings |
| wpe | 16 × 32 | 512 | Position embeddings |
| lm_head | 43 × 32 | 1,376 | Output projection |
| attention | 2 layers × 4 × (32×32) | 8,192 | Q, K, V, O per layer |
| MLP | 2 layers × (128×32 + 32×128) | 16,384 | Feed-forward network |
| Total | | 27,840 | |
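The total can be checked with a few lines of arithmetic (a quick sanity check, not part of the codebase):

```python
n_embd, n_layer, vocab, block_size = 32, 2, 43, 16

total = (
    vocab * n_embd                    # wte:     43 × 32 = 1,376
    + block_size * n_embd             # wpe:     16 × 32 =   512
    + vocab * n_embd                  # lm_head: 43 × 32 = 1,376
    + n_layer * 4 * n_embd**2         # wq,wk,wv,wo: 2 × 4 × 1,024 = 8,192
    + n_layer * 2 * (4 * n_embd**2)   # f1,f2: 2 × (128×32 + 32×128) = 16,384
)
print(total)  # 27840
```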
Given a token, predict the probability distribution over the next token. This is the core computation that both training and scoring use.
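The forward pass leans on three small helpers: `_rmsnorm`, `_linear`, and `_softmax`. Here are minimal sketches of what they compute, written for plain floats — the real versions operate on `Value` objects so gradients flow, but the math is the same:

```python
import math

def rmsnorm(x, eps=1e-5):
    # RMSNorm: scale x by 1 / sqrt(mean(x^2))
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi * (ms + eps) ** -0.5 for xi in x]

def linear(x, w):
    # Matrix-vector product: w has nout rows of nin weights each
    return [sum(row[j] * x[j] for j in range(len(x))) for row in w]

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax([0.0, 0.0]))               # [0.5, 0.5]
print(linear([1.0, 2.0], [[3.0, 4.0]]))  # [11.0]
```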
```python
def forward(self, token_id, pos_id, keys, values):
    sd, hd, nh = self.sd, self.head_dim, self.n_head
    # 1. Look up token embedding + position embedding
    x = [t + p for t, p in zip(sd['wte'][token_id], sd['wpe'][pos_id])]
    x = self._rmsnorm(x)  # normalize: x / sqrt(mean(x^2))
    # 2. For each transformer layer:
    for li in range(self.n_layer):
        xr = x  # save for residual connection
        # Multi-head attention
        x = self._rmsnorm(x)
        q = self._linear(x, sd[f'l{li}.wq'])  # query: "what am I looking for?"
        k = self._linear(x, sd[f'l{li}.wk'])  # key: "what do I contain?"
        v = self._linear(x, sd[f'l{li}.wv'])  # value: "what info do I carry?"
        # KV cache: remember keys/values from all positions
        keys[li].append(k)
        values[li].append(v)
        # 4-head attention: each head sees 8 dims (32/4)
        xa = []
        for h in range(nh):
            # Slice this head's dims out of q and the cached k/v
            qh = q[h*hd:(h+1)*hd]
            kh = [kt[h*hd:(h+1)*hd] for kt in keys[li]]
            vh = [vt[h*hd:(h+1)*hd] for vt in values[li]]
            # Attention scores: Q · K^T / sqrt(head_dim)
            al = [sum(qh[j] * kh[t][j] for j in range(hd)) / hd**0.5
                  for t in range(len(kh))]
            aw = self._softmax(al)  # attention weights
            # Weighted sum of values
            ho = [sum(aw[t] * vh[t][j] for t in range(len(vh)))
                  for j in range(hd)]
            xa.extend(ho)
        x = self._linear(xa, sd[f'l{li}.wo'])
        x = [a + b for a, b in zip(x, xr)]  # + residual
        # MLP: expand 4x, ReLU, compress back
        xr = x
        x = self._rmsnorm(x)
        x = self._linear(x, sd[f'l{li}.f1'])  # 32 → 128
        x = [xi.relu() for xi in x]           # non-linearity
        x = self._linear(x, sd[f'l{li}.f2'])  # 128 → 32
        x = [a + b for a, b in zip(x, xr)]    # + residual
    # 3. Project to vocabulary logits
    return self._linear(x, sd['lm_head'])  # 32 → 43 logits
```

For each token in a sequence, predict the next one. The loss is the average negative log-probability of the correct answers. Backpropagate. Update weights with Adam.
```python
def train_step(self, token_sequence, lr=0.01):
    n = min(self.block_size, len(token_sequence) - 1)
    keys = [[] for _ in range(self.n_layer)]
    vals = [[] for _ in range(self.n_layer)]
    losses = []
    # For each position, predict the next token
    for pos in range(n):
        logits = self.forward(token_sequence[pos], pos, keys, vals)
        probs = self._softmax(logits)
        target = token_sequence[pos + 1]
        losses.append(-probs[target].log())  # cross-entropy
    # Average loss across all positions
    loss = (1 / n) * sum(losses)
    loss.backward()  # traces through ENTIRE computation graph
    # Adam optimizer with linear LR decay
    self.step_count += 1
    lr_t = lr * (1 - self.step_count / 10000)  # decay
    lr_t = max(lr_t, lr * 0.1)                 # floor at 10%
    b1, b2, eps = 0.85, 0.99, 1e-8
    for i, p in enumerate(self.params):
        # Momentum + adaptive learning rate
        self.m[i] = b1 * self.m[i] + (1 - b1) * p.grad
        self.v[i] = b2 * self.v[i] + (1 - b2) * p.grad**2
        mh = self.m[i] / (1 - b1**self.step_count)  # bias correction
        vh = self.v[i] / (1 - b2**self.step_count)
        p.data -= lr_t * mh / (vh**0.5 + eps)
        p.grad = 0  # reset for next step
```

After training, the model has learned "what usually follows what." To detect anomalies, feed in a new observation and measure how surprised the model is.
```python
def sequence_anomaly_score(atom, token_sequence):
    n = min(atom.block_size, len(token_sequence) - 1)
    keys = [[] for _ in range(atom.n_layer)]
    vals = [[] for _ in range(atom.n_layer)]
    per_token = []
    for pos in range(n):
        logits = atom.forward(token_sequence[pos], pos, keys, vals)
        probs = atom._softmax(logits)
        target = token_sequence[pos + 1]
        target_prob = probs[target].data
        score = -math.log(max(target_prob, 1e-10))  # surprise!
        # token_name: the human-readable label for target, e.g. "C9"
        per_token.append((token_name, score, target_prob))
    avg = sum(s for _, s, _ in per_token) / len(per_token)
    return avg, per_token
```

```
C5 M5 D4 S1 L1 N1   Model: "Seen this 1000 times"   Score: 0.72 (low surprise)
C9 M9 D4 S4 L4 N1   Model: "C9?! M9?! Never!"       Score: 7.98 (high surprise)
                    Per-token: C9=9.38  M9=12.11
```
Karpathy's microgpt and KIRI's atoms are the exact same architecture. The only difference is what the tokens mean.
| | microgpt (Karpathy) | KIRI Pulse Atom |
|---|---|---|
| Vocabulary | 26 letters + BOS = 27 | 42 state tokens + BOS = 43 |
| Training data | "emma" "olivia" "ava" | "C5 M5 D4 S1 L1 N1" |
| Prediction | Next character | Next metric value |
| Use case | Generate baby names | Detect anomalies |
| Params | 4,192 | 27,840 |
| Autograd | Same Value class | Same Value class |
| Attention | Same multi-head | Same multi-head |
| Training | Same Adam + cross-entropy | Same Adam + cross-entropy |
This is what makes atoms composable. Every atom speaks the same "language of sequences" internally. The Pipe connects them: output of one atom's prediction feeds into the next atom's input. Nerve sits on top and learns which combinations of atom scores should trigger which actions.
Connects to the KIRI server at localhost:7745. Shows real-time metrics, anomaly scores, and history. Retrain models directly from this page.
```bash
cd ~/Binnode
python3 -m kiri.server --collect --interval 30
```
Three signals tell you whether training worked.
Random weights start at loss ~3.76 (that's -log(1/43) — uniform distribution over 43 tokens). If training is working, loss should drop below 1.0 within 200-300 steps.
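You can verify that starting point directly:

```python
import math

# Expected loss when all 43 tokens are equally likely:
print(-math.log(1 / 43))  # ≈ 3.761
```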
```
step   1/500 | loss 3.98   ← random, knows nothing
step  50/500 | loss 0.31   ← learning fast
step 250/500 | loss 0.71   ← fluctuation normal (batch size 1)
step 500/500 | loss 0.59   ← converged
```
If loss stays above 3.0 after 100 steps → something is wrong (bad data, wrong schema, corrupt weights).
If loss stays between 1.0-2.0 → learning but needs more steps or more data.
If loss drops below 0.5 → well trained, or possibly overfitting (check signal 2).
The model must give low scores to normal patterns and high scores to unusual ones. Run the anomaly comparison:
```bash
python3 -m kiri.atoms.pulse.train --data 'kiri/data/pulse_*.jsonl' --steps 500 --verbose
```

```
# Expected output:
normal (moderate load):
  state: <BOS> C5 M5 D4 S1 L1 N1
  avg score: 0.72
anomalous (maxed out):
  state: <BOS> C9 M9 D4 S4 L4 N1
  avg score: 7.98
model correctly finds anomaly more surprising (7.98 > 0.72)
```
Good: anomaly score is 5×+ higher than normal.
Bad: scores are similar, or normal is higher than anomaly. Needs more data or more training steps.
Check which specific tokens have high surprise. They should be the ones that are actually unusual.
```
# Per-token breakdown from anomalous observation:
C9  score=9.38   prob=0.000  ← CPU 95%: very unusual
M9  score=12.11  prob=0.000  ← Memory 90%: very unusual
D4  score=0.15   prob=0.858  ← Disk 40%: normal
S4  score=9.19   prob=0.000  ← Swap 80%: very unusual
L4  score=6.90   prob=0.001  ← Load 18: very unusual
N1  score=0.14   prob=0.870  ← Network up: normal
```
C9, M9, S4, L4 have high scores because the model rarely saw them during training. D4 and N1 are fine — disk at 40% and network up are common. The model tells you exactly what's wrong.
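The per-token scores are just negative log-probabilities, so you can check them against the printed probs:

```python
import math

# score = -ln(prob): consistent with the per-token breakdown above
print(round(-math.log(0.858), 2))  # 0.15 — D4's score
print(round(-math.log(0.870), 2))  # 0.14 — N1's score
print(round(-math.log(1e-10), 2))  # 23.03 — the clamp ceiling when prob rounds to 0
```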
| Problem | Symptom | Fix |
|---|---|---|
| Underfitting | All scores are high (2-4), even for normal data | More training steps (1000+) or more data |
| Overfitting | Training loss near 0 but anomaly detection poor | More data variety, fewer steps, or larger model |
| Inverted scores | Normal scores higher than anomalous | Data is too uniform; model learned wrong patterns |
| Flat scores | Everything scores ~2.0 regardless of input | Check schema matches data; tokens may be misconfigured |
Each weight file is a JSON dictionary of 2D arrays. The matrices encode everything the model learned:
```
# peek inside pulse_weights.json:
{
  "config": {"n_embd": 32, "n_head": 4, "n_layer": 2, "block_size": 16},
  "step_count": 500,
  "weights": {
    "wte":   [[0.042, -0.11, ...], ...],  # 43×32: what each token "means"
    "wpe":   [[0.08, 0.03, ...], ...],    # 16×32: what each position "means"
    "l0.wq": [[...], ...],                # 32×32: what to look for
    ...
  },
  "adam_m": [...],  # momentum state (27,840 values)
  "adam_v": [...]   # variance state (27,840 values)
}
```
wte rows that are similar = tokens that behave similarly. If wte[C5] and wte[C6] are close, the model treats CPU 50% and 60% as interchangeable (which is correct). If wte[C9] is far from everything else, the model considers CPU 95% as unusual.
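A quick way to probe this is cosine similarity between `wte` rows. This is a hedged sketch: loading the real file and mapping token names like "C5" to row indices depends on the Language class's token ordering, so the toy vectors below stand in for actual rows:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding rows."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# With real weights you would first load them, e.g. (paths/indices assumed):
# wte = json.load(open('atoms/pulse/weights/pulse_weights.json'))['weights']['wte']
# cosine(wte[idx_C5], wte[idx_C6])  # near 1.0 if C5 and C6 behave alike

# Toy demonstration:
print(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 1.0 — identical directions
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 — unrelated
```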
```bash
# Count training data:
wc -l kiri/data/pulse_*.jsonl

# Peek at raw observations:
head -3 kiri/data/pulse_2026-02-14.jsonl

# Check weight file size:
ls -lh atoms/pulse/weights/pulse_weights.json

# Quick score check via API:
curl -s localhost:7745/api/status | python3 -m json.tool

# History check:
curl -s 'localhost:7745/api/history?n=5&atom=pulse' | python3 -m json.tool
```
What becomes possible when tiny language models can be trained on any sequential data, deployed anywhere, and looped into self-improving systems.
Rhythm already tracks keyboard/mouse idle time. Patterns in idle data correlate with sleep quality and energy levels. A 3-hour idle block starting at 11pm followed by activity at 6am = 7 hours sleep. The model learns YOUR patterns and flags deviations — no wearable needed.
Drift tracks tasks added, completed, and project switches. The ratio tells the story: 8 added / 0 completed / 5 switches = high drift score. Over time, the model learns your normal task rhythm and flags days when you're overcommitting. Early warning for burnout.
Mouse movement patterns during work blocks. High activity with few switches = deep focus. Erratic movement with frequent switches = scattered. An atom trained on movement sequences could score focus quality in real time.
When Nerve has a month of cross-atom data: "When morning Rhythm is off-pattern AND Drift shows high task switching, afternoon Pulse anomalies are 3× more likely." The model can predict infrastructure stress from behavioral data, hours in advance.
The AtomTorch module already runs on Apple Silicon's MPS GPU. Training at 27K params takes 17.6 seconds. Scale up:
| Params | n_embd | n_layer | Est. Training (500 steps) | What It Enables |
|---|---|---|---|---|
| 27,840 | 32 | 2 | 17.6s | Current: 6-token state sequences |
| ~100K | 64 | 3 | ~1 min | Longer sequences, finer buckets |
| ~500K | 128 | 4 | ~5 min | Cross-domain correlation within one atom |
| ~1M | 256 | 4 | ~15 min | Complex temporal patterns, hour-scale context |
| ~10M | 512 | 6 | ~2 hours | Day-scale patterns, multi-system awareness |
Same architecture. Same Value class, same attention, same training loop. Just bigger matrices. The code doesn't change — only the hyperparameters.
| Timeline | What Nerve Learns |
|---|---|
| Week 1 | Independent atom alerts. 3 pings for 1 situation. Annoying but data is collecting. Nerve has no training data yet. |
| Month 1 | Nerve has ~200 approve/dismiss decisions. Learns: "Pulse anomaly during Rhythm-active = real alert. Pulse anomaly during Rhythm-idle = probably a background job, suppress." False alarm rate drops 60%. |
| Month 3 | Cross-domain patterns emerge. "Monday morning Drift-spike predicts Wednesday Pulse-anomaly." Nerve starts making predictions, not just reactions. You get warnings before problems manifest. |
Anyone with a computer and Python 3 can train an atom on their own data. No cloud. No API keys. No cost. The model runs locally, learns locally, and stays local. Some ideas:
Power consumption patterns. Water usage anomaly detection. Temperature/humidity tracking. Appliance health monitoring. Garden soil moisture sequences.
Motor vibration patterns. Tool usage sequences. Production line timing anomalies. Quality metrics trending. Equipment bearing wear prediction.
Soil sensor sequences. Pump current and vibration. Irrigation flow rates. Weather pattern correlation. Crop growth stage tracking.
Water pressure and flow monitoring. Solar inverter output patterns. Network uptime tracking. Shared resource utilization. Environmental monitoring stations.