
KIRI on Hardware

A Pulse atom is 27,840 numbers. An ESP32-S3 has room for 131,000. The forward pass is multiply-and-add. A $3 microcontroller and 200 lines of C.

The Short Answer

Can a tiny transformer run on a microcontroller? Yes, with massive headroom.

What                        Number        Why it matters
Pulse atom parameters       27,840        Each an int16: 55,680 bytes ≈ 54 KB
ESP32-S3 SRAM               512 KB        55 KB model + scratch space fits 9× over
Forward pass                ~160K ops     Multiply + add; no division, no transcendentals in the critical path
Inference time (240 MHz)    <1 ms         Score one observation in under a millisecond
Chip cost                   ~$3           ESP32-S3-WROOM-1 module, volume pricing
The entire model fits in SRAM. No external memory, no SD card, no flash reads during inference. All 27,840 weights sit in RAM. The forward pass is pure arithmetic on numbers already in memory. This is why tiny models matter for hardware.
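The parameter count is easy to verify from the architecture (2 layers, 32-dim embeddings, vocab 43, block size 16); a quick Python tally:

```python
# Parameter budget for the Pulse atom
n_embd, n_layer, vocab, block = 32, 2, 43, 16

params = (
    vocab * n_embd                    # wte: token embeddings
    + block * n_embd                  # wpe: positional embeddings
    + vocab * n_embd                  # lm_head
    + n_layer * (
        4 * n_embd * n_embd           # wq, wk, wv, wo
        + (4 * n_embd) * n_embd       # f1: MLP expand
        + n_embd * (4 * n_embd)       # f2: MLP compress
    )
)
bytes_int16 = params * 2              # each weight stored as int16

print(params)                         # 27840
print(bytes_int16)                    # 55680
print(512 * 1024 // bytes_int16)      # 9 copies fit in 512 KB SRAM
```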

The Complete Circuit

A transformer, viewed as hardware. Every software operation maps to a hardware unit. The CPU in the ESP32 already has all of these.

Hardware Units

Unit               Software operation                  Hardware                        Cycles
SRAM               Weight storage (wte, wpe, layers)   On-chip SRAM bank               1 (read)
MAC array          Matrix multiply (linear layers)     Multiply-accumulate unit        N×M per matrix
ReLU comparator    max(0, x)                           Single comparison + mux         1
Softmax LUT        exp(x) / sum(exp)                   Lookup table + accumulator      V (vocab size)
RMSNorm            x / sqrt(mean(x²))                  Accumulator + reciprocal sqrt   N (embed dim)
Sequencer          Token-by-token processing loop      State machine controller        1 per token
I/O                Sensor read + score output          ADC/I2C/SPI + UART/LED          Variable

Execution Timeline (One Token)

// Processing one token through the Pulse atom (2 layers, 32-dim, 4 heads):

 1. Embedding lookup         2 reads × 32 dims            =     64 ops
 2. RMSNorm                  32 multiply + 32 add + sqrt  =    ~70 ops
 3. Attention Q,K,V          3 × (32×32 MAC)              =  3,072 ops
 4. Score + softmax          4 heads × T scores           =   ~128 ops
 5. Weighted sum + O proj    32×32 MAC                    =  1,024 ops
 6. Residual add             32 adds                      =     32 ops
 7. MLP (expand + ReLU)      128×32 MAC + 128 ReLU       =  4,224 ops
 8. MLP (compress)           32×128 MAC                   =  4,096 ops
 9. Residual add             32 adds                      =     32 ops
10. ×2 layers                (steps 2-9) × 2              =     ×2
11. lm_head + softmax        43×32 MAC + 43 exp           =  1,419 ops
────────────────────────────────────────────────────────────────────
Total: ~27,000 operations per token, ~160,000 for a 6-token sequence
At 240 MHz (ESP32-S3): <1 ms inference time
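The totals can be reproduced with a short Python tally of the same per-step estimates:

```python
# Per-token operation tally for the Pulse atom, using the timeline's estimates
n_embd, vocab = 32, 43

embed = 2 * n_embd                            # step 1: wte + wpe lookups/adds
layer = (70                                   # step 2: RMSNorm (~32 mul + 32 add + sqrt)
         + 3 * n_embd * n_embd                # step 3: Q, K, V projections
         + 128                                # step 4: scores + softmax, 4 heads
         + n_embd * n_embd                    # step 5: weighted sum + output projection
         + n_embd                             # step 6: residual add
         + 4 * n_embd * n_embd + 4 * n_embd   # step 7: MLP expand + ReLU
         + 4 * n_embd * n_embd                # step 8: MLP compress
         + n_embd)                            # step 9: residual add
head = vocab * n_embd + vocab                 # step 11: lm_head MACs + exp per logit

per_token = embed + 2 * layer + head          # step 10: two layers
print(per_token)        # 26839 (the ~27,000 estimate)
print(per_token * 6)    # 161034 (~160,000 for a 6-token sequence)
```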

ESP32-S3 Fit Analysis

Detailed breakdown of whether the Pulse atom fits on an ESP32-S3-WROOM-1 module.

Resource         Available                Required                           Headroom
SRAM             512 KB                   ~55 KB (weights as int16)          9.3×
Flash            4–16 MB                  ~56 KB (weight file)               70×+
Clock            240 MHz (Xtensa LX7)     ~160K ops per inference            <1 ms
RAM (scratch)    ~400 KB free after OS    ~8 KB (intermediates + KV cache)   50×
Power            3.3V, ~240 mA active     Inference burst <1 ms              Sleep between readings
Cost             ~$3 (module)             $8–15 total BOM
The model is 1/9th of available SRAM. You could fit 9 Pulse atoms, or 3 different atoms (Pulse + Rhythm + Drift) with room to spare. The ESP32-S3 also has WiFi and Bluetooth, so it can phone home with anomaly scores.

PCB Design

A real board layout for a KIRI sensor node. 4-layer PCB, 50mm × 35mm.

Bill of Materials

Component            Part                              Est. cost
MCU module           ESP32-S3-WROOM-1 (4 MB flash)     $3.00
USB-C connector      USB-C 2.0 receptacle              $0.30
Voltage regulator    AMS1117-3.3 (SOT-223)             $0.15
Status LEDs (3×)     0603 green/amber/red              $0.10
Resistors (6×)       0603 assorted                     $0.05
Capacitors (4×)      0603 100 nF + 10 μF               $0.10
Sensor headers       2.54 mm 2×4 pin header            $0.20
PCB                  4-layer, 50×35 mm (5 pcs)         $1.50
Total                                                  ~$5.40
Under $6 per node. At volume (100+), the PCB cost drops to ~$0.50 and component costs drop further. A 10-node deployment for monitoring an entire site costs less than a month of a typical cloud monitoring subscription.

The C Firmware

The forward pass in C. Fixed-point int16 arithmetic — no floating point needed. Train on a computer in Python, export the weights as binary, flash to the ESP32.
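As a warm-up for the C listing below, the Q10 convention in a few lines of Python (a sketch; `to_q10` and `q10_mul` are illustrative names, not firmware functions):

```python
# Q10 fixed-point basics: a value x is stored as round(x * 1024)
FRAC_BITS = 10
SCALE = 1 << FRAC_BITS          # 1024

def to_q10(x: float) -> int:
    # Clamp into int16 range, as the export pipeline does
    return max(-32768, min(32767, round(x * SCALE)))

def q10_mul(a: int, b: int) -> int:
    # Product of two Q10 numbers is Q20; shift back down to Q10
    return (a * b) >> FRAC_BITS

a, b = to_q10(1.5), to_q10(-0.25)      # 1536, -256
print(q10_mul(a, b) / SCALE)           # -0.375
```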

// kiri_atom.h — Fixed-point transformer forward pass
// Weights are int16_t, scaled by 2^10 (10 fractional bits)

#include <stdint.h>
#include <string.h>   // memcpy

#define N_EMBD    32
#define N_HEAD    4
#define N_LAYER   2
#define VOCAB     43
#define BLOCK_SZ  16
#define HEAD_DIM  (N_EMBD / N_HEAD)  // 8
#define FRAC_BITS 10
#define SCALE     (1 << FRAC_BITS)    // 1024

// Weight matrices (stored in flash, loaded to SRAM on boot)
typedef struct {
    int16_t wte[VOCAB][N_EMBD];          // 43 × 32 = 1,376
    int16_t wpe[BLOCK_SZ][N_EMBD];      // 16 × 32 = 512
    int16_t lm_head[VOCAB][N_EMBD];     // 43 × 32 = 1,376
    // Per layer: wq, wk, wv, wo, f1, f2
    int16_t wq[N_LAYER][N_EMBD][N_EMBD];
    int16_t wk[N_LAYER][N_EMBD][N_EMBD];
    int16_t wv[N_LAYER][N_EMBD][N_EMBD];
    int16_t wo[N_LAYER][N_EMBD][N_EMBD];
    int16_t f1[N_LAYER][4*N_EMBD][N_EMBD];
    int16_t f2[N_LAYER][N_EMBD][4*N_EMBD];
} AtomWeights;

// Fixed-point linear layer: out[nout] = W[nout][nin] @ x[nin]
static void linear(int32_t *out, const int16_t *W,
                   const int32_t *x, int nout, int nin) {
    for (int i = 0; i < nout; i++) {
        int32_t acc = 0;
        for (int j = 0; j < nin; j++) {
            acc += (int32_t)W[i * nin + j] * (x[j] >> (FRAC_BITS/2));
        }
        out[i] = acc >> (FRAC_BITS/2);  // keep in Q10 range
    }
}

// ReLU: max(0, x)
static void relu(int32_t *x, int n) {
    for (int i = 0; i < n; i++)
        if (x[i] < 0) x[i] = 0;
}

// Implemented elsewhere in the firmware (elided in this listing):
void rmsnorm(int32_t *x, int n);                    // x /= sqrt(mean(x^2))
int32_t neg_log_softmax(const int32_t *logits, int target, int n);

// Score one observation: returns anomaly score (fixed-point)
int32_t atom_score(const AtomWeights *w, const uint8_t *tokens, int n_tokens) {
    int32_t x[N_EMBD], xr[N_EMBD], tmp[4*N_EMBD];
    int32_t total_score = 0;

    // KV cache for attention
    int32_t k_cache[N_LAYER][BLOCK_SZ][N_EMBD];
    int32_t v_cache[N_LAYER][BLOCK_SZ][N_EMBD];

    for (int pos = 0; pos < n_tokens - 1; pos++) {
        // Embedding: x = wte[token] + wpe[pos]
        for (int j = 0; j < N_EMBD; j++)
            x[j] = ((int32_t)w->wte[tokens[pos]][j]
                  + (int32_t)w->wpe[pos][j]) << (FRAC_BITS/2);

        // rmsnorm(x) — normalize
        rmsnorm(x, N_EMBD);

        // Transformer layers
        for (int li = 0; li < N_LAYER; li++) {
            memcpy(xr, x, sizeof(x));  // save for residual
            rmsnorm(x, N_EMBD);

            // Attention: Q*K^T/sqrt(d) -> softmax -> *V
            // ... (multi-head attention with KV cache) ...

            // MLP: expand -> relu -> compress
            linear(tmp, (int16_t*)w->f1[li], x, 4*N_EMBD, N_EMBD);
            relu(tmp, 4*N_EMBD);
            linear(x, (int16_t*)w->f2[li], tmp, N_EMBD, 4*N_EMBD);

            // Residual connection
            for (int j = 0; j < N_EMBD; j++) x[j] += xr[j];
        }

        // lm_head -> score
        int32_t logits[VOCAB];
        linear(logits, (int16_t*)w->lm_head, x, VOCAB, N_EMBD);

        // -log(softmax(logits)[target]) = surprise
        total_score += neg_log_softmax(logits, tokens[pos+1], VOCAB);
    }

    return total_score / (n_tokens - 1);
}
Train in Python. Deploy in C. The Python Atom trains with full autograd (floating point, gradient tracking). Export weights, quantize to int16, flash to ESP32. The C firmware only does the forward pass — no backward pass, no gradients, no optimizer. Just multiply-and-add.
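The surprise score that `atom_score` accumulates has a compact floating-point reference, useful when cross-checking the C port (a sketch; the function name mirrors the one the C code calls):

```python
import math

def neg_log_softmax(logits, target):
    # Surprise = -log(softmax(logits)[target]); numerically stable
    # via the usual subtract-the-max trick.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

# A confident prediction yields low surprise; a wrong one, high surprise.
logits = [0.1, 4.0, 0.2]
print(neg_log_softmax(logits, 1))  # small: model expected token 1
print(neg_log_softmax(logits, 0))  # large: token 0 was a surprise
```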

Weight Export Pipeline

# Python: export weights as binary for ESP32
import struct, json

with open('pulse_weights.json') as f:
    data = json.load(f)

with open('pulse_weights.bin', 'wb') as f:
    for name in ['wte','wpe','lm_head','l0.wq','l0.wk', ...]:
        for row in data['weights'][name]:
            for val in row:
                # Quantize float to int16 (Q10 fixed-point)
                q = int(round(val * 1024))
                q = max(-32768, min(32767, q))
                f.write(struct.pack('<h', q))

# Flash to ESP32:
# esptool.py write_flash 0x100000 pulse_weights.bin
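Q10 quantization introduces at most about half a step of error per weight (1/2048 ≈ 0.0005); a quick roundtrip check of the export format, using the same pack/unpack convention as above:

```python
import struct

def quantize_q10(val: float) -> int:
    q = int(round(val * 1024))
    return max(-32768, min(32767, q))   # clamp into int16

for w in [0.7311, -0.0042, 1.9999, -31.5]:
    q = quantize_q10(w)
    packed = struct.pack('<h', q)                     # little-endian int16, as flashed
    restored = struct.unpack('<h', packed)[0] / 1024.0
    # Within half a quantization step of the original float
    assert abs(restored - w) <= 1 / 2048 + 1e-12
    print(f"{w:+.4f} -> {q:+6d} -> {restored:+.4f}")
```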

Real Applications

A KIRI node is a $6 board that learns what "normal" looks like for any sensor and alerts on anomalies. No cloud, no subscription, no rules to write. Here's where that matters.

ISP / Telecom Tower Site

Sensors: Voltage, temperature, router CPU, link quality
Tokens: V0-V4 T0-T9 C0-C9 Q0-Q4
Detects: Power supply degradation, overheating patterns, unusual traffic at 3am, link flapping before total failure. The model learns each tower's baseline and flags deviations specific to that site.

Agricultural Pump Station

Sensors: Motor current (CT clamp), vibration (accelerometer), flow rate
Tokens: I0-I9 V0-V9 F0-F4
Detects: Bearing wear (vibration increases weeks before failure), dry running (flow drops but current stays high), blockages. A $6 node saves a $500 pump.

Solar Inverter Monitoring

Sensors: DC input voltage, AC output, efficiency ratio, temperature
Tokens: D0-D9 A0-A9 E0-E4 T0-T9
Detects: Panel degradation (gradual efficiency drop), inverter faults (output waveform anomalies), shading patterns. Learns diurnal patterns and flags real problems vs normal cloud cover.

Community Water System

Sensors: Pressure transducer, flow meter, chlorine sensor
Tokens: P0-P9 F0-F9 C0-C4
Detects: Pipe leaks (pressure drops), demand surges, treatment system failures. Each node monitors a section of pipe. Anomaly at node 3 but not node 4 = leak between them.

Home Appliance Health

Sensors: Current transformer on power line
Tokens: W0-W9 (power draw buckets)
Detects: Fridge compressor degradation (power draw increases over months), washing machine bearing wear, heater element failure. A single CT clamp per appliance, $6 per monitor.

Workshop / Factory Motor

Sensors: Vibration (ADXL345), current (ACS712), temperature (NTC)
Tokens: V0-V9 I0-I9 T0-T9
Detects: Bearing wear signatures, belt slippage, overload conditions. The model learns each motor's vibration fingerprint. Anomaly = maintenance needed before catastrophic failure.

Steps to Build

From proof-of-concept to deployed hardware in four stages.

Prove It

Get an ESP32-S3 dev board (~$8). Port the forward pass to C. Load weights exported from Python. Run the same test observations through both Python and C. Compare scores — they should match within rounding tolerance (int16 quantization introduces small differences). If scores match, the port works.

// ESP32 Arduino setup:
#include "kiri_atom.h"
#include "pulse_weights.h"  // generated binary header

void setup() {
    Serial.begin(115200);
    uint8_t tokens[] = {0, 5, 17, 24, 31, 36, 42}; // BOS C5 M7 D4 S1 L1 N1
    int32_t score = atom_score(&weights, tokens, 7);
    Serial.printf("Score: %d.%03d\n", score/SCALE, (score%SCALE)*1000/SCALE);
}

void loop() {}  // one-shot smoke test: scoring runs once in setup()
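The float-vs-int16 comparison can be prototyped entirely in Python by mirroring the `linear()` shift strategy against a float reference (a sketch on random data; the 0.5 tolerance is an assumption for this example, not a firmware spec):

```python
import random

random.seed(0)
nin, nout = 32, 32
W = [[random.uniform(-1, 1) for _ in range(nin)] for _ in range(nout)]
x = [random.uniform(-1, 1) for _ in range(nin)]

Wq = [[round(w * 1024) for w in row] for row in W]   # Q10 weights (int16 range)
xq = [round(v * 1024) for v in x]                    # Q10 activations

float_out = [sum(W[i][j] * x[j] for j in range(nin)) for i in range(nout)]
# Mirror the C code: >>5 on the input, >>5 on the accumulator
fixed_out = [sum(Wq[i][j] * (xq[j] >> 5) for j in range(nin)) >> 5
             for i in range(nout)]

for f, q in zip(float_out, fixed_out):
    assert abs(f - q / 1024) < 0.5   # truncation error stays small here
print("int16 port matches float reference within tolerance")
```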

Add a Sensor

Wire an I2C or ADC sensor to the dev board. Collect 24 hours of data (one reading every 30 seconds = 2,880 observations). Transfer to computer. Train an atom on the data. Export weights back to ESP32. Now the device can score its own sensor readings in real time.
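Turning raw readings into tokens like T0-T9 is plain bucketization over an expected range; a minimal sketch (the ranges and names here are illustrative):

```python
def tokenize(value: float, lo: float, hi: float, n_buckets: int, prefix: str) -> str:
    """Map a raw sensor reading onto one of n_buckets tokens, e.g. T0..T9."""
    clamped = min(max(value, lo), hi)
    # Scale into [0, n_buckets]; the top of the range maps to the last bucket
    idx = min(int((clamped - lo) / (hi - lo) * n_buckets), n_buckets - 1)
    return f"{prefix}{idx}"

# Temperature in 20-60 C mapped to T0-T9
print(tokenize(21.0, 20, 60, 10, "T"))  # T0
print(tokenize(43.5, 20, 60, 10, "T"))  # T5
print(tokenize(75.0, 20, 60, 10, "T"))  # T9 (clamped to the top bucket)
```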

Design the PCB

Open KiCad. Place the ESP32-S3 module, USB-C for power/programming, voltage regulator, sensor headers, and status LEDs. Route on a 4-layer board (signal/ground/power/signal). Generate Gerber files. Upload to a PCB manufacturer. 5 boards for ~$8 shipped.

Deploy

Mount the node where the sensor needs to be. Power via USB-C or battery. Let it collect for 48 hours (building a baseline). Train an atom on the collected data. Flash the weights back. The node now runs independently — collecting, scoring, and alerting over WiFi when anomalies are detected. No cloud subscription. No ongoing cost.
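Alerting logic can be as simple as a running baseline over the scores; a sketch of one option (EMA mean/variance with a sigma threshold; the constants are assumptions, not part of KIRI):

```python
class AnomalyAlert:
    """Exponential moving baseline over scores; alert on large deviations."""
    def __init__(self, alpha: float = 0.05, threshold: float = 4.0, warmup: int = 10):
        self.alpha, self.threshold, self.warmup = alpha, threshold, warmup
        self.mean, self.var, self.n = None, 0.0, 0

    def update(self, score: float) -> bool:
        self.n += 1
        if self.mean is None:                 # first observation seeds the baseline
            self.mean = score
            return False
        dev = score - self.mean
        alert = (self.n > self.warmup and self.var > 0
                 and abs(dev) > self.threshold * self.var ** 0.5)
        self.mean += self.alpha * dev                              # EMA of mean
        self.var = (1 - self.alpha) * (self.var + self.alpha * dev * dev)
        return alert

mon = AnomalyAlert()
for s in [1.0, 1.1, 0.9, 1.0, 1.05, 0.95] * 5:
    assert not mon.update(s)                  # normal scores: no alert
print(mon.update(4.0))                        # True: far outside the baseline
```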

Atoms as Transistors

The pattern repeats at every level of computing. Each level = many of the level below, hiding complexity and enabling new capabilities.

The Abstraction Ladder

Level    Unit          Made of                      New capability
1        Transistor    Silicon + dopants            Amplification, switching
2        Logic gate    ~6 transistors               Boolean logic (AND, OR, NOT)
3        CPU           ~1B gates                    General computation
4        Atom          ~28K parameters on a CPU     Pattern recognition, anomaly detection
5        Molecule      Multiple atoms + Nerve       Cross-domain reasoning, decisions
6        Organism      Many molecules, networked    Distributed intelligence, adaptation

The Economics

Giant Model (Cloud)

Cost: $100M+ to train, $0.01+ per inference
Location: Data center
Capability: General-purpose, impressive at everything
Ownership: Rented access
Privacy: Data leaves your network

1000 Tiny Models (Local)

Cost: $0 to train, $0 per inference
Location: On each device
Capability: Specialist, excellent at one thing
Ownership: Yours completely
Privacy: Data never leaves the device

Transistors replaced vacuum tubes where small, cheap, and low-power mattered. That turned out to be almost everything. Giant models are vacuum tubes — powerful, expensive, centralized. Tiny models are transistors — cheap, local, composable. They won't replace large models for general reasoning. But for pattern detection on specific data, running locally, at zero ongoing cost? Tiny models will be everywhere. The same way transistors are everywhere.
KIRI — an Eryx Labs project