A Pulse atom is 27,840 numbers. An ESP32-S3 has room for 131,000. The forward pass is multiply-and-add. A $3 microcontroller and 200 lines of C.
Can a tiny transformer run on a microcontroller? Yes, with massive headroom.
| What | Number | Why it matters |
|---|---|---|
| Pulse atom parameters | 27,840 | Stored as int16: 27,840 × 2 = 55,680 bytes ≈ 54 KB |
| ESP32-S3 SRAM | 512 KB | 55 KB model + scratch space = fits 9× over |
| Forward pass operations | ~160K ops | Multiply-and-add throughout. No division, no transcendentals in the critical path |
| Inference time (240 MHz) | <1 ms | Score one observation in under a millisecond |
| Chip cost | ~$3 | ESP32-S3-WROOM-1 module, volume pricing |
A transformer, viewed as hardware. Every software operation maps to a hardware unit. The CPU in the ESP32 already has all of these.
| Unit | Software Operation | Hardware | Cycles |
|---|---|---|---|
| SRAM | Weight storage (wte, wpe, layers) | On-chip SRAM bank | 1 (read) |
| MAC Array | Matrix multiply (linear layers) | Multiply-accumulate unit | N×M per matrix |
| ReLU Comparator | max(0, x) | Single comparison + mux | 1 |
| Softmax LUT | exp(x) / sum(exp) | Lookup table + accumulator | V (vocab size) |
| RMSNorm | x / sqrt(mean(x²)) | Accumulator + reciprocal sqrt | N (embed dim) |
| Sequencer | Token-by-token processing loop | State machine controller | 1 per token |
| I/O | Sensor read + score output | ADC/I2C/SPI + UART/LED | Variable |
```
Processing one token through the Pulse atom (2 layers, 32-dim, 4 heads):

 1. Embedding lookup        2 reads × 32 dims        =     64 ops
 2. RMSNorm                 32 mul + 32 add + sqrt   =    ~70 ops
 3. Attention Q,K,V         3 × (32×32 MAC)          =  3,072 ops
 4. Score + softmax         4 heads × T scores       =   ~128 ops
 5. Weighted sum + O proj   32×32 MAC                =  1,024 ops
 6. Residual add            32 adds                  =     32 ops
 7. MLP (expand + ReLU)     128×32 MAC + 128 relu    =  4,224 ops
 8. MLP (compress)          32×128 MAC               =  4,096 ops
 9. Residual add            32 adds                  =     32 ops
10. ×2 layers               (steps 2–9) × 2          =     ×2
11. lm_head + softmax       43×32 MAC + 43 exp       =  1,419 ops
────────────────────────────────────────────────────────────────
Total: ~27,000 operations per token, ~160,000 for a 6-token sequence
At 240 MHz (ESP32-S3): <1 ms inference time
```
Detailed breakdown of whether the Pulse atom fits on an ESP32-S3-WROOM-1 module.
| Resource | Available | Required | Headroom |
|---|---|---|---|
| SRAM | 512 KB | ~55 KB (weights as int16) | 9.3× |
| Flash | 4–16 MB | ~56 KB (weight file) | 70×+ |
| Clock | 240 MHz (Xtensa LX7) | ~160K ops per inference | <1 ms |
| RAM (scratch) | ~400 KB free after OS | ~8 KB (intermediates + KV cache) | 50× |
| Power | 3.3V, ~240 mA active | Inference burst <1 ms | Sleep between readings |
| Cost | — | ~$3 (module) | $8–15 total BOM |
A real board layout for a KIRI sensor node. 4-layer PCB, 50mm × 35mm.
| Component | Part | Est. Cost |
|---|---|---|
| MCU Module | ESP32-S3-WROOM-1 (4MB flash) | $3.00 |
| USB-C Connector | USB-C 2.0 receptacle | $0.30 |
| Voltage Regulator | AMS1117-3.3 (SOT-223) | $0.15 |
| Status LEDs (3x) | 0603 Green/Amber/Red | $0.10 |
| Resistors (6x) | 0603 assorted | $0.05 |
| Capacitors (4x) | 0603 100nF + 10μF | $0.10 |
| Sensor Headers | 2.54mm 2x4 pin header | $0.20 |
| PCB | 4-layer, 50×35mm (5 pcs) | $1.50 |
| Total | — | ~$5.40 |
The forward pass in C. Fixed-point int16 arithmetic — no floating point needed. Train on computer (Python), export weights as binary, flash to ESP32.
```c
// kiri_atom.h — Fixed-point transformer forward pass
// Weights are int16_t, scaled by 2^10 (10 fractional bits)
#include <stdint.h>
#include <string.h>   // memcpy

#define N_EMBD     32
#define N_HEAD     4
#define N_LAYER    2
#define VOCAB      43
#define BLOCK_SZ   16
#define HEAD_DIM   (N_EMBD / N_HEAD)   // 8
#define FRAC_BITS  10
#define SCALE      (1 << FRAC_BITS)    // 1024

// Weight matrices (stored in flash, loaded to SRAM on boot)
typedef struct {
    int16_t wte[VOCAB][N_EMBD];        // 43 × 32 = 1,376
    int16_t wpe[BLOCK_SZ][N_EMBD];     // 16 × 32 = 512
    int16_t lm_head[VOCAB][N_EMBD];    // 43 × 32 = 1,376
    // Per layer: wq, wk, wv, wo, f1, f2
    int16_t wq[N_LAYER][N_EMBD][N_EMBD];
    int16_t wk[N_LAYER][N_EMBD][N_EMBD];
    int16_t wv[N_LAYER][N_EMBD][N_EMBD];
    int16_t wo[N_LAYER][N_EMBD][N_EMBD];
    int16_t f1[N_LAYER][4*N_EMBD][N_EMBD];
    int16_t f2[N_LAYER][N_EMBD][4*N_EMBD];
} AtomWeights;

// Implemented elsewhere in the firmware (reciprocal-sqrt and exp LUT,
// per the hardware-mapping table above)
static void rmsnorm(int32_t *x, int n);
static int32_t neg_log_softmax(const int32_t *logits, int target, int n);

// Fixed-point linear layer: out[nout] = W[nout][nin] @ x[nin]
static void linear(int32_t *out, const int16_t *W, const int32_t *x,
                   int nout, int nin) {
    for (int i = 0; i < nout; i++) {
        int32_t acc = 0;
        for (int j = 0; j < nin; j++) {
            acc += (int32_t)W[i * nin + j] * (x[j] >> (FRAC_BITS/2));
        }
        out[i] = acc >> (FRAC_BITS/2);  // keep in Q10 range
    }
}

// ReLU: max(0, x)
static void relu(int32_t *x, int n) {
    for (int i = 0; i < n; i++)
        if (x[i] < 0) x[i] = 0;
}

// Score one observation: returns anomaly score (fixed-point)
int32_t atom_score(const AtomWeights *w, const uint8_t *tokens, int n_tokens) {
    int32_t x[N_EMBD], xr[N_EMBD], tmp[4*N_EMBD];
    int32_t total_score = 0;

    // KV cache for attention (filled in the elided attention code)
    int32_t k_cache[N_LAYER][BLOCK_SZ][N_EMBD];
    int32_t v_cache[N_LAYER][BLOCK_SZ][N_EMBD];

    for (int pos = 0; pos < n_tokens - 1; pos++) {
        // Embedding: x = wte[token] + wpe[pos]
        for (int j = 0; j < N_EMBD; j++)
            x[j] = ((int32_t)w->wte[tokens[pos]][j] +
                    (int32_t)w->wpe[pos][j]) << (FRAC_BITS/2);

        rmsnorm(x, N_EMBD);  // normalize

        // Transformer layers
        for (int li = 0; li < N_LAYER; li++) {
            memcpy(xr, x, sizeof(x));   // save for residual
            rmsnorm(x, N_EMBD);

            // Attention: Q*K^T/sqrt(d) -> softmax -> *V
            // ... (multi-head attention with KV cache) ...

            // MLP: expand -> relu -> compress
            linear(tmp, (int16_t*)w->f1[li], x, 4*N_EMBD, N_EMBD);
            relu(tmp, 4*N_EMBD);
            linear(x, (int16_t*)w->f2[li], tmp, N_EMBD, 4*N_EMBD);

            // Residual connection
            for (int j = 0; j < N_EMBD; j++) x[j] += xr[j];
        }

        // lm_head -> score
        int32_t logits[VOCAB];
        linear(logits, (int16_t*)w->lm_head, x, VOCAB, N_EMBD);

        // -log(softmax(logits)[target]) = surprise
        total_score += neg_log_softmax(logits, tokens[pos+1], VOCAB);
    }
    return total_score / (n_tokens - 1);
}
```
```python
# Python: export weights as binary for ESP32
import struct, json

with open('pulse_weights.json') as f:
    data = json.load(f)

with open('pulse_weights.bin', 'wb') as f:
    for name in ['wte', 'wpe', 'lm_head', 'l0.wq', 'l0.wk', ...]:
        for row in data['weights'][name]:
            for val in row:
                # Quantize float to int16 (Q10 fixed-point)
                q = int(round(val * 1024))
                q = max(-32768, min(32767, q))
                f.write(struct.pack('<h', q))

# Flash to ESP32:
#   esptool.py write_flash 0x100000 pulse_weights.bin
```
A KIRI node is a $6 board that learns what "normal" looks like for any sensor and alerts on anomalies. No cloud, no subscription, no rules to write. Here's where that matters.
Sensors: Voltage, temperature, router CPU, link quality
Tokens: V0-V4 T0-T9 C0-C9 Q0-Q4
Detects: Power supply degradation, overheating patterns, unusual traffic at 3am, link flapping before total failure. The model learns each tower's baseline and flags deviations specific to that site.
Sensors: Motor current (CT clamp), vibration (accelerometer), flow rate
Tokens: I0-I9 V0-V9 F0-F4
Detects: Bearing wear (vibration increases weeks before failure), dry running (flow drops but current stays high), blockages. A $6 node saves a $500 pump.
Sensors: DC input voltage, AC output, efficiency ratio, temperature
Tokens: D0-D9 A0-A9 E0-E4 T0-T9
Detects: Panel degradation (gradual efficiency drop), inverter faults (output waveform anomalies), shading patterns. Learns diurnal patterns and flags real problems vs normal cloud cover.
Sensors: Pressure transducer, flow meter, chlorine sensor
Tokens: P0-P9 F0-F9 C0-C4
Detects: Pipe leaks (pressure drops), demand surges, treatment system failures. Each node monitors a section of pipe. Anomaly at node 3 but not node 4 = leak between them.
Sensors: Current transformer on power line
Tokens: W0-W9 (power draw buckets)
Detects: Fridge compressor degradation (power draw increases over months), washing machine bearing wear, heater element failure. A single CT clamp per appliance, $6 per monitor.
Sensors: Vibration (ADXL345), current (ACS712), temperature (NTC)
Tokens: V0-V9 I0-I9 T0-T9
Detects: Bearing wear signatures, belt slippage, overload conditions. The model learns each motor's vibration fingerprint. Anomaly = maintenance needed before catastrophic failure.
From proof-of-concept to deployed hardware in four stages.
Get an ESP32-S3 dev board (~$8). Port the forward pass to C. Load weights exported from Python. Run the same test observations through both Python and C. Compare scores — they should match within rounding tolerance (int16 quantization introduces small differences). If scores match, the port works.
```c
// ESP32 Arduino sketch: score one test observation on boot
#include "kiri_atom.h"
#include "pulse_weights.h"  // generated binary header

void setup() {
    Serial.begin(115200);
    uint8_t tokens[] = {0, 5, 17, 24, 31, 36, 42};  // BOS C5 M7 D4 S1 L1 N1
    int32_t score = atom_score(&weights, tokens, 7);
    Serial.printf("Score: %d.%03d\n",
                  score / SCALE, (score % SCALE) * 1000 / SCALE);
}

void loop() {}  // nothing to do after the one-shot test
```
Wire an I2C or ADC sensor to the dev board. Collect 24 hours of data (one reading every 30 seconds = 2,880 observations). Transfer to computer. Train an atom on the data. Export weights back to ESP32. Now the device can score its own sensor readings in real time.
Open KiCad. Place the ESP32-S3 module, USB-C for power/programming, voltage regulator, sensor headers, and status LEDs. Route on a 4-layer board (signal/ground/power/signal). Generate Gerber files. Upload to a PCB manufacturer. 5 boards for ~$8 shipped.
Mount the node where the sensor needs to be. Power via USB-C or battery. Let it collect for 48 hours (building a baseline). Train an atom on the collected data. Flash the weights back. The node now runs independently — collecting, scoring, and alerting over WiFi when anomalies are detected. No cloud subscription. No ongoing cost.
The pattern repeats at every level of computing. Each level = many of the level below, hiding complexity and enabling new capabilities.
| Level | Unit | Made of | New capability |
|---|---|---|---|
| 1 | Transistor | Silicon + dopants | Amplification, switching |
| 2 | Logic Gate | ~6 transistors | Boolean logic (AND, OR, NOT) |
| 3 | CPU | ~1B gates | General computation |
| 4 | Atom | ~28K parameters on a CPU | Pattern recognition, anomaly detection |
| 5 | Molecule | Multiple atoms + Nerve | Cross-domain reasoning, decisions |
| 6 | Organism | Many molecules, networked | Distributed intelligence, adaptation |
| | Data-center LLM | KIRI atom |
|---|---|---|
| Cost | $100M+ to train, $0.01+ per inference | $0 to train, $0 per inference |
| Location | Data center | On each device |
| Capability | General-purpose, impressive at everything | Specialist, excellent at one thing |
| Ownership | Rented access | Yours completely |
| Privacy | Data leaves your network | Data never leaves the device |