JARVIS: Beating Cloud Latency with ESP32 and Local AI
Architecting a fully autonomous, local-first voice assistant running on an ESP32 microcontroller with ASR, LLM reasoning, a 5-layer persistent vector memory system, and Model Context Protocol (MCP) integration.
The Latency Trap of the Smart Home
IoT hardware startups rely heavily on third-party voice assistants (Alexa, Google Home) that pipe raw audio to the cloud. This creates an inherently flawed user experience. The round-trip sequence involves: capturing audio on the device → sending it to a cloud server → transcribing it (STT) → querying an LLM → generating speech (TTS) → streaming it back down to the microcontroller.
This process typically introduces a delay of 3 to 5 seconds, shattering the illusion of intelligence. Furthermore, these assistants suffer from severe "amnesia": they lack deep, persistent contextual memory across long periods of interaction.
The Goal: A "Savior" Architecture for Edge AI
The objective for JARVIS was twofold: drop the latency to near-human conversational speeds (under 500 ms) with a constrained $5 ESP32 chip as the client, and equip the agent with a persistent memory architecture that remembers conversations from weeks ago.
1. The Hardware-Software Bridge (C++ & WebSockets)
Instead of relying on clunky HTTP polling, I programmed the ESP32 in C++ to capture raw I2S audio and stream it continuously over a persistent WebSocket connection. A local Python server ingested this stream in real time. Because the connection stayed open and the audio was buffered at the edge, the TLS handshake was paid once per session rather than once per query.
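On the server side, WebSocket messages rarely arrive in neat frame-sized pieces, so the stream has to be re-chunked before it reaches the speech model. The following is a minimal sketch of that step, not the actual JARVIS code. The audio format (16 kHz, 16-bit mono PCM), the 20 ms frame size, and the `FrameBuffer` name are all assumptions made for illustration.

```python
from typing import Iterator


class FrameBuffer:
    """Re-chunks arbitrary-size binary messages into fixed-size PCM frames."""

    def __init__(self, sample_rate: int = 16_000, frame_ms: int = 20,
                 bytes_per_sample: int = 2) -> None:
        # Assumed format: 16 kHz, 16-bit mono PCM (not confirmed by the article).
        self.frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
        self._pending = bytearray()

    def feed(self, message: bytes) -> Iterator[bytes]:
        """Append one binary WebSocket message and yield each complete frame."""
        self._pending.extend(message)
        while len(self._pending) >= self.frame_bytes:
            frame = bytes(self._pending[:self.frame_bytes])
            del self._pending[:self.frame_bytes]
            yield frame

    def pending_bytes(self) -> int:
        """Bytes received so far that do not yet fill a whole frame."""
        return len(self._pending)


if __name__ == "__main__":
    buf = FrameBuffer()
    frames = list(buf.feed(b"\x00" * 1000))
    print(len(frames), buf.pending_bytes())
```

Each binary message from the ESP32 would be passed to `feed`, and the resulting frames handed to the transcription stage. Leftover bytes stay in the buffer until the next message completes the frame.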
A critical engineering hurdle was handling "Barge-ins" (interruptions). When a user spoke while JARVIS was answering, the C++ client instantly detected the amplitude spike on the microphone, sent a "HALT" packet over the WebSocket, and the Python server immediately killed the TTS generation. This allowed the agent to ingest the new context naturally, without losing its train of thought.
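The barge-in flow has two halves: spotting the amplitude spike and stopping speech when the "HALT" packet lands. The sketch below shows both in Python for readability, even though the detection described above runs in C++ on the ESP32. The RMS threshold, the `TtsSession` class, and the use of a cancellable asyncio task for TTS are assumptions for illustration, not the project's actual implementation.

```python
import asyncio
import math
import struct


def rms_pcm16(frame: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_barge_in(frame: bytes, threshold: float = 1500.0) -> bool:
    """True when the frame is loud enough to count as an interruption.

    The threshold is a hypothetical value; it would need tuning per microphone.
    """
    return rms_pcm16(frame) > threshold


class TtsSession:
    """Hypothetical wrapper that runs speech output as a cancellable task."""

    def __init__(self) -> None:
        self._task = None
        self.spoken = []

    async def _speak(self, sentences) -> None:
        for sentence in sentences:
            await asyncio.sleep(0.05)  # stand-in for synthesis and streaming
            self.spoken.append(sentence)

    def start(self, sentences) -> None:
        self._task = asyncio.create_task(self._speak(sentences))

    async def handle_control(self, message: str) -> bool:
        """Cancel playback on a HALT packet; return True if something was cancelled."""
        if message != "HALT" or self._task is None or self._task.done():
            return False
        self._task.cancel()
        try:
            await self._task
        except asyncio.CancelledError:
            pass
        return True


async def demo():
    session = TtsSession()
    session.start(["one", "two", "three"])
    await asyncio.sleep(0.07)  # the user interrupts partway through
    halted = await session.handle_control("HALT")
    return halted, session.spoken


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Running TTS as its own task is what makes the stop feel immediate: cancellation takes effect at the next await point, so no queued sentence is spoken after the interruption.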
2. The Orchestration Backend (Python & Local AI)
As audio packets hit the Python server, a local instance of OpenAI's Whisper model transcribed the speech with minimal delay. This text was fed into the Gemini reasoning engine. To give the assistant real-world agency, I implemented the Model Context Protocol (MCP), allowing the LLM to trigger actions such as turning on lights or querying a weather API through tool calls.
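MCP tool invocations travel as JSON-RPC 2.0 messages with the method `tools/call`, carrying a tool name and its arguments. The sketch below hand-rolls a tiny dispatcher to show the message shape. It is a simplified illustration rather than the official MCP SDK or the JARVIS code, and the `set_light` tool, the `tool` decorator, and `handle_request` are hypothetical names.

```python
import json
from typing import Callable

# Registry of hypothetical tools the model is allowed to call.
TOOLS: dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("set_light")
def set_light(room: str, on: bool) -> str:
    # A real handler would drive a relay or call a smart-home API here.
    return f"{room} light turned {'on' if on else 'off'}"


def _error(req_id, code: int, message: str) -> str:
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "error": {"code": code, "message": message}})


def handle_request(raw: str) -> str:
    """Handle one JSON-RPC request and return the JSON-RPC response."""
    req = json.loads(raw)
    req_id = req.get("id")
    if req.get("method") != "tools/call":
        return _error(req_id, -32601, "Method not found")
    params = req.get("params", {})
    fn = TOOLS.get(params.get("name"))
    if fn is None:
        return _error(req_id, -32602, f"Unknown tool: {params.get('name')}")
    try:
        text, is_error = fn(**params.get("arguments", {})), False
    except Exception as exc:  # tool failures are reported inside the result
        text, is_error = str(exc), True
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "result": {"content": [{"type": "text", "text": text}],
                   "isError": is_error},
    })


if __name__ == "__main__":
    request = json.dumps({
        "jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "set_light",
                   "arguments": {"room": "kitchen", "on": True}},
    })
    print(handle_request(request))
```

Keeping the tool registry separate from the transport means new hardware actions can be added without touching the request handling.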
3. The 5-Layer Persistent Memory (ChromaDB)
To solve the amnesia problem without overflowing the LLM’s context window, I architected a multi-tiered memory system:
- Short-Term: The immediate conversation history kept in RAM.
- Working Memory: A rolling summary of the current session.
- Long-Term (ChromaDB): Background workers summarized completed sessions and embedded the facts, preferences, and key events into a local vector database. When a user asked, "What was that book I mentioned last week?", the Python backend executed a semantic vector search before prompting the LLM, retrieving the exact context instantly.
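The three tiers listed above can be sketched as follows. To stay runnable with only the standard library, this sketch swaps ChromaDB and a real embedding model for an in-memory list and a toy bag-of-words vector, so it shows the retrieval flow rather than the actual storage layer. The `TieredMemory` class, the `embed` function, and the idea that `end_session` receives already-summarized facts are assumptions.

```python
import math
import re
from collections import Counter, deque


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TieredMemory:
    def __init__(self, short_term_turns: int = 6) -> None:
        self.short_term = deque(maxlen=short_term_turns)  # recent turns in RAM
        self.working_summary = ""                         # rolling session summary
        self.long_term = []                               # (fact, vector) pairs

    def add_turn(self, speaker: str, text: str) -> None:
        self.short_term.append(f"{speaker}: {text}")

    def end_session(self, facts) -> None:
        """Store summarized facts and reset the session tiers.

        In a real system a background worker would produce `facts`.
        """
        for fact in facts:
            self.long_term.append((fact, embed(fact)))
        self.short_term.clear()
        self.working_summary = ""

    def recall(self, query: str, k: int = 2):
        """Return the k stored facts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.long_term,
                        key=lambda item: cosine(q, item[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]

    def build_prompt(self, user_text: str) -> str:
        """Assemble retrieved memories, summary, and recent turns for the LLM."""
        parts = ["Relevant memories:", *self.recall(user_text),
                 "Session summary:", self.working_summary,
                 "Recent turns:", *self.short_term,
                 f"user: {user_text}"]
        return "\n".join(parts)


if __name__ == "__main__":
    memory = TieredMemory()
    memory.end_session(["User is reading the book Dune by Frank Herbert",
                        "User prefers the living room lights dimmed at night"])
    print(memory.recall("What was that book I mentioned last week?", k=1))
```

Only the retrieved facts, the summary, and a bounded number of recent turns reach the prompt, which is how the context window stays small no matter how much history accumulates.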
Business Impact
This architecture showed that memory-persistent AI agents can be fronted by extremely constrained, cost-effective edge hardware. By routing STT and TTS through a dedicated local server and maintaining a 5-layer vector memory, JARVIS achieved an end-to-end response latency of under 500 ms, well below the 3- to 5-second round trip typical of cloud-based assistants.
Building IoT hardware that needs real-time AI reasoning?
I build C++ & Python bridges that bring sub-second intelligence to edge devices without relying entirely on slow cloud APIs.
Discuss your edge AI constraints: mythonggg@gmail.com