I have built and abandoned four local voice assistants before getting one that actually works reliably. Not "works in a demo" reliable. "My family uses it daily without complaining" reliable. The difference between those two bars is enormous, and most guides skip straight to the fun parts (model selection, wake word training) while ignoring the boring parts that actually determine success or failure.
Here is what I learned.
The Pipeline Matters More Than Any Single Component
A voice assistant is a pipeline: wake word detection, speech-to-text, intent processing/LLM, text-to-speech, audio output. Every guide focuses on optimizing individual components. The real reliability problems are in the transitions between components.
The wake word fires but the microphone buffer has already overwritten the first half-second of speech. The speech-to-text returns text but the LLM takes 4 seconds to respond, and by then the user has walked away. The text-to-speech generates audio but the speaker was claimed by another process.
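The lost-first-half-second failure has a simple fix: keep a small pre-roll buffer of recent audio at all times, and prepend it to the recording when the wake word fires. A minimal sketch (frame count and durations are illustrative):

```python
from collections import deque

class PreRollBuffer:
    """Hold the most recent audio frames so speech spoken just before
    the wake word fires is not lost when capture starts."""

    def __init__(self, max_frames=25):  # e.g. 25 x 20 ms frames = 0.5 s
        self._frames = deque(maxlen=max_frames)

    def push(self, frame):
        """Called for every incoming audio frame, wake word or not."""
        self._frames.append(frame)

    def drain(self):
        """On wake: return the buffered pre-roll and start fresh."""
        frames = list(self._frames)
        self._frames.clear()
        return frames
```

The recorder then hands `drain()` plus the newly captured frames to speech-to-text, so the utterance starts at its real beginning.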
My reliable setup treats the pipeline as a state machine with explicit transitions, timeouts at every stage, and graceful degradation. If the LLM is slow, it says "let me think about that" immediately instead of sitting in silence. If speech-to-text returns low confidence, it asks for clarification instead of guessing. These are not features - they are requirements for anything you want people to actually use.
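One way to make those per-stage timeouts concrete: run each stage in a worker and switch to a fallback the moment it overruns its budget. A standard-library sketch (stage names and budget values are illustrative; a real version would also fire the "let me think about that" filler utterance from the fallback):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as StageTimeout

# Per-stage budgets in seconds; values here are illustrative.
BUDGETS = {"stt": 5.0, "llm": 10.0, "tts": 5.0}

def run_stage(name, fn, fallback):
    """Run one pipeline stage; if it misses its budget, use the fallback
    instead of leaving the user in silence."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=BUDGETS[name])
    except StageTimeout:
        return fallback()
    finally:
        # Don't block on a hung worker; the watchdog layer reaps it later.
        pool.shutdown(wait=False)
```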
Hardware: Spend Money on the Microphone
I wasted months optimizing speech-to-text models before realizing my microphone was the bottleneck. A cheap USB microphone in a kitchen picks up the dishwasher, the TV, the dog, and then a faint human voice somewhere in the mix. No amount of model optimization fixes bad input.
A decent microphone array (I use the ReSpeaker Mic Array v2.0, about $80) with beamforming and noise suppression transformed my recognition accuracy from maybe 70% to over 95%. The beamforming focuses on the direction of the speaker, and the hardware noise suppression cleans up the signal before it even hits the software stack.
This is the single highest-impact change you can make. Better microphone hardware beats a better speech-to-text model every time.
The Stack That Works
After much iteration, here is what runs reliably:
Wake word: OpenWakeWord. Fast, runs on CPU, customizable. I trained a custom wake word in about 30 minutes using their fine-tuning pipeline. It false-triggers maybe once every two days, which is acceptable.
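Whatever detector you use, gating its raw per-frame score helps keep false triggers down: require the score to cross a threshold, then ignore further hits for a short refractory window so one utterance cannot wake the pipeline twice. A sketch (threshold and window values are my illustrative choices, not OpenWakeWord defaults):

```python
class WakeGate:
    """Turn noisy per-frame wake-word scores into clean fire events."""

    def __init__(self, threshold=0.6, refractory_s=2.0):
        self.threshold = threshold
        self.refractory_s = refractory_s
        self._last_fire = float("-inf")

    def fire(self, score, now):
        """Return True only on a fresh, above-threshold detection.
        `now` is a monotonic timestamp in seconds."""
        if score >= self.threshold and now - self._last_fire >= self.refractory_s:
            self._last_fire = now
            return True
        return False
```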
Speech-to-text: faster-whisper (medium model). The medium model on a decent CPU gives real-time transcription with good accuracy. The large model is marginally better but twice as slow - not worth it for interactive use. I use the medium.en variant since our household is English-only. Dropping multilingual support gives a noticeable speed boost.
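faster-whisper reports per-segment confidence signals (its segment objects carry an `avg_logprob` field, as I read its docs), which is what drives the ask-for-clarification behavior mentioned earlier. A sketch of the gating logic, operating on plain `(text, avg_logprob)` pairs so it stays decoupled from the library (the floor value is an illustrative tuning choice):

```python
def gate_transcript(segments, logprob_floor=-1.0):
    """Join segment text; flag the transcript as low-confidence when the
    length-weighted mean log-probability falls below the floor.

    segments: iterable of (text, avg_logprob) pairs.
    Returns (text, confident) -- when not confident, ask the user for
    clarification instead of guessing.
    """
    segments = list(segments)
    if not segments:
        return "", False
    text = " ".join(t.strip() for t, _ in segments)
    total_chars = sum(len(t) for t, _ in segments) or 1
    mean_lp = sum(lp * len(t) for t, lp in segments) / total_chars
    return text, mean_lp >= logprob_floor
```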
Brain: Local LLM with function calling. I run a quantized 8B model for simple queries (time, weather, timers, lights) and route complex queries to a cloud API. The local model handles 80% of requests with sub-second latency. The cloud fallback handles everything else.
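The local-versus-cloud split can start as a simple keyword gate in front of the function-calling model. A sketch (intent keywords taken from the examples above; the length cutoff is an illustrative heuristic, and a real router might instead let the local model's own tool selection decide):

```python
SIMPLE_INTENTS = ("time", "weather", "timer", "light")

def route(query, max_words=12):
    """Send short household queries to the local 8B model; everything
    else falls through to the cloud API."""
    words = query.lower().split()
    # Substring match so "lights" and "timers" hit their keywords too.
    if len(words) <= max_words and any(k in w for w in words for k in SIMPLE_INTENTS):
        return "local"
    return "cloud"
```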
Text-to-speech: Piper. Fast, runs on CPU, sounds decent. Not as good as cloud TTS, but the latency is 10x better. For a voice assistant, speed of response matters more than voice quality. A slightly robotic answer in 200ms beats a beautiful answer in 2 seconds.
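Piper runs as a plain CLI that reads text on stdin and writes a WAV, which makes it easy to wrap (the flag names follow my reading of its README; the voice file name is an example). The `runner` parameter exists only so the wrapper can be exercised without Piper installed:

```python
import subprocess

def speak(text, voice="en_US-lessac-medium.onnx",
          out_path="/tmp/reply.wav", runner=subprocess.run):
    """Synthesize `text` to a WAV file with the piper CLI."""
    cmd = ["piper", "--model", voice, "--output_file", out_path]
    runner(cmd, input=text.encode("utf-8"), check=True)
    return out_path
```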
Audio: PulseAudio with priority routing. The voice assistant gets audio priority over music, podcasts, and other media. When it speaks, everything else ducks. When it is done, everything resumes. This is the kind of integration detail that makes the experience feel polished.
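On the PulseAudio side, the ducking itself comes from `module-role-ducking`: streams tagged with a trigger `media.role` automatically lower streams tagged with the ducked roles, and restore them when the trigger stream ends. A sketch of loading it via `pactl` (argument names follow the PulseAudio module docs; the role names and volume are my choices, and `runner` is injectable for testing):

```python
import subprocess

def enable_ducking(trigger="announce", ducked="music", volume="30%",
                   runner=subprocess.run):
    """Load PulseAudio's role-ducking module so assistant speech
    (played with media.role=announce) ducks music playback."""
    cmd = ["pactl", "load-module", "module-role-ducking",
           f"trigger_roles={trigger}",
           f"ducking_roles={ducked}",
           f"volume={volume}"]
    runner(cmd, check=True)
    return cmd
```

The assistant's own playback then just tags its stream with the trigger role, e.g. via paplay's `--property=media.role=announce` flag.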
The Reliability Secret: Watchdogs Everywhere
Every component in my pipeline has a watchdog. If speech-to-text hangs for more than 5 seconds, it gets killed and restarted. If the LLM does not respond in 10 seconds, the fallback activates. If PulseAudio loses the output device, it automatically reconnects.
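The watchdog pattern is the same for every stage: the component heartbeats while healthy, a supervisor polls, and a stale heartbeat triggers the restart action. A minimal sketch of the supervisor logic (timeout values per the text; `restart` is whatever callable relaunches the component):

```python
class Watchdog:
    """Restart a component when its heartbeat goes stale."""

    def __init__(self, restart, timeout_s):
        self.restart = restart      # callable that relaunches the component
        self.timeout_s = timeout_s
        self._last_beat = 0.0
        self.restarts = 0

    def beat(self, now):
        """Component calls this (with a monotonic timestamp) while healthy."""
        self._last_beat = now

    def check(self, now):
        """Supervisor polls this; returns True if a restart was triggered."""
        if now - self._last_beat > self.timeout_s:
            self.restart()
            self.restarts += 1
            self._last_beat = now   # give the fresh process a full budget
            return True
        return False
```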
I run a simple health check every 60 seconds that exercises the full pipeline with a synthetic input. If any stage fails, I get a notification and the system attempts self-repair. In three months of operation, the system has self-healed from about a dozen transient failures without any manual intervention.
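The periodic end-to-end check is just the pipeline run on a canned input, reporting the first stage that fails. A sketch (the stage names and notify hook are placeholders):

```python
def health_check(stages, notify=print):
    """Run a synthetic input through each (name, fn) stage in order.
    Returns the name of the first failing stage, or None if all pass."""
    payload = "synthetic test utterance"
    for name, fn in stages:
        try:
            payload = fn(payload)
        except Exception as exc:
            notify(f"health check failed at {name}: {exc}")
            return name
    return None
```

Hooking `notify` up to a phone notification, and mapping each stage name to its restart routine, gives the self-repair loop described above.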
This is the part nobody talks about. Building a demo that works once is easy. Building a service that works every time, recovers from failures, and does not need babysitting is where the real work lives.