Building a Voice Bridge: When the API Lies to You
Project: OpenClaw Voice Plugin
The Goal: Get a "demonic" AI voice speaking through my Raspberry Pi speakers
The Reality: 3 hours of authentication hell and encoding surprises
The Vision
I wanted Squidworth to speak. Not just text responses - actual voice, through speakers, with personality. I'd spent months perfecting the Hades voice persona with Inworld. Deep, slightly menacing, occasionally bemused. Perfect for a cosmic horror assistant.
The plan seemed simple:
- Inworld TTS API → audio
- Raspberry Pi → speakers
- HTTP endpoint → trigger
What could go wrong?
Attempt 1: JWT Authentication
Inworld's docs said "use JWT." So I implemented JWT. Generated tokens, set the audience, the secret, the key... all the standard OAuth2-ish dance.
// The code looked right
const token = generateJWT({
  aud: process.env.INWORLD_JWT_AUD,
  key: process.env.INWORLD_JWT_KEY,
  secret: process.env.INWORLD_JWT_SECRET,
});
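For the curious, generateJWT was just a thin wrapper around token signing. Roughly, and this is a sketch assuming the jsonwebtoken package, with claim names and expiry I chose myself rather than anything from Inworld's docs:
// Hypothetical sketch of the generateJWT helper using jsonwebtoken.
// The claim names (iss, aud) and the one-hour expiry are my assumptions.
const jwt = require('jsonwebtoken');
function generateJWT({ aud, key, secret }) {
  return jwt.sign(
    { iss: key, aud },           // key as issuer, audience from the dashboard
    secret,                      // HMAC signing secret
    { algorithm: 'HS256', expiresIn: '1h' }
  );
}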
Result: 401 Unauthorized
I checked the token. It was valid. I checked the headers. They were correct. I regenerated the credentials. Still failed.
After an hour of debugging, I did what every developer does: tried random variations until something worked.
Attempt 2: Basic Auth (The "Wrong" Way)
Buried in Inworld's dashboard: a "Basic Auth" option. Base64-encode key:secret. No JWT. No token generation. Just a simple header.
const auth = Buffer.from(`${key}:${secret}`).toString('base64');
headers['Authorization'] = `Basic ${auth}`;
Result: 200 OK
It worked. The "legacy" auth method worked when the "modern" JWT approach failed. I still don't know why. The JWT token was valid according to every debugger I tried. But Basic Auth? Instant success.
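For completeness, the working request looked roughly like this (inside an async function). It's a sketch using Node 18's built-in fetch; the endpoint constant, env var names, and request body fields are placeholders, not Inworld's documented API:
// Sketch of the Basic Auth TTS call. INWORLD_TTS_URL and the body fields
// (text, voiceId) are placeholders - check the dashboard for the real ones.
const key = process.env.INWORLD_KEY;
const secret = process.env.INWORLD_SECRET;
const auth = Buffer.from(`${key}:${secret}`).toString('base64');
const response = await fetch(INWORLD_TTS_URL, {
  method: 'POST',
  headers: {
    'Authorization': `Basic ${auth}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ text: 'Hello Joel', voiceId: 'hades' }),
});
const data = await response.json(); // { audioContent: "<base64 MP3>" }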
Lesson #1: Documentation lies. Sometimes the "old" way is the only way.
Attempt 3: The Audio Isn't Audio
With auth working, I expected an MP3 stream. That's what the API docs implied. Raw audio bytes.
Instead, I got JSON:
{
"audioContent": "//NExAAAAANIAAAAAExBTUUzLjEwMKqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq//NExAAAAANIAAAAAExBTUUzLjEwMKqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq"
}
Base64-encoded audio. Embedded in JSON. Not a stream. Not raw bytes. A string. Inside an object.
I spent 20 minutes trying to pipe this "audio" to VLC before realizing I needed to decode it first.
// The fix: decode the base64 payload, then write a real MP3
const fs = require('fs');
const audioBuffer = Buffer.from(response.audioContent, 'base64');
fs.writeFileSync('/tmp/output.mp3', audioBuffer);
Lesson #2: APIs return what they return, not what the docs say they return.
The Voice Bridge Architecture
The final solution was elegant in its simplicity:
HTTP POST (text)
→ Node.js server
→ Inworld TTS API (Basic Auth)
→ Base64 decode
→ Save MP3
→ VLC (cvlc) playback
→ USB speakers
A standalone voice bridge that any system can use. Not tied to OpenClaw. Not tied to any specific integration. Just: POST /speak and hear the voice.
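Trimmed down, the whole bridge fits in one small Express server. This is a sketch under a few assumptions: Express is installed, cvlc is on the PATH, and callInworldTTS is the Basic Auth request from earlier, returning the { audioContent } object:
// Sketch of the voice bridge server. Assumes Express and VLC (cvlc);
// callInworldTTS wraps the Basic Auth request shown above.
const express = require('express');
const fs = require('fs');
const { execFile } = require('child_process');

const app = express();
app.use(express.json());

app.post('/speak', async (req, res) => {
  try {
    const { text } = req.body;
    if (!text) return res.status(400).json({ error: 'text is required' });

    // Inworld TTS (Basic Auth) -> base64-encoded MP3
    const { audioContent } = await callInworldTTS(text);

    // Base64 decode -> save MP3
    const file = '/tmp/voice-bridge.mp3';
    fs.writeFileSync(file, Buffer.from(audioContent, 'base64'));

    // VLC playback through the USB speakers, exiting when the clip ends
    execFile('cvlc', ['--play-and-exit', file], (err) => {
      if (err) console.error('playback failed:', err);
    });

    res.json({ ok: true });
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});

app.listen(3000, () => console.log('voice bridge listening on :3000'));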
The Moment of Truth
1:47 AM. I'd been debugging for hours. Squidworth (my AI assistant) had patiently executed dozens of commands, each one failing slightly differently.
I ran the test:
curl -X POST http://localhost:3000/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello Joel, this is Squidworth"}'
Silence. Then:
"Hello Joel, this is Squidworth."
The Hades voice. Deep. Resonant. Perfect, and just slightly menacing.
I may have yelled "YES!" at 1:47 AM and woken up the neighborhood.
Production vs Personal
Here's the thing: this voice bridge works. It's running as a systemd service. It auto-starts on boot. It's reliable.
But it's not "production code." It doesn't have:
- Proper error recovery
- Rate limiting
- Audio queue management
- Multiple concurrent request handling
It works for me. For my single Pi. For my single voice assistant.
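Some of those gaps are small. Audio queue management, for instance, can start as a promise chain that serializes playback so concurrent requests don't talk over each other. A sketch, not what's running today:
// Hypothetical playback queue: chain each request onto a promise so only
// one cvlc process plays at a time. Not part of the current bridge.
const { execFile } = require('child_process');
let playbackChain = Promise.resolve();
function enqueuePlayback(file) {
  playbackChain = playbackChain.then(
    () => new Promise((resolve) => {
      execFile('cvlc', ['--play-and-exit', file], () => resolve());
    })
  );
  return playbackChain;
}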
But I'm building this as an OpenClaw plugin - something the community will use. So the voice bridge is just a prototype. The real work is making it robust, tested, documented.
The tension: I want to ship. I want to hear Squidworth speak in every room. But this needs to be right, not just working.
What's Next
- STT: Wake word detection + Whisper for hands-free input
- OpenClaw integration: Native channel registration
- Home Assistant: Voice-controlled lights
- Quality: Tests, docs, error handling
The voice works. Now I need to build the system around it.
Next week: Why I'm rewriting perfectly working code to meet "community standards."