A few days ago I added voice input and output to the coding agent I use daily. I expected it to be a convenience thing. What actually happened is that the agent started giving better results. Same model, same prompts, same tools. The only thing I changed was that I was talking instead of typing.
Let me explain what I mean by that. As agents have become more autonomous over the past few months, the way I interact with them has changed too. I'm not giving them step-by-step instructions anymore. Most of the time I'm trying to explain what I'm thinking: my mental model of the problem. Over the weekend I was debugging a websocket reconnection issue, and what I actually wanted to say was "I think this is related to the token refresh changes we made last week, the reconnect fires before the new token comes back and I only see it fail after expiry, not on cold start." That was what I was thinking, but I didn't type all of it. I just wrote "fix the reconnect logic in ws_client" and moved on. When you type, you trim everything down. It's automatic.
With voice I held Space and said the whole thing, everything I was thinking. The agent picked up on the token refresh timing and found the race condition on its own. If I had typed my usual five-word version, it would have looked at the reconnect function in isolation and missed the connection, or maybe it would have fixed it eventually, but only after taking too long or without ever finding the root cause.
There's actually research on why this happens. When you type, you're simultaneously planning what to say, executing the motor actions, watching for typos, and editing as you go. Kellogg's work at Saint Louis University shows that these processes compete for the same working memory resources. Speaking doesn't have that problem: the motor side is essentially automatic by adulthood, so all your working memory goes into the actual thinking. A study comparing spoken and written learning journals found that spoken entries were about four times longer than written ones, and the extra length wasn't filler. People naturally include more of their reasoning instead of trimming it down. When I type a prompt I'm compressing. When I talk I'm thinking out loud.
It's the same pattern with research too. I use the agent to go through papers and technical docs pretty often, and the follow-up is where voice helps most. I'll hold Space and say something like "ok the first paper's approach seems way more practical but I think we could combine it with the caching strategy from the third one, can you sketch out what that would look like in our codebase?" That's maybe 40 words of context I would never type out. But saying it takes five seconds.
There are limits though. There's evidence that forced verbalization can hurt performance on insight problems, the kind where you need to sit with something and let the answer come to you. Voice helps most with analytic, structured work where more context is better. For the "stare at the ceiling" moments, I just type instead. I hold Space when I want to talk, or I type normally. Both work in the same session, no mode switching.
I added voice output too. Initially the agent just spoke while the text sat there on screen, and I caught myself zoning out immediately. So I changed it. Now the words highlight one by one as the agent speaks, synced with the audio. This way I actually follow along. There's a study from 2015 that tested exactly this kind of karaoke-style highlighting and found it improved semantic memory by about 11 percentage points. The key finding is that pacing and synchronization matter more than just "adding audio." I can also interrupt mid-sentence by holding Space while it's talking. It stops, I start talking. Without that it feels like listening to a technical voicemail : )
I think for coding agents, interface design might matter as much as model choice: how much of your thinking actually makes it into the prompt depends a lot on the interface. We just wanted to make it easier for people to say what they're actually thinking.
The agent is called ata. It's open source, built on OpenAI's Codex CLI.
npm install -g @a2a-ai/ata