SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization
Artem Dementyev, Dimitri Kanevsky, Samuel J. Yang, Mathieu Parvaix, Chiong Lai, Alex Olwal
Stop treating captions as a single text stream. Add directional metadata to every utterance. This matters most for accessibility tools in meetings, classrooms, and social settings, where who is speaking carries as much context as what is said.
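A minimal sketch of what per-utterance directional metadata could look like; the record shape and field names here are illustrative assumptions, not SpeechCompass's actual data model:

```python
# Hedged sketch: one possible shape for direction-tagged captions.
# Field names and units are assumptions, not SpeechCompass's API.
from dataclasses import dataclass


@dataclass
class CaptionSegment:
    text: str            # transcribed utterance
    azimuth_deg: float   # estimated direction of arrival; 0 = straight ahead
    speaker_id: int      # diarization label, stable across the session
    t_start: float       # seconds since capture start
    t_end: float


# A renderer could color-code by speaker_id and draw an arrow at azimuth_deg.
seg = CaptionSegment("See you at noon.", azimuth_deg=-42.0, speaker_id=2,
                     t_start=13.8, t_end=14.9)
```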
Group conversations break mobile captioning. A survey of 263 users confirms the core problem: captions neither distinguish speakers nor indicate which direction each voice is coming from.
Method: SpeechCompass uses the multi-microphone arrays already built into phones to localize speech sources in real time, estimating each talker's direction from arrival-time differences between microphones, then renders directional cues on-screen. The system separates speakers visually and adds spatial indicators showing where each voice originates. This is not post-processing; it is live localization that maps acoustic signals to physical directions, letting users track who said what without looking up.
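The standard technique for recovering direction from a microphone pair is time-difference-of-arrival estimation with GCC-PHAT. A minimal sketch, assuming a two-microphone device with 14 cm spacing; this illustrates the general approach, not the paper's exact pipeline:

```python
# Minimal sketch of two-mic direction-of-arrival estimation via GCC-PHAT.
# Mic spacing, sample rate, and function names are assumptions for illustration.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature


def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int, max_tau: float) -> float:
    """Estimate the time delay (seconds) of `sig` relative to `ref`."""
    n = sig.size + ref.size  # zero-pad so circular correlation acts linear
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    # PHAT weighting: keep only phase, discard magnitude (robust to reverb).
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = min(int(fs * max_tau), n // 2)
    # Re-center so index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs


def doa_azimuth(tau: float, mic_distance: float) -> float:
    """Convert a time delay into a bearing (degrees) for one mic pair."""
    sin_theta = np.clip(tau * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))


if __name__ == "__main__":
    fs, d = 16_000, 0.14  # sample rate; assumed 14 cm mic spacing
    rng = np.random.default_rng(0)
    src = rng.standard_normal(fs)   # 1 s of noise as a stand-in for speech
    mic1 = src
    mic2 = np.roll(src, 4)          # 4-sample delay: source off to one side
    tau = gcc_phat(mic2, mic1, fs, max_tau=d / SPEED_OF_SOUND)
    print(f"estimated azimuth: {doa_azimuth(tau, d):.1f} deg")
```

Running this prints an azimuth near 38 degrees, consistent with the 4-sample (0.25 ms) delay injected into the second channel.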
Caveats: Requires devices with multiple microphones in a known geometric arrangement. Performance degrades in noisy or reverberant environments and when speakers overlap.
Reflections: How does accuracy degrade when more than 4-5 speakers are present simultaneously? · Can this approach extend to outdoor environments with wind noise and reflections? · What's the battery impact of continuous multi-microphone processing on mobile devices?