The quality of a user’s initial enrollment directly affects how reliably they’ll be authenticated in future verifications. A voiceprint built from a clean, sufficient recording will perform well across varying conditions — different microphones, ambient noise, and the natural day-to-day variation in how someone speaks. A voiceprint built from poor audio will produce inconsistent results and frustrate users who get incorrectly rejected. This guide covers everything you need to know to capture good enrollment audio from your users.

Minimum audio requirements

Voxmind accepts WAV and MP3 audio files. The recording should contain at least 3 seconds of clean speech; 5 seconds is the sweet spot for voiceprint accuracy. The sample rate should be at least 16 kHz. Most modern recording APIs on web and mobile default to 44.1 kHz or 48 kHz, which is fine; Voxmind will downsample internally. What matters far more than length is the signal-to-noise ratio: three seconds of clean speech in a quiet room produces a better voiceprint than ten seconds of speech over consistent background noise, because the phoneme-frequency extraction pipeline has to work harder to isolate clean phoneme boundaries when noise is present.
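As a quick illustration, the duration and sample-rate minimums above can be checked before upload with a small Python sketch. This inspects a PCM WAV header only; the 3-second and 16 kHz figures come from this guide, and signal-to-noise quality is out of scope for a header check:

```python
import wave

def meets_minimums(path: str) -> bool:
    """Check a PCM WAV file against the documented minimums:
    at least 3 seconds of audio at a sample rate of 16 kHz or higher.
    Noise/SNR quality cannot be judged from the header alone."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return duration >= 3.0 and rate >= 16_000
```

Higher sample rates (44.1 kHz, 48 kHz) pass this check unchanged, since Voxmind downsamples internally.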

Designing your enrollment UX

The enrollment experience matters for two reasons: it affects audio quality (users who understand what you need will speak more naturally and clearly) and it affects completion rates (users who find the process confusing will abandon it).

Tell users what to say. Even though Voxmind is text-independent, users benefit from a prompt. "Please say your full name and confirm today's date" works well: it's natural, generates varied phoneme content, and gives users something specific to focus on rather than feeling like they're talking into the void.

Use a visual indicator to show recording is active. A simple animated waveform or countdown timer signals that the system is listening and processing. Without it, users often speak too softly or stop speaking before the required duration.

Validate before submitting. Record the audio client-side and run a quick client-side check on duration (is it at least 3 seconds?) and amplitude (is there actually speech present?) before you submit to the API. This catches the common failure modes, such as the user not speaking or the recording being too brief, before wasting an API call.

Offer a re-enrollment path. Circumstances change. A user who enrolled on a phone in a quiet environment might need to re-enroll when you build a desktop app. Make it easy to update their voiceprint in your account settings flow.
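The pre-submit validation can be sketched as a pure function over raw samples. This Python version operates on 16-bit mono PCM (in a browser you would apply the same logic to the recorded buffer); the duration floor follows the 3-second API minimum in this guide, while the RMS threshold of 500 is an illustrative assumption, not a Voxmind constant:

```python
import struct

def ready_to_submit(pcm: bytes, sample_rate: int = 16_000,
                    min_seconds: float = 3.0, min_rms: float = 500.0) -> bool:
    """Pre-submit sanity check on 16-bit mono PCM: is the recording long
    enough, and loud enough that speech is plausibly present?
    min_rms is an illustrative threshold, not a documented value."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    if len(samples) / sample_rate < min_seconds:
        return False  # too short: user stopped early or never spoke
    # Root-mean-square amplitude; a near-silent buffer means no speech
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return rms >= min_rms
```

Tune the amplitude threshold against real recordings from your own capture path; microphone gain varies widely across devices.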

Handling the async response

Enrollment returns HTTP 202 (Accepted) immediately and delivers the result to your configured webhook endpoint when processing is complete, typically within 1–3 seconds. Your webhook payload will indicate whether the enrollment was successful and whether the voiceprint quality meets the threshold for reliable verification. If the enrollment quality score is below the minimum threshold — which can happen with very short audio, very noisy recordings, or audio where no clear speech was detected — Voxmind will flag this in the webhook response. Build your flow to handle this gracefully: rather than silently failing, tell the user the enrollment didn’t capture clearly and prompt them to try again.
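A webhook handler for the flow above might branch like this. Note the payload field names (`status`, `quality_ok`) are illustrative assumptions; the exact schema lives in the API reference:

```python
def handle_enrollment_webhook(payload: dict) -> str:
    """Decide the next UX step from an enrollment webhook payload.
    Field names here are illustrative, not a documented schema."""
    if payload.get("status") != "succeeded":
        return "retry"              # processing failed outright
    if not payload.get("quality_ok", False):
        # Below the quality threshold: short, noisy, or no clear speech.
        # Tell the user the enrollment didn't capture clearly; don't fail silently.
        return "retry_low_quality"
    return "done"                   # voiceprint stored and ready for verification
```

Whatever the schema, the key design point is the middle branch: a low-quality enrollment should surface a friendly re-record prompt, not a silent failure.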

Re-enrollment and voiceprint updates

You can submit a new enrollment for an external_id at any time. The new recording will replace the existing voiceprint. There is no concept of accumulating multiple enrollments — each user has a single active voiceprint associated with their external_id in your organisation. This is intentional: maintaining a single current voiceprint keeps the matching model simple and avoids the complexity of managing voiceprint versions. If a user’s voice characteristics change significantly — which is rare but can happen after surgery, illness, or significant aging — re-enrollment resolves it cleanly.
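Because a new enrollment simply replaces the existing voiceprint, re-enrollment is just another enrollment call for the same external_id; there is no separate "update" endpoint to learn. A sketch of assembling that request, where the URL and field names are hypothetical placeholders for illustration:

```python
def build_enrollment_request(external_id: str, audio_path: str) -> dict:
    """Assemble a (hypothetical) enrollment request. Submitting it for an
    external_id that already has a voiceprint replaces that voiceprint;
    the URL and field names here are illustrative, not documented values."""
    return {
        "method": "POST",
        "url": "https://api.voxmind.example/v1/enrollments",  # placeholder URL
        "data": {"external_id": external_id},
        "files": {"audio": audio_path},
    }
```

Your account-settings "update my voiceprint" flow can therefore reuse the exact same code path as first-time enrollment.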

Multi-language enrollment

Voxmind is language-agnostic and text-independent, which means a user can enroll in one language and verify in another with no degradation in accuracy. However, for the best voiceprint quality, it's good practice to have users enroll in the language they're most likely to speak during verification. This is a marginal difference rather than a functional one, but it's worth noting for deployments where users speak multiple languages within your application. See the Language Support guide for the full list of supported languages and ISO codes.