The mobile security architecture
The same rule that applies to web integrations applies here: your Voxmind API token must never be embedded in your app bundle. APK and IPA files can be decompiled. Strings embedded in compiled code, including those stored as environment variables or constants, are recoverable by a determined attacker. If your token appears in the app, it is effectively public.

The correct architecture is identical to the web pattern — your app sends audio to your own backend, and your backend proxies the call to Voxmind with the bearer token attached server-side. The app never touches the Voxmind API directly. For the app-to-backend leg, use your normal authenticated API calls with your own session token or JWT.

The chain of trust is therefore: the user authenticates to your app normally (via whatever method you use for non-voice flows), your app receives a session token, and that session token is what authorises the voice enrollment or verification call to your backend. Voxmind then becomes an additional authentication layer on top of your existing auth system, not a replacement for it.
iOS audio capture
On iOS, the AVAudioEngine and AVAudioRecorder APIs handle microphone input. For voice biometrics, AVAudioEngine gives you more control over the audio pipeline and is the preferred approach for production integrations.
The key iOS-specific settings are the audio session category and the sample rate. You want AVAudioSessionCategoryRecord or AVAudioSessionCategoryPlayAndRecord with AVAudioSessionModeVoiceChat — this mode tells iOS to apply voice-optimised signal processing, including acoustic echo cancellation and noise suppression that is tuned for voice rather than music. The sample rate should be 16kHz for voice biometric capture. iOS natively supports this rate and it gives Voxmind the frequency resolution it needs for reliable phoneme analysis above the telephony-grade 8kHz floor.
Android audio capture
On Android, the AudioRecord API gives you low-level PCM access, or you can use MediaRecorder for a simpler integration that outputs directly to a file in a compressed format. For voice biometrics, AudioRecord with PCM capture is preferred because it gives you control over the audio pipeline and avoids codec-introduced artefacts before the audio reaches Voxmind.
The critical Android-specific parameter is the audioSource. Use MediaRecorder.AudioSource.VOICE_RECOGNITION rather than the default MIC source. The VOICE_RECOGNITION source signals to Android’s audio subsystem that this audio is destined for speech processing — on most devices this disables noise suppression and automatic gain control at the hardware level, which sounds counterintuitive but is correct for voice biometrics. You want the raw voice signal, not a pre-processed one, so that Voxmind’s own processing can operate on clean input.
React Native
If you’re using React Native, the expo-av library (for Expo-managed projects) or react-native-audio-recorder-player (for bare React Native) are the standard choices. Both expose the platform’s native audio APIs under the hood, so the same iOS and Android considerations above apply — you’re just invoking them through a JavaScript bridge.
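As a concrete starting point, the options object below captures the 16kHz mono uncompressed-PCM settings this guide recommends, shaped like the RecordingOptions that expo-av's Audio.Recording accepts. The field names mirror expo-av's documented options but enum fields and exact names vary between expo-av versions, so treat this as a sketch to check against your installed version rather than a drop-in config.

```typescript
// Recording options for voice biometric capture: 16 kHz, mono, uncompressed PCM.
// Shape mirrors expo-av's RecordingOptions; verify field names against the
// expo-av version you have installed before using.
const voiceCaptureOptions = {
  android: {
    extension: ".wav",
    sampleRate: 16000,      // above the 8 kHz telephony floor
    numberOfChannels: 1,    // mono is sufficient for voice biometrics
    bitRate: 256000,        // 16000 samples/s x 16 bits
  },
  ios: {
    extension: ".wav",
    sampleRate: 16000,
    numberOfChannels: 1,
    bitRate: 256000,
    linearPCMBitDepth: 16,       // uncompressed PCM avoids codec artefacts
    linearPCMIsBigEndian: false,
    linearPCMIsFloat: false,
  },
};
```

Passing an explicit options object like this, rather than a library preset, is also what protects you from a preset that silently records at a telephony-grade rate.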
The most important React Native-specific consideration is making sure you’re not capturing through a library that resamples audio to 8kHz before handing it to you. Some older React Native audio libraries default to telephony-grade sample rates. Always verify the sample rate of the audio blob before sending it to your backend — a file that’s labelled as 16kHz but was actually captured at 8kHz is common enough to check for explicitly.
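One way to perform that check is to read the sample rate straight out of the WAV header before upload. The sketch below assumes the canonical RIFF layout, where the fmt chunk sits immediately after the 12-byte RIFF header and the sample rate is therefore the little-endian uint32 at byte offset 24; files with extra chunks before fmt would need a proper chunk walk instead.

```typescript
// Read the sample rate from a canonical WAV (RIFF) header.
// Assumes the fmt chunk directly follows the RIFF header, putting the
// sample rate at byte offset 24 (little-endian uint32).
function wavSampleRate(bytes: Uint8Array): number {
  const ascii = (offset: number, len: number) =>
    String.fromCharCode(...Array.from(bytes.subarray(offset, offset + len)));
  if (bytes.length < 28 || ascii(0, 4) !== "RIFF" || ascii(8, 4) !== "WAVE") {
    throw new Error("not a canonical RIFF/WAVE file");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getUint32(24, true);
}

// Reject captures that came through an 8 kHz telephony-grade path.
function assertVoiceGrade(bytes: Uint8Array): void {
  if (wavSampleRate(bytes) < 16000) {
    throw new Error("audio captured below 16 kHz; re-record before upload");
  }
}
```

Running this client-side, before the upload, saves a round trip compared to letting the backend discover the problem.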
Background noise on mobile
Mobile users authenticate in highly variable environments — walking down the street, in a coffee shop, on public transport, in a car with road noise. This is the single biggest quality challenge for mobile voice authentication. You cannot control the environment, so you have to handle it in your application logic.

The most important mitigation is real-time audio level feedback during capture. Show the user a volume indicator while they’re recording. If the level is low, show a prompt: “Speak closer to your phone.” If the level is high and variable — suggesting significant background noise — consider showing a warning: “We’re picking up a lot of background noise. For best results, try in a quieter location.” This doesn’t prevent the user from proceeding, but it sets expectations and reduces frustration when verification fails.

Voxmind’s multi-stage noise pipeline handles compound noise well — the combination of spectral subtraction, phone-specific normalisations, and XLSR-300M’s exposure to diverse audio conditions during training means it performs significantly better in noisy environments than traditional MFCC-based approaches. But no system performs as well on a construction site as in a quiet room. The inconclusive result is your signal that the audio quality was insufficient for a reliable determination — handle it as a prompt to retry rather than as a failure.
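The level-feedback heuristic can be sketched as a small classifier over per-frame RMS levels: a low average level maps to the "speak closer" prompt, and a high frame-to-frame variation maps to the background-noise warning. The thresholds here are illustrative placeholders to tune against real captures on your target devices, not values from Voxmind.

```typescript
type CaptureFeedback = "ok" | "too_quiet" | "noisy";

// Illustrative thresholds for 16-bit PCM; tune against real captures.
const MIN_MEAN_RMS = 500;   // below this, prompt "Speak closer to your phone"
const MAX_RMS_SPREAD = 0.8; // coefficient of variation above this suggests noise

function frameRms(frame: Int16Array): number {
  let sumSquares = 0;
  for (const sample of frame) sumSquares += sample * sample;
  return Math.sqrt(sumSquares / frame.length);
}

function assessCapture(frames: Int16Array[]): CaptureFeedback {
  const levels = frames.map(frameRms);
  const mean = levels.reduce((a, b) => a + b, 0) / levels.length;
  if (mean < MIN_MEAN_RMS) return "too_quiet";
  // A level that swings widely between frames points to intermittent
  // background noise rather than steady speech.
  const variance =
    levels.reduce((a, b) => a + (b - mean) ** 2, 0) / levels.length;
  const spread = Math.sqrt(variance) / mean;
  return spread > MAX_RMS_SPREAD ? "noisy" : "ok";
}
```

Run this on short frames (say, 10ms at 16kHz, or 160 samples) as they come off the recorder, and drive the volume indicator and warning prompts from the result.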
Offline and low-connectivity considerations
Mobile networks are unreliable. A verification call that takes 2 seconds on wifi might time out on a poor 3G connection, or the HTTP request might fail entirely mid-upload. Build retry logic into your mobile client for audio upload failures. A reasonable retry strategy: attempt the upload, wait up to 15 seconds for a response, and if it fails or times out, retry once with a fresh recording prompt rather than resending the original audio. Stale audio from a retry of a failed attempt is worse than a fresh sample from a new recording — the user may have moved to a different environment, and the act of asking them to try again often produces better audio anyway.

If your app has offline or low-connectivity use cases, voice authentication will need a fallback path. Design your authentication flow with the assumption that any given verification attempt may fail to reach the server, and make the fallback path (PIN, biometric, OTP) easily accessible without treating it as an error state.
Platform-specific permission patterns
Both iOS and Android require microphone permission to be requested at runtime, but the UX patterns differ between platforms. On iOS, you get exactly one chance to show the native permission prompt. If the user denies it, you cannot show it again — they have to manually re-enable microphone access in the Settings app. This means you should make your pre-permission explanation as clear and compelling as possible before triggering the system prompt. A brief in-app screen that explains why voice authentication needs the microphone, shown immediately before the system prompt, meaningfully increases acceptance rates. If the user has previously denied permission, detect this state using AVAudioSession.recordPermission and show a custom UI that deep-links them directly to your app’s settings page.
On Android, users can deny permission without permanently revoking it, and you can re-request after explaining why you need it (once — repeated requests are blocked after two denials). Use shouldShowRequestPermissionRationale() to determine whether to show an explanation before re-requesting.
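One way to keep the two platforms' flows straight is a single decision function that maps the current permission state to the next UI step. The state and step names below are made up for illustration, not platform APIs: "deniedPermanently" corresponds to shouldShowRequestPermissionRationale() returning false after a previous denial on Android, and to AVAudioSession.recordPermission reporting denied on iOS.

```typescript
// Illustrative state names, not platform APIs: map them from the native
// permission checks described above.
type MicPermission =
  | "granted"
  | "undetermined"        // never asked
  | "denied"              // Android only: may re-request after a rationale
  | "deniedPermanently";  // iOS after one denial, Android after repeated denials

type UiStep =
  | "proceed"             // start capture
  | "explainThenRequest"  // show the in-app rationale, then the system prompt
  | "deepLinkToSettings"; // user must re-enable in system settings

function nextStep(state: MicPermission): UiStep {
  switch (state) {
    case "granted":
      return "proceed";
    case "undetermined":
    case "denied":
      // Always show the in-app explanation before the system prompt:
      // it meaningfully raises acceptance rates.
      return "explainThenRequest";
    case "deniedPermanently":
      return "deepLinkToSettings";
  }
}
```

Centralising this decision keeps the iOS one-shot prompt and the Android rationale loop from leaking into your screen components.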
In both cases, the in-app explanation before the system prompt is the same message: voice authentication requires the microphone, it’s only used for authentication, and no audio is stored beyond the processing required to create the voiceprint.

