Of all the things Voxmind does, deepfake detection is the capability most developers want to understand deeply before they put it in front of real users. This guide explains how it works, what it catches, what the numbers mean, and how to build your application logic around it correctly.

Why deepfake detection is now non-negotiable

Until around 2021, “voice spoofing” mostly meant one thing: someone recorded your voice and played it back to fool a biometric system. Replay attacks are relatively straightforward to detect: recorded audio has characteristic compression artefacts, microphone-room response signatures, and other tell-tale marks that a trained model can identify.

The threat landscape changed when high-quality neural voice cloning became accessible. Tools like XTTS, Tortoise TTS, and dozens of commercially available services can now generate a convincing voice clone from as little as 3 seconds of source audio, in real time, for free. The resulting synthetic audio doesn’t have the artefacts of a replay attack: it’s freshly generated, at high quality, and acoustically similar enough to the target voice to fool both human listeners and traditional voice biometric systems that weren’t designed with this threat in mind.

The FBI’s IC3 unit has documented a sharp increase in voice-based social engineering attacks using AI-generated audio, particularly targeting contact centres and financial institutions where voice is used as an authentication factor. Voxmind was built knowing this threat exists. Deepfake detection isn’t a feature we added later: it runs on every single verification call, automatically, with no additional integration work required on your part.

What Voxmind actually detects

Voxmind’s deepfake detection catches three distinct categories of attack, and it’s worth understanding each one.

AI voice clones are the primary modern threat. These are voices generated by neural text-to-speech or voice conversion models that have been conditioned on samples of the target user’s voice. The generator models, typically based on diffusion, GAN, or autoregressive architectures, learn to reproduce the acoustic surface of the target voice but cannot replicate the underlying biomechanical relationships that Voxmind’s phoneme analysis measures. That’s the voiceprint mismatch side. On the deepfake detection side, neural-generated audio carries statistical signatures in the frequency domain that are distinct from human-produced speech: subtle but consistent artefacts in how spectral energy is distributed across frames. Voxmind’s AASIST model is specifically trained to identify these signatures across a wide range of synthesis architectures.

Replay attacks involve recording authentic audio from the target user (from a phone call, a public video, a voicemail) and playing it back during a verification attempt. Replay attacks produce a different set of artefacts: the acoustic fingerprint of the recording device and playback environment, slight temporal smearing from digital-to-analogue and analogue-to-digital conversion, and characteristic room impulse responses. These are well-understood signals that the detection model identifies reliably.

Voice conversion attacks are somewhere in between: a live human voice is run through a real-time conversion model that shifts its characteristics toward the target’s voice. This is technically more demanding for an attacker and produces a third distinct artefact profile: the residual characteristics of the source voice bleed through the conversion, and the spectral boundaries between phonemes have a characteristic smoothness that differs from natural speech.

The technology: AASIST

Voxmind’s deepfake detection is built on AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks), a state-of-the-art anti-spoofing architecture that has set leading results on the ASVspoof benchmarks, the reference academic evaluation for voice anti-spoofing. It’s worth understanding why AASIST outperforms simpler approaches, because the reason is directly connected to why deepfake audio is hard to detect in the first place.

The core insight of AASIST is that the artefacts left by synthetic audio are not localised in either time or frequency alone. A spectrogram of synthetic speech might look convincing in any given short window: the frequency content is right, the energy distribution looks natural. But the relationships between spectral and temporal patterns across the full audio signal tell a different story. Human speech has complex dependencies between what’s happening at different time points and different frequency bands simultaneously. Neural synthesis models approximate these dependencies, but not perfectly. AASIST models these relationships using a graph attention network where nodes represent different spectro-temporal regions of the audio and edges represent learned relationships between them. The model learns which relationships are diagnostic of authentic versus synthetic speech and attends to them accordingly.

This makes it substantially more robust to the kind of adversarial optimisation that can fool simpler classifiers: an attacker who optimises to defeat a frequency-domain classifier can inadvertently fix the temporal artefacts while introducing new graph-level artefacts that AASIST catches. The practical result is that AASIST generalises well to voice cloning architectures it wasn’t explicitly trained on, which matters because the synthesis model landscape is evolving rapidly. You don’t want a deepfake detector that only works against last year’s cloning tools.

The numbers: what a sub-0.1% false positive rate means in practice

Voxmind’s false positive rate, the rate at which genuine human speech is incorrectly flagged as synthetic, is under 0.1%. This is the number that matters most for your integration design, so let’s unpack what it actually means for your users.

A false positive rate of 0.1% means that in every 1,000 legitimate verification attempts by real, enrolled users, fewer than 1 will be incorrectly flagged as a deepfake. At a contact centre processing 50,000 calls per month, that’s fewer than 50 false deepfake flags across the entire month, about 1–2 per day, all of which can be recovered through a fallback authentication path. Compare this to false positive rates in the 1–5% range that are common in less specialised approaches: at 1%, that same 50,000-call contact centre would generate 500 false flags per month, which is a meaningful customer experience problem. At sub-0.1%, it’s a manageable edge case rather than a systematic friction point.

The false positive rate was measured against a diverse test set spanning multiple languages, microphone types (mobile, landline, VoIP, headset), acoustic environments (office, home, outdoor, IVR), and demographic groups including speakers with accents, older voices, and voices affected by illness. The number is not cherry-picked from ideal lab conditions.
The false positive rate quoted here is the rate for the deepfake detection check specifically. The overall verification false reject rate — cases where a genuine enrolled user fails verification for any reason including voiceprint mismatch — is a separate metric that depends on your configured match score threshold. See Enrollment Best Practices for guidance on threshold configuration.
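The volume arithmetic above is worth making concrete. A minimal sketch; `expectedFalseFlags` is a hypothetical helper for illustration, not part of any Voxmind SDK:

```javascript
// Expected false deepfake flags for a given monthly call volume and
// false positive rate (expressed as a fraction, e.g. 0.001 for 0.1%).
function expectedFalseFlags(monthlyCalls, falsePositiveRate) {
  return monthlyCalls * falsePositiveRate;
}

// 50,000 calls/month at a 0.1% FPR:
console.log(expectedFalseFlags(50000, 0.001)); // 50 per month
// The same volume at a 1% FPR:
console.log(expectedFalseFlags(50000, 0.01));  // 500 per month
// Per-day figure at 0.1%:
console.log(expectedFalseFlags(50000, 0.001) / 30); // roughly 1.7 per day
```

Running the same calculation against your own call volume is a quick way to size the fallback authentication path you will need.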

How the parallel pipeline works

A key design decision in Voxmind is that deepfake detection runs in parallel with voiceprint matching, not sequentially. This has two implications worth understanding.

The first is latency: you don’t pay an additive time cost for deepfake detection. The total verification time (under 2 seconds) covers both checks running simultaneously on your audio. There’s no “deepfake detection mode” to enable; it’s always on, always parallel, no trade-off.

The second is independence: the two checks can disagree, and both signals matter. The most important case is when a voice clone produces a non-trivial match score but deepfake detection fires: the attacker’s clone is acoustically similar enough to the target to produce a partial voiceprint match, but the AASIST model identifies the audio as synthetic. This is precisely the attack scenario where a deepfake detection check is essential, because a match-score-only system might pass this attempt. Voxmind’s design treats a positive deepfake detection as a definitive rejection regardless of the match score. The other case worth noting is when deepfake detection is clean (deepfake_detected: false) but the match score is below your threshold. This is a normal failed verification: the person speaking is genuinely human, just not the enrolled user. Handle it as a standard authentication failure rather than a security event.
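Because the two signals are independent, there are four possible combinations. A minimal sketch of the resulting decision table; `classify` and its return values are hypothetical names for illustration:

```javascript
// Decision table for the two independent signals:
// deepfake_detected | match above threshold | outcome
// ------------------+-----------------------+----------------------------
// true              | true or false         | reject + log security event
// false             | true                  | approve
// false             | false                 | standard auth failure
function classify(deepfakeDetected, matchAboveThreshold) {
  // Synthetic audio trumps any match score, however high.
  if (deepfakeDetected) return 'SECURITY_REJECT';
  return matchAboveThreshold ? 'APPROVE' : 'AUTH_FAILURE';
}
```

Note that the top row collapses two combinations into one outcome: once the audio is flagged as synthetic, the match score only matters for logging, never for the decision.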

Building your application logic correctly

The webhook result gives you three signals: result (verified/rejected/inconclusive), match_score (0.0–1.0), and deepfake_detected (boolean). The right decision tree for your application logic is straightforward, but it’s worth being explicit about the priority order. The deepfake flag should be evaluated first and should be treated as absolute. If deepfake_detected is true, the verification attempt is rejected, full stop. Don’t apply the match score threshold. Don’t offer a retry. Log it as a security event with the request_uuid, external_id, and timestamp. Regardless of how high the match score was, the audio was synthetic, and that is a red flag that warrants investigation rather than a fallback authentication path.
// Correct priority order in your webhook handler
async function handleVerificationResult(result) {
  const { request_uuid, external_id, result: outcome, match_score, deepfake_detected } = result;

  // Deepfake check is always evaluated first — no exceptions
  if (deepfake_detected) {
    await logSecurityEvent({
      type: 'DEEPFAKE_DETECTED',
      external_id,
      request_uuid,
      match_score,    // Log this too — a high score here is especially notable
      timestamp: new Date().toISOString()
    });
    
    // Reject. Don't offer voice retry. Offer alternative auth or human review.
    return { decision: 'REJECTED', reason: 'synthetic_audio_detected' };
  }

  // Only evaluate match score once deepfake check is clean
  if (outcome === 'verified' && match_score >= YOUR_THRESHOLD) {
    return { decision: 'APPROVED' };
  }
  
  if (outcome === 'inconclusive') {
    // Audio quality too low for reliable determination
    // Safe to offer one retry with a request for clearer audio
    return { decision: 'RETRY', reason: 'insufficient_audio_quality' };
  }

  // Clean audio, real human, wrong voice — standard auth failure
  return { decision: 'REJECTED', reason: 'voice_mismatch' };
}
Notice the separation between DEEPFAKE_DETECTED and standard voice_mismatch rejections. These are fundamentally different events from a security operations perspective. A voice mismatch might be a genuine user having a bad connection — annoying but benign. A deepfake detection is a potential fraud attempt and should trigger different downstream logic: account flag, fraud team notification, potentially a temporary account lock depending on your risk policy.
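One way that downstream split might look in practice. This is a sketch under stated assumptions: `flagAccount`, `notifyFraudTeam`, `lockAccount`, and `offerFallbackAuth` are hypothetical stand-ins for your own fraud and account systems, and the `riskPolicy` trigger is an illustrative policy choice, not a Voxmind requirement:

```javascript
// Stubs standing in for your own fraud/account systems (hypothetical).
const events = [];
const flagAccount = async (id, flag) => events.push(['flag', id, flag]);
const notifyFraudTeam = async (payload) => events.push(['notify', payload.externalId]);
const lockAccount = async (id) => events.push(['lock', id]);
const offerFallbackAuth = async (id) => events.push(['fallback', id]);

// Route the two rejection reasons to different downstream handling.
async function handleRejection(reason, externalId, context = {}) {
  if (reason === 'synthetic_audio_detected') {
    // Potential fraud attempt: escalate, never offer a voice retry.
    await flagAccount(externalId, 'possible_deepfake_attempt');
    await notifyFraudTeam({ externalId, ...context });
    if (context.riskPolicy === 'strict') await lockAccount(externalId);
    return;
  }
  // 'voice_mismatch': benign failure, so offer an alternative factor.
  await offerFallbackAuth(externalId);
}
```

The point of the sketch is the branch, not the helpers: the deepfake path ends in escalation, while the mismatch path ends in a normal fallback authentication flow.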

Choosing the right match score threshold

Your match score threshold is a dial between security and convenience. Higher thresholds mean fewer false accepts (better security) but more false rejects (more friction for legitimate users). The right number depends on your use case.

For contact centre authentication replacing a knowledge-based question, a threshold of 0.80–0.85 is typically appropriate. The security improvement over KBA is dramatic regardless of where you set it in this range, and the false reject rate at 0.80 is low enough that the overall authentication experience is meaningfully better than what it replaces.

For step-up authentication on high-value transactions (a large bank transfer, account recovery, changing contact details) a threshold of 0.90–0.92 is more appropriate. Legitimate users making high-stakes requests are typically in a calmer environment with better audio conditions, so the higher threshold has less impact on genuine users while materially raising the bar for an attacker.

For continuous authentication use cases, where verification is running periodically throughout a session, a lower threshold like 0.75 makes sense: false rejects mid-session are very disruptive, and the continuous nature of the monitoring means a single low-confidence frame doesn’t determine the outcome. Your application logic should look at the trend across multiple verifications rather than any single result.
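A sketch of how these recommendations might be encoded, plus a trend check for the continuous case. The threshold values come from the guidance above; the `THRESHOLDS` object, the `trendPasses` helper, and the moving-average window of 5 are illustrative assumptions, not Voxmind defaults:

```javascript
// Illustrative thresholds per use case, taken from the guidance above.
const THRESHOLDS = {
  contact_centre: 0.80,  // replacing knowledge-based questions
  step_up: 0.90,         // high-value transactions, account recovery
  continuous: 0.75       // periodic in-session checks
};

// For continuous authentication, judge the trend across recent
// verifications rather than any single result. The window size of 5
// is an arbitrary illustrative choice.
function trendPasses(recentScores, threshold, window = 5) {
  const recent = recentScores.slice(-window);
  const avg = recent.reduce((sum, s) => sum + s, 0) / recent.length;
  return avg >= threshold;
}
```

With this shape, one noisy low-confidence result (say a 0.70 in a run of 0.85s) doesn't interrupt the session, while a sustained slide below the threshold still fails the check.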

What deepfake detection doesn’t cover

Being clear about the boundaries of any security system is more useful than overstating its capabilities. Voxmind’s deepfake detection is designed for audio that has been synthetically generated or replayed. It doesn’t solve social engineering attacks where an attacker convinces a legitimate user to authenticate on their behalf; no biometric can. It also doesn’t protect against scenarios where the attacker has somehow compromised the audio channel between the user’s device and your server, bypassing the capture stage entirely. Standard transport security (TLS, certificate pinning on mobile) covers that attack surface separately.

The detection model is retrained periodically as new synthesis architectures emerge. The AASIST approach generalises well, but no detection system is infinitely future-proof against novel synthesis techniques. Voxmind monitors emerging synthesis models and updates the detection model proactively. If you need advance notice when model updates ship, subscribe to the changelog at docs.voxmind.ai/resources/changelog.

Frequently asked questions

“What happens if a user genuinely has a very unusual voice that the model hasn’t seen before?” Unusual voices (distinctive accents, speech impediments, very high or low fundamental frequency) affect the voiceprint matching model, not the deepfake detection model. The deepfake detector is evaluating properties of how the audio was generated, not what the voice sounds like. An unusual voice is just as reliably identified as authentic human speech.

“Can a sophisticated attacker defeat the detection if they know Voxmind is in use?” The AASIST model is trained on a broad distribution of synthesis artefacts, including scenarios where the attacker is aware of and trying to defeat detection. Adversarial optimisation against frequency-domain features tends to fix those features while introducing artefacts at the graph-relational level. This is an active area of research, and Voxmind’s model is updated accordingly. The key point is that the bar for defeating the system is substantially higher than for defeating a legacy voice biometric, and the combination of voiceprint matching and deepfake detection means an attacker needs to beat both simultaneously.

“Should I tell users that deepfake detection is running?” Yes, for compliance reasons in most regulated industries. Something simple like “this call may use voice biometrics for authentication and fraud detection” is standard practice and increasingly required by regulation. It also has a deterrent effect: knowing that synthetic audio detection is active discourages lower-sophistication attacks.