Abstract:
This paper proposes an audio-visual deepfake detection approach that aims to
capture fine-grained temporal inconsistencies between audio and visual
modalities. To achieve this, both architectural and data synthesis strategies
are introduced. From an architectural perspective, a temporal distance map,
coupled with an attention mechanism, is designed to capture these
inconsistencies while minimizing the impact of irrelevant temporal
subsequences. Moreover, we explore novel pseudo-fake generation techniques to
synthesize local inconsistencies. Our approach is evaluated against
state-of-the-art methods on the DFDC and FakeAVCeleb datasets, demonstrating
its effectiveness in detecting audio-visual deepfakes.
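
The abstract does not include implementation details, but as a rough illustration of the core idea, the PyTorch sketch below (all names and design choices are assumptions, not the authors' architecture) builds a pairwise audio-visual distance map over time and learns an attention map over it to suppress irrelevant temporal regions before pooling the map into a real/fake prediction.

```python
import torch
import torch.nn as nn

class TemporalDistanceAttention(nn.Module):
    """Hypothetical sketch of a temporal distance map with attention pooling.

    Computes a T x T map of pairwise audio-visual feature distances, then
    learns an attention map that downweights irrelevant temporal
    subsequences before pooling the map into a real/fake logit.
    """

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)  # project audio into a shared space
        self.proj_v = nn.Linear(dim, dim)  # project video into a shared space
        self.score = nn.Sequential(        # attention scores over the T x T map
            nn.Conv2d(1, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )
        self.classifier = nn.Linear(1, 1)  # logit from the pooled distance

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (B, T, dim) per-frame embeddings, temporally aligned
        dist = torch.cdist(self.proj_a(audio), self.proj_v(visual))  # (B, T, T)
        attn = self.score(dist.unsqueeze(1))                         # (B, 1, T, T)
        attn = attn.flatten(2).softmax(dim=-1).view_as(dist.unsqueeze(1))
        pooled = (attn.squeeze(1) * dist).sum(dim=(1, 2))            # (B,)
        return self.classifier(pooled.unsqueeze(-1)).squeeze(-1)     # (B,) logits


# Example: two 50-frame clips with 128-d audio and visual embeddings
audio = torch.randn(2, 50, 128)
visual = torch.randn(2, 50, 128)
logits = TemporalDistanceAttention(dim=128)(audio, visual)  # shape (2,)
```

In this reading, large attended distances signal local audio-visual mismatch; the attention map plays the role the abstract assigns to the attention mechanism, namely limiting the influence of irrelevant temporal subsequences on the final decision.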