In this paper, we propose an enhanced audio-visual deepfake detection method.
Recent methods in audio-visual deepfake detection mostly assess the
synchronization between audio and visual features. Although they have shown
promising results, they are based on the maximization/minimization of isolated
feature distances without considering feature statistics. Moreover, they rely
on cumbersome deep learning architectures and are heavily dependent on
empirically fixed hyperparameters. Herein, to overcome these limitations, we
propose: (1) a statistical feature loss to enhance the discrimination
capability of the model instead of relying solely on feature distances
(sketched below); (2) the use of the waveform to describe the audio, as a
replacement for frequency-based representations; (3) a post-processing
normalization of the fakeness score (also sketched below); (4) the use of a
shallower network to reduce the computational complexity.
Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of
the proposed method.
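
To make contribution (1) concrete, the following is a minimal, hedged sketch of a statistics-aware feature loss. The record does not give the paper's exact formulation; the sketch merely assumes the loss contrasts batch-level statistics (mean and standard deviation) of audio-visual feature distances between real and fake samples, rather than pushing each distance in isolation. The function name `stat_feature_loss` and the `margin` parameter are hypothetical.

```python
import torch

def stat_feature_loss(real_dists: torch.Tensor,
                      fake_dists: torch.Tensor,
                      margin: float = 1.0) -> torch.Tensor:
    # real_dists / fake_dists: per-sample audio-visual feature distances,
    # e.g. torch.norm(f_audio - f_visual, dim=1), for real and fake clips.
    mu_real, mu_fake = real_dists.mean(), fake_dists.mean()
    sd_real, sd_fake = real_dists.std(), fake_dists.std()
    # Separate the two distance distributions at the statistics level:
    # fake distances should exceed real ones by at least `margin` on average.
    mean_term = torch.relu(margin + mu_real - mu_fake)
    # Keep both distributions tight so the classes remain separable.
    spread_term = sd_real + sd_fake
    return mean_term + spread_term

# Toy usage with random distances for a batch of 8 real and 8 fake clips.
loss = stat_feature_loss(torch.rand(8), torch.rand(8) + 0.5)
```

Operating on distribution statistics rather than isolated pairs is what distinguishes such a loss from the plain contrastive distance objectives the abstract criticizes.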
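Similarly, contribution (3) mentions a post-processing normalization of the fakeness score without specifying it. Below is one plausible reading, assuming a per-video min-max rescaling that maps raw scores to a common [0, 1] range; the helper `normalize_scores` is an illustrative assumption, not the authors' exact procedure.

```python
import torch

def normalize_scores(scores: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Hypothetical post-processing: rescale raw fakeness scores (e.g. one
    # score per frame or per clip of a video) to [0, 1] via min-max.
    lo, hi = scores.min(), scores.max()
    if hi - lo < eps:  # degenerate case: all scores identical
        return torch.zeros_like(scores)
    return (scores - lo) / (hi - lo)

# Toy usage: normalize per-frame fakeness scores of a single video.
normalized = normalize_scores(torch.tensor([0.2, 0.9, 0.4, 0.7]))
```

A normalization of this kind makes a fixed decision threshold transferable across videos whose raw score ranges differ; other monotone mappings (e.g. z-scoring) would serve the same purpose.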
Research center:
ULHPC - University of Luxembourg: High Performance Computing
Disciplines:
Computer science
Author, co-author:
ASTRID, Marcella ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > CVI2
GHORBEL, Enjie ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > CVI2 > Team Djamila AOUADA
AOUADA, Djamila ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > CVI2
External co-authors:
no
Language:
English
Title:
Statistics-aware Audio-visual Deepfake Detector
Publication date:
October 2024
Event name:
IEEE International Conference on Image Processing (ICIP 2024)