Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

arXiv, 2026


We propose Audio-Contrastive Preference Optimization (ACPO), a dual-axis preference learning framework that forces audio-visual language models to faithfully ground their responses in the actual audio signal rather than exploiting visual shortcuts.

Abstract

While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.

Video

5-minute supplementary video with an overview of ACPO and qualitative examples

Motivation


Cross-modal hallucination in AVLMs is an asymmetric phenomenon driven by visual dominance. AVLMs systematically default to visual priors: audio tokens receive disproportionately low attention weights during decoding, and models are notably more prone to hallucination on audio-focused tasks than on visual ones. Feeding more video frames into AVLMs actively triggers more audio hallucinations, while adding audio causes no such degradation on visual QA tasks.

ACPO Framework


(a) Multimodal Data Curation: joint audio-visual captions are decomposed into modality-specific visual and audio targets, and audio-swapped inputs are constructed by replacing the original audio track with a mismatched one. (b) Preference Pair Construction: Audio-attribution pairs (left) use the swapped input to penalize visually-driven responses to audio-focused prompts, preferring the true audio description over the visual one. Audio-sensitivity pairs (right) penalize audio-invariant predictions by preferring the original audio-visual caption under aligned audio over the same caption under mismatched audio.
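The two preference axes described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the field names, the prompt string, the helper `build_acpo_pairs`, and the `beta` value are all assumptions, and the loss shown is the standard DPO log-sigmoid objective that a preference-learning framework like this would plausibly instantiate over each pair.

```python
import math

def dpo_term(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss term on (chosen, rejected) sequence log-probabilities,
    computed under the policy and a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

def build_acpo_pairs(sample):
    """Decompose one curated sample into the two ACPO-style preference pairs.

    `sample` is a hypothetical dict with:
      video, audio, audio_swapped : raw inputs (audio_swapped is mismatched)
      audio_caption               : modality-specific audio target
      visual_caption              : visual description of the same clip
      av_caption                  : joint audio-visual caption
    """
    # Axis 1 - audio attribution (output-contrastive): for an audio-focused
    # prompt, prefer the true audio description over the visual description
    # masquerading as an audio fact. The pair differs in the *output*.
    attribution = {
        "input": (sample["video"], sample["audio"]),
        "prompt": "Describe what you hear.",
        "chosen": sample["audio_caption"],
        "rejected": sample["visual_caption"],
    }
    # Axis 2 - audio sensitivity (input-contrastive): the same caption is
    # preferred under the aligned audio track over the swapped one, so a
    # model that ignores audio cannot satisfy both. The pair differs in
    # the *input*.
    sensitivity = {
        "chosen_input": (sample["video"], sample["audio"]),
        "rejected_input": (sample["video"], sample["audio_swapped"]),
        "response": sample["av_caption"],
    }
    return attribution, sensitivity
```

Under this objective, the attribution pair directly penalizes visually-driven responses, while the sensitivity pair penalizes generation that is invariant to swapping the audio track, since the loss falls only when the response is more likely under the aligned input than under the mismatched one.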

Qualitative Results

See the supplementary video above to watch/listen to these examples.

Video-driven audio hallucination. Each row shows a video clip, its corresponding audio waveform with labeled sound events, and model responses to an audio-focused yes/no question. In all three cases, the audio contains no evidence of the queried sound, yet all baselines hallucinate affirmative responses. ACPO (Ours) correctly grounds its response in the audio signal.

Audio-focused captioning. Each row shows a video clip with its labeled audio waveform, a reference audio caption, and model-generated audio captions. (a) The video depicts cows on a hillside, and the audio contains a man speaking and birds chirping. All baselines produce captions grounded in the visual scene, failing to describe the actual audio content. ACPO correctly identifies both auditory events. (b) The video shows frogs, but the audio contains a woman speaking and a cat meowing. All baselines hallucinate a frog croaking. ACPO alone correctly describes what is heard.