An environment acoustic model represents how sound is transformed by the
physical characteristics of an indoor environment, for any given source/receiver
location. Whereas traditional methods for constructing such models assume dense
geometry and/or sound measurements throughout the environment, we explore
how to infer room impulse responses (RIRs) based on a sparse set of images and
echoes observed in the space, as well as how to choose where to collect these
audio-visual observations. Towards that goal, we first introduce a transformer-based
method that uses self-attention to build a rich acoustic context, then infers the
RIRs of arbitrary query source-receiver locations through cross-attention. Then,
motivated by real-world physical constraints in collecting these observations, we
further introduce active acoustic sampling, a new task in which a mobile agent
jointly constructs the environment acoustic model and spatial occupancy map on
the fly from sparse audio-visual observations. We train a reinforcement learning
(RL) policy that guides agent navigation toward optimal acoustic data sampling
positions, rewarding information gain for the full environment model. Evaluated
on diverse unseen 3D indoor environments, our method outperforms the state of
the art and, in a major departure from traditional methods, generalizes to novel
environments in a few-shot manner. Furthermore, when augmented with our active
sampling policy, it successfully guides an embodied agent to acoustically
informative positions under real-world exploration constraints, outperforming
both traditional navigation agents and prior acoustic rendering methods.
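
To make the attention-based inference step described above concrete, below is a
minimal sketch of how sparse audio-visual observation embeddings could be fused
by self-attention and a query source-receiver pose decoded into an RIR via
cross-attention. Every name and dimension here (RIRQueryTransformer, the 6-D
pose input, d_model, the linear RIR head) is an illustrative assumption, not the
paper's actual architecture.

# Illustrative sketch only, not the authors' code: self-attention builds an
# acoustic context from sparse observations; cross-attention decodes a query
# source-receiver pose into a room impulse response (RIR).
import torch
import torch.nn as nn

class RIRQueryTransformer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, rir_len=4096):
        super().__init__()
        # Self-attention over the sparse image/echo observation embeddings
        # builds a shared acoustic context for the environment.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Cross-attention: the embedded query pose attends to that context.
        self.cross_attn = nn.MultiheadAttention(embed_dim=d_model,
                                                num_heads=n_heads,
                                                batch_first=True)
        self.pose_embed = nn.Linear(6, d_model)   # (source xyz, receiver xyz)
        self.rir_head = nn.Linear(d_model, rir_len)

    def forward(self, obs_tokens, query_pose):
        # obs_tokens: (B, N, d_model) embeddings of N audio-visual observations
        # query_pose: (B, 6) concatenated source and receiver positions
        context = self.context_encoder(obs_tokens)
        q = self.pose_embed(query_pose).unsqueeze(1)     # (B, 1, d_model)
        fused, _ = self.cross_attn(q, context, context)  # attend to context
        return self.rir_head(fused.squeeze(1))           # predicted RIR samples

# Usage with random stand-in features:
model = RIRQueryTransformer()
obs = torch.randn(2, 8, 256)   # 8 sparse observations per environment
pose = torch.randn(2, 6)
rir = model(obs, pose)         # (2, 4096)

One point of this design is that the observation set is unordered and of
arbitrary size, so a new environment can be handled few-shot simply by feeding
whatever sparse observations are available as context tokens.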