Summary: Fusion vs. Two-Stage for Multimodal Retrieval
Avi Arampatzis, Konstantinos Zagoris, and Savvas A. Chatzichristofis
Department of Electrical and Computer Engineering,
Democritus University of Thrace, Xanthi 67100, Greece
Abstract. We compare two methods for retrieval from multimodal collections.
The first is a score-based fusion of results, retrieved visually and textually. The
second is a two-stage method that visually re-ranks the top-K results textually
retrieved. We discuss their underlying hypotheses and practical limitations, and
conduct a comparative evaluation on a standardized snapshot of Wikipedia. Both
methods are found to be significantly more effective than single-modality baselines,
with no clear winner but with different robustness features. Nevertheless,
two-stage retrieval provides efficiency benefits over fusion.
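
To make the contrast between the two methods concrete, the following is a minimal sketch, not the implementation evaluated in this paper. It assumes that each modality returns a map from document id to score. The min-max normalization, the linear combination with weight `w_text`, the fixed cut-off `k`, and the choice to keep results below the cut-off in their text order are all illustrative assumptions, as are the function names.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: ((s - lo) / span if span else 0.0) for d, s in scores.items()}


def fuse(text_scores, image_scores, w_text=0.5):
    """Score-based fusion (assumed form): weighted sum of normalized scores.

    Documents missing from one modality contribute 0 for that modality.
    """
    t, v = min_max(text_scores), min_max(image_scores)
    docs = set(t) | set(v)
    combined = {d: w_text * t.get(d, 0.0) + (1 - w_text) * v.get(d, 0.0) for d in docs}
    # Sort by descending combined score; break ties by document id.
    return sorted(docs, key=lambda d: (-combined[d], d))


def two_stage(text_scores, image_scores, k):
    """Two-stage retrieval (assumed form): rank by text, re-rank the top-k by image score."""
    by_text = sorted(text_scores, key=lambda d: (-text_scores[d], d))
    head, tail = by_text[:k], by_text[k:]
    reranked = sorted(head, key=lambda d: (-image_scores.get(d, 0.0), d))
    # Assumption: results below the cut-off keep their original text order.
    return reranked + tail


# Hypothetical scores for illustration only.
text = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.5}
image = {"a": 0.1, "b": 0.9, "c": 0.8, "e": 0.95}

print(fuse(text, image))
print(two_stage(text, image, k=2))
```

In this sketch, fusion needs image scores for every candidate in either result list, whereas the two-stage variant only consults image scores for the `k` documents retrieved textually. That difference is one way to read the efficiency benefit mentioned above.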
Nowadays, information collections are not only large, but they may also be multimodal.
Take as an example Wikipedia, where a single topic may be covered in several languages
and include non-textual media such as images, sound, and video. Moreover,
non-textual media may in turn be annotated.
We focus on two modalities, text and image. On the one hand, textual descriptions
are key to retrieving relevant results for a topic, but at the same time provide little