The VBS is an international video content search competition that evaluates the state of the art in interactive video retrieval systems. It has been held annually since 2012 as a special event at the International Conference on MultiMedia Modeling (MMM). It aims at pushing research on large-scale video retrieval systems that are effective, fast, and easy to use for content search scenarios that are truly relevant in practice (e.g., known-item search in an ever-growing video archive, as is nowadays ubiquitous in many domains of our digital world).
The participants try to solve different types of content search queries that are issued in an ad-hoc manner. Although the dataset itself is available to the researchers several months before the actual competition, the queries are unknown in advance and are issued on-site. The following query types are used:
- Known-Item Search (KIS): a single video clip (a few seconds long) is randomly selected from the dataset and presented visually; this is known as visual KIS (KIS-V). The participants need to find exactly the one instance presented. A variation of this task is textual KIS (KIS-T), where instead of a visual presentation, the searched segment is described only by text given by the moderator (and presented as text via the projector). Another variant is KIS-C, where the target scene is also described textually, but with only minimal details at the beginning; further details are revealed after 60 seconds, based on questions/chats from the participants.
- Ad-hoc Video Search (AVS): here, a rather general description matching many shots is presented by the moderator (e.g., "Find all shots showing cars in front of trees"), and the participants need to find as many correct examples (instances) matching the description as possible.
- Visual Question Answering (VQA): this task type asks specific questions about a particular video or the video collection, and answers are submitted as manually entered text (typed by a human). For example, a video clip could be shown together with the question "How many nights do we see passing in the video until this segment?". Such tasks require manual inspection and interactive exploration.
Each query has a time limit (e.g., 5 minutes) and is rewarded on success with a score that depends on several factors: the required search time, the number of false submissions (which are strongly penalized), and, for AVS tasks, the number of different instances found. For AVS it is also considered how many different "ranges" were submitted: many different but temporally close shots in the same video count much less than several shots from different videos (see the sketch below).
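To make these scoring factors concrete, the following is a minimal sketch in Python of how such a scheme could be computed. All function names, constants, the linear time decay, and the range-merging window are hypothetical illustrations of the factors named above; the actual VBS scoring formulas are more elaborate and have changed across competition editions.

```python
from collections import defaultdict


def kis_score(elapsed_s: float, wrong_submissions: int,
              time_limit_s: float = 300.0,
              max_score: float = 100.0,
              penalty_per_wrong: float = 10.0) -> float:
    """Score a correct KIS submission (illustrative, not the official formula).

    The score decays linearly with the time taken, and each wrong
    submission before the correct one subtracts a fixed penalty.
    """
    if elapsed_s > time_limit_s:
        return 0.0  # no reward after the time limit expires
    time_factor = 1.0 - elapsed_s / time_limit_s  # faster answers score higher
    raw = max_score * time_factor - penalty_per_wrong * wrong_submissions
    return max(raw, 0.0)  # strong penalties, but the score never goes negative


def avs_distinct_ranges(submissions: list[tuple[str, float]],
                        merge_window_s: float = 30.0) -> int:
    """Count distinct 'ranges' among AVS submissions (illustrative).

    Submissions are (video_id, timestamp_in_seconds) pairs. Temporally
    close shots in the same video are merged into a single range, so they
    contribute far less than shots spread across different videos.
    """
    by_video: dict[str, list[float]] = defaultdict(list)
    for video_id, t in submissions:
        by_video[video_id].append(t)

    ranges = 0
    for times in by_video.values():
        times.sort()
        last = None
        for t in times:
            if last is None or t - last > merge_window_s:
                ranges += 1  # start a new range for this video
            last = t
    return ranges


# Example: a correct KIS answer after 2 minutes with one wrong attempt.
print(kis_score(elapsed_s=120.0, wrong_submissions=1))  # 50.0

# Example: two close shots in video "v1" merge into one range; a shot in
# "v2" adds another, giving 2 distinct ranges in total.
print(avs_distinct_ranges([("v1", 10.0), ("v1", 20.0), ("v2", 5.0)]))  # 2
```

Under this kind of scheme, submitting many near-duplicate shots from one video quickly saturates, while finding the same number of correct shots across different videos keeps adding to the score, which matches the incentive described above.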