Clues from the Beaten Path:

Location Estimation with Bursty Sequences of Tourist Photos

Chao-Yeh Chen and Kristen Grauman
The University of Texas at Austin

[Overview figure: bursts of tourist photos and the travel patterns between a city's locations]

We propose to exploit the travel patterns among tourists within a city to improve location recognition for new sequences of photos. Our HMM-based model treats each temporal cluster (“burst”) of photos from the same camera as a single observation, and computes a set-to-set matching likelihood function to determine visual agreement with each geospatial location. Both the learned transition probabilities between locations and this grouping into bursts yield more accurate location estimates, even when faced with non-distinct snapshots. For example, the model benefits from knowing that people travel from L1 to L2 more often than L3 or L4, and can accurately label all the photos within Burst 2 even though only one (the Statue of Liberty) may match well with some labeled instance.

 


Motivation - where did I take these pictures?

[Figure: a distinctive photo of the Statue of Liberty]

(1)            For an image with distinctive features, we can recognize its location with a nearest-neighbor match against labeled training images.

 

[Figure: a burst of photos taken within a short time span]

(2)            If we know that these three pictures were taken within a short time, we assume they come from the same place. Thus we can estimate the location of the “burst” rather than the location of each image individually.

 

[Figure: successive photos along a travel sequence]

(3)            For successive images, if we know the location of the previous one, we can exploit the learned tourists' travel patterns to better infer the labels for the entire sequence of test photos.

 


Abstract

Existing methods for image-based location estimation generally attempt to recognize every photo independently, and their resulting reliance on strong visual feature matches makes them most suited for distinctive landmark scenes. We observe that when touring a city, people tend to follow common travel patterns—for example, a stroll down Wall Street might be followed by a ferry ride, then a visit to the Statue of Liberty or Ellis Island museum. We propose an approach that learns these trends directly from online image data, and then leverages them within a Hidden Markov Model to robustly estimate locations for novel sequences of tourist photos. We further devise a set-to-set matching-based likelihood that treats each “burst” of photos from the same camera as a single observation, thereby better accommodating images that may not contain particularly distinctive scenes. Our experiments with two large datasets of major tourist cities clearly demonstrate the approach’s advantages over traditional methods that recognize each photo individually, as well as a naive HMM baseline that lacks the proposed burst-based observation model.

 


Approach

 

Training stage

 

(1)        Discovering a city’s locations

We define the locations by applying mean-shift clustering to the GPS labels of the training images. These two figures depict the discovered locations for our two datasets.
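For illustration, a minimal sketch of this clustering step with scikit-learn's MeanShift; the GPS coordinates and bandwidth below are placeholder values, not the settings used in our experiments.

# Sketch of the location-discovery step: cluster training-image GPS tags with
# mean shift so that each cluster defines one "location" of the city.
import numpy as np
from sklearn.cluster import MeanShift

# gps[i] = (latitude, longitude) of the i-th training image (hypothetical data)
gps = np.array([
    [40.6892, -74.0445],   # e.g., near the Statue of Liberty
    [40.6895, -74.0440],
    [40.7069, -74.0113],   # e.g., near Wall Street
    [40.7066, -74.0110],
])

ms = MeanShift(bandwidth=0.005)      # bandwidth controls the location extent
location_ids = ms.fit_predict(gps)   # cluster index per image, e.g. [0 0 1 1]
location_centers = ms.cluster_centers_   # one (lat, lon) center per location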

 

(2)        Feature extraction

For every image, we extract three visual features: Gist, a color histogram, and a bag of visual words. Gist captures the global scene layout and texture. The color histogram characterizes certain scene regions, such as green plants in a park. The bag-of-words descriptor captures the appearance of component objects.
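As a loose illustration of two of these descriptors (the color histogram and the bag-of-words quantization; Gist is omitted), the sketch below assumes that local descriptors have already been extracted by some detector and that a visual codebook is learned with k-means; the bin counts and codebook size are placeholder choices.

# Minimal sketch of the color histogram and bag-of-words descriptors.
import numpy as np
from sklearn.cluster import KMeans

def color_histogram(image, bins=8):
    """image: H x W x 3 uint8 array -> concatenated per-channel histogram."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def bag_of_words(local_descriptors, codebook):
    """Quantize local descriptors (N x D) against a learned visual codebook."""
    words = codebook.predict(local_descriptors)
    bow = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return bow / bow.sum()

# Learning the codebook from descriptors pooled over the training set:
# codebook = KMeans(n_clusters=200).fit(all_training_descriptors)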

 

(3)        Location summarization

Because certain popular locations contain many more images than others, non-distinctive images can bias the observation likelihood. For example, if 5% of the training images contain a car, then the most popular locations will likely contain quite a few images of cars. At test time, any image containing a car could match them strongly, even though a car is not truly characteristic of the location. To automatically select the most important aspects of a location with minimal redundancy, we apply the efficient spherical k-centroids algorithm to each location’s training images when the distribution of images among locations is highly unbalanced.
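A hedged sketch of a generic spherical k-means routine of the kind used for this summarization step; the initialization, iteration count, and the final rule of keeping the image nearest each centroid are assumptions for illustration.

# Spherical k-means on unit-normalized features; assignments use cosine similarity.
import numpy as np

def spherical_kmeans(X, k, n_iters=20, seed=0):
    """X: N x D feature matrix; returns unit-norm centroids and assignments."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)        # unit-normalize
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        assign = np.argmax(X @ centroids.T, axis=1)          # cosine similarity
        for j in range(k):
            members = X[assign == j]
            if len(members):
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)          # re-normalize mean
    return centroids, assign

# Summarize one location by keeping the image nearest each centroid
# (features assumed unit-normalized):
# centroids, assign = spherical_kmeans(location_features, k=10)
# exemplar_ids = np.argmax(location_features @ centroids.T, axis=0)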

 

(4)        Learning the Hidden Markov model

[Figures: the HMM graphical model and its mapping to the location estimation problem]

To train the HMM, we learn the initial state priors and state transition probabilities from the travel image sequences in the training data. These two figures show the relation between the HMM and our location estimation problem.
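A minimal sketch of this parameter estimation from sequences of location labels; the Laplace smoothing constant is an assumption added for illustration.

# Estimate the HMM initial-state prior and transition matrix by counting
# which location each training sequence starts in and how often consecutive
# bursts move from one location to another.
import numpy as np

def learn_hmm_params(sequences, n_locations, alpha=1.0):
    """sequences: list of lists of location indices, one list per tourist."""
    prior = np.full(n_locations, alpha)                 # smoothed start counts
    trans = np.full((n_locations, n_locations), alpha)  # smoothed pair counts
    for seq in sequences:
        prior[seq[0]] += 1
        for a, b in zip(seq[:-1], seq[1:]):
            trans[a, b] += 1
    prior /= prior.sum()
    trans /= trans.sum(axis=1, keepdims=True)
    return prior, trans

# prior, trans = learn_hmm_params([[0, 0, 2, 3], [1, 2, 2]], n_locations=4)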

 

Testing stage

 

(1)        Grouping photos into bursts

 

[Figure: grouping a photo stream into bursts by timestamp]

We apply mean shift to the photo timestamps to compute the bursts. A burst is meant to capture a small event during the trip. When inferring the labels for a novel sequence, we assume that all photos in a single burst share the same location label.
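A small sketch of the burst grouping, assuming one camera's timestamps in seconds and a placeholder mean-shift bandwidth (not the setting used in our experiments).

# Run mean shift on the 1-D photo timestamps of one camera; photos that
# converge to the same mode form one burst.
import numpy as np
from sklearn.cluster import MeanShift

def group_into_bursts(timestamps_sec, bandwidth=1800.0):
    """timestamps_sec: 1-D array of photo times (seconds) from one camera."""
    t = np.asarray(timestamps_sec, dtype=float).reshape(-1, 1)
    labels = MeanShift(bandwidth=bandwidth).fit_predict(t)
    # Return photo indices grouped by burst, ordered by their first timestamp.
    bursts = {}
    for idx, lab in enumerate(labels):
        bursts.setdefault(lab, []).append(idx)
    return sorted(bursts.values(), key=lambda b: timestamps_sec[b[0]])

# Example: three photos within minutes, then one ~2 hours later:
# group_into_bursts([0, 120, 300, 7800])  ->  [[0, 1, 2], [3]]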

 

(2)        Location Estimation via HMM

In the testing stage, our goal is to estimate the most likely series of locations for a novel sequence of bursts. The initial state priors and state transition probabilities have already been learned during training.

Here we define the observation likelihood distribution through a set-to-set matching weight \omega. Given burst B_t and a candidate location L_m, we let

P(S_t = L_m \mid B_t) \;\propto\; \omega(L_m),

where

\omega(L_m) \;=\; \frac{\sum_{I_n \in M_m} \exp\big(-D(I_{q(n)}, I_n)/\sigma\big) \,+\, \epsilon}{\sum_{j}\Big(\sum_{I_n \in M_j} \exp\big(-D(I_{q(n)}, I_n)/\sigma\big) \,+\, \epsilon\Big)},

with M_m denoting the set of retrieved training neighbors that belong to location L_m, and I_{q(n)} the image in burst B_t whose query retrieved neighbor I_n. Then, by Bayes' rule, we have

P(B_t \mid S_t = L_m) \;=\; \frac{P(S_t = L_m \mid B_t)\, P(B_t)}{P(S_t = L_m)} \;\propto\; \frac{\omega(L_m)}{P(S_t = L_m)}.

The distance D(I_t, I_m) measures the visual feature distance between two images, \epsilon is the regularization constant, and \sigma is the scaling parameter.
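A hedged sketch of how the weight \omega(L_m) above could be computed for one burst; the data structure holding the retrieved neighbors and the parameter values are assumptions for illustration.

# Set-to-set observation weight w(L_m) for one burst.  `neighbors` holds, for
# each test image in the burst, its K retrieved training images as
# (location_id, distance) pairs; sigma and eps play the roles of the scaling
# and regularization constants in the text (values here are placeholders).
import numpy as np

def burst_location_weights(neighbors, n_locations, sigma=1.0, eps=1e-3):
    """neighbors: list over burst images of lists of (location_id, distance)."""
    scores = np.full(n_locations, eps)            # regularization for every L_m
    for per_image in neighbors:
        for loc, dist in per_image:
            scores[loc] += np.exp(-dist / sigma)  # similarity of this pair
    return scores / scores.sum()                  # normalized weights w(L_m)

# Example: a burst of 2 images with K = 3 neighbors each; most neighbors come
# from location 1, so the weight peaks there.
# w = burst_location_weights(
#     [[(1, 0.2), (1, 0.3), (0, 0.9)], [(1, 0.25), (2, 0.8), (1, 0.4)]],
#     n_locations=3)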

 

[Figures: set-to-set matching of a burst's retrieved neighbors (left) and an example burst likelihood (right)]

Given a burst B_t that contains G images, suppose we retrieve K = 3 neighbors for each test image, giving 3×G retrieved training images. Among them, images 1, 3, 6, and M come from location L_1, which means that M_1 = {I_{n_1}, I_{n_3}, I_{n_6}, I_{n_M}}. Thus, the numerator of \omega(L_1) is determined by the four pairs in the grey circle in the left figure. For example, given the burst in the right figure, the four retrieved training images inside the red circle come from location L_1 and form M_1; the similarity between these four images and their query images determines the observation likelihood of this burst for L_1.
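To recover the most likely series of locations over the bursts, a standard Viterbi decode can be used; the sketch below is the textbook dynamic program shown for illustration, with the learned prior, transition matrix, and per-burst observation weights (e.g., from the sketch above) assumed as inputs.

# Viterbi decoding over the burst sequence.  obs[t] is the per-location
# observation weight of burst t; prior and trans are the learned HMM
# parameters.  Log-space is used for numerical stability.
import numpy as np

def viterbi(prior, trans, obs):
    T, n = len(obs), len(prior)
    logp = np.log(prior) + np.log(obs[0])             # best score ending at each state
    back = np.zeros((T, n), dtype=int)                # backpointers
    for t in range(1, T):
        cand = logp[:, None] + np.log(trans)          # n x n candidate scores
        back[t] = np.argmax(cand, axis=0)             # best predecessor per state
        logp = cand[back[t], np.arange(n)] + np.log(obs[t])
    path = [int(np.argmax(logp))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]                                 # one location per burst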

 

 

Comparison with other methods

The authors of [1] develop an HMM parameterized by time intervals to predict locations for photo sequences taken on transcontinental trips. Their work also exploits human travel patterns, but at a much coarser scale: the world is binned into 3,186 bins of 400 km² each, and transitions and test-time predictions are made only among these bins. A further distinction is our proposed burst-based set-to-set observation likelihood. We show that our burst-based method (Burst-HMM) outperforms the image-based likelihood method (Int-HMM) in the results section.

 

The authors of [2] treat location recognition as a multi-class recognition task, where the five images before and after the test image serve as temporal context within a structured SVM model. This strategy likely has a label-smoothing effect similar to our method’s initial burst-grouping stage, but it does not leverage statistics of travel patterns.

 

[1] E. Kalogerakis, O. Vesselova, J. Hays, A. Efros, and A. Hertzmann. Image sequence geolocation with human travel priors. In ICCV, 2009.

[2] Y. Li, D. J. Crandall, and D. P. Huttenlocher. Landmark classification in large-scale image collections. In ICCV, 2009.


Results

 

We test our method on two datasets, Rome and New York, built from real user-supplied photos downloaded from the Web. We make direct comparisons with four key baselines and analyze the impact of the various components.

 

(1)          Properties of the two datasets:

Mean shift on the training data discovers 26 locations of interest for Rome and 25 for New York. The average location size is 0.2 mi² in Rome and 3 mi² in New York. The ground truth location for each test image is determined by the training image to which it is nearest in geo-coordinates.
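A small sketch of this ground-truth assignment rule; using Euclidean distance directly on (latitude, longitude) is an assumption, since the text only specifies the nearest training image in geo-coordinates.

# Each test image inherits the location label of its geographically nearest
# training image.  All inputs are NumPy arrays: test_gps (N x 2), train_gps
# (M x 2), train_locations (M,).
import numpy as np

def ground_truth_labels(test_gps, train_gps, train_locations):
    d = np.linalg.norm(test_gps[:, None, :] - train_gps[None, :, :], axis=2)
    return train_locations[np.argmin(d, axis=1)]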

         

(2)          Location estimation accuracy:

NN = nearest-neighbor method; Img-HMM = image-to-image HMM; Int-HMM = image-to-image HMM that incorporates the length of the interval between consecutive images (as used in [1]); Burst Only = uses the same bursts as computed for our method, but lacks the travel transitions and priors; Burst-HMM = our method. Avg/seq is the average rate of correct predictions across the test sequences, and Overall is the correct-prediction rate across all test images.
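For concreteness, the two metrics could be computed as in the sketch below, where pred_seqs and gt_seqs are hypothetical lists of predicted and ground-truth label sequences.

# Avg/seq averages the per-sequence accuracy; Overall pools all test images.
import numpy as np

def accuracy_metrics(pred_seqs, gt_seqs):
    per_seq = [np.mean(np.array(p) == np.array(g))
               for p, g in zip(pred_seqs, gt_seqs)]
    all_pred = np.concatenate([np.array(p) for p in pred_seqs])
    all_gt = np.concatenate([np.array(g) for g in gt_seqs])
    return np.mean(per_seq), np.mean(all_pred == all_gt)   # (Avg/seq, Overall)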

 

 

(3)          Qualitative results:

We show four example results comparing predictions by our Burst-HMM (“Ours”) and the Img-HMM baseline (“Base”). Images in the same cell come from the same burst. A check mark denotes a correct prediction; an ‘x’ denotes an incorrect one.

 

[Figure: two example travel sequences from the New York dataset]
Two travel sequences from the New York dataset. In the first sequence, images with distinctive features, such as Images 2-5 and 16-17, are predicted correctly by both methods. While the baseline fails on less distinctive scenes (e.g., Images 8-14), our method estimates them correctly, likely by exploiting both informative matches to another view within the burst (e.g., the landmark building in Image 8 or 13) and the transitions from burst to burst. Our method can also fail if a burst consists of only non-distinctive images (Image 1). In the second sequence, our method locates Images 9-12 correctly by leveraging strong cues from Images 13 and 14.

 

 

[Figure: two example travel sequences from the Rome dataset]

Two travel sequences from the Rome dataset. In the first sequence, for the images within the red rectangle, our method correctly locates Images 5-14 thanks to the strong hints in Images 3, 4, and 17, while the baseline fails due to the lack of temporal constraints and distinctive features. In the second sequence, our method predicts Images 7-10 correctly using the strong hints in Images 5 and 6.

 


Dataset

[Figure: locations and example images for the Rome and New York datasets]

 


Poster

Download

 


Publication

Clues from the Beaten Path: Location Estimation with Bursty Sequences of Tourist Photos [ pdf ]

Chao-Yeh Chen and Kristen Grauman
To appear in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, June 2011.