MCG-LSDVCD: A Large Scale Dataset for Visual Concept Detection

1. Introduction to visual concept detection

The goal of visual concept detection is to predict the presence of certain concepts, such as animals, buildings, and mountains. It is essentially a classification task, in which a classifier is usually learned from a set of training images or video keyframes [1]. Visual concept detection on real-life data, such as Internet images and web videos, is computationally expensive. Learning classifier models usually involves a huge volume of training data, and realtime feature extraction and classification are challenging as well. These efficiency challenges become especially prominent when building a realtime concept detector on millions of samples [2].

Here, we present an extremely efficient concept detection system based on a novel bag-of-words extraction method and sparse ensemble learning. We will show that the presented system can efficiently build concept detectors from millions of images and achieve realtime concept detection on unseen images with state-of-the-art accuracy. To do so, we first develop an efficient bag of visual words (BoW) construction method based on sparse non-negative matrix factorization (NMF) [1,2] and GPU-enabled SIFT feature extraction [3]. We then develop a sparse ensemble learning method to build the detection model, which reduces learning time by an order of magnitude compared with traditional methods such as the Support Vector Machine. We also construct a large training set using both pseudo relevance feedback for negative samples and interactive feedback for positive samples. The demo video of our concept detection system is available on YouTube.

2. Collection and annotation of training set

Although the explosion of Internet images makes it much easier to collect training data, annotating data at such a scale becomes a challenge. Here, we adopt a feedback annotation mechanism to reduce the human labeling effort. First, we select and annotate 110K images (subdirectories 1-11) from the MIRFLICKR-1M dataset [4,5], 17K newly downloaded images from Flickr [6], and 113K images from ImageNet [7] on 8 concepts from TRECVID: cityscape, building, plant, waterscape, landscape, car, animal and flower. We then use these images as seeds to learn concept detectors, and apply the detectors to another 300K images (subdirectories 21-40 and 91-100) in the MIRFLICKR-1M dataset. Through the interactive interface in our demo, 8 annotators are asked to check the samples with high scores for each concept. After manual checking, we add both the false positive and the true positive samples to our final training set. Images with very low scores are treated as negative training data. The rationale is that samples with very low scores are true negatives with very high probability (above 99% in our experiments, owing to the high recall of our system). In total, we collect over 540K training images with minimal human labeling, since the proportion of high-score samples to be checked is very small.
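The score-based split described above can be sketched as follows. This is only an illustration: the function name `split_by_score` and the two thresholds `high` and `low` are hypothetical, since the text does not state the actual cut-off values used in our system.

```python
import numpy as np

def split_by_score(scores, high=0.8, low=0.05):
    """Partition detector scores for one concept into three index groups.

    `high` and `low` are hypothetical thresholds chosen for illustration;
    the actual cut-offs are not given in the text.
    """
    scores = np.asarray(scores, dtype=float)
    to_check = np.flatnonzero(scores >= high)      # sent to annotators for manual checking
    pseudo_negative = np.flatnonzero(scores <= low)  # treated as negatives without checking
    unlabeled = np.flatnonzero((scores > low) & (scores < high))  # left non-labeled
    return to_check, pseudo_negative, unlabeled

# Toy scores for six images on a single concept.
scores = [0.95, 0.02, 0.40, 0.88, 0.01, 0.60]
to_check, pseudo_negative, unlabeled = split_by_score(scores)
print(to_check, pseudo_negative, unlabeled)
```

Images in the middle band stay non-labeled for that concept, which is consistent with the non-labeled counts reported for the training set.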

The positive, negative and non-labeled numbers of each concept in the training set are listed in Table 1.

Table 1. The positive, negative and non-labeled numbers of each concept in the training set

Concept        Positive   Negative   Non-labeled     Total
animal           30,945    255,485       254,490   540,920
building         31,268    283,482       226,170   540,920
car              19,000    366,765       155,155   540,920
cityscape         9,489    363,274       168,157   540,920
flower           17,921    331,074       191,925   540,920
plant            32,509    207,753       300,658   540,920
landscape        20,563    297,180       223,177   540,920
waterscape       14,810    368,311       157,799   540,920

3. Test set and groundtruth

We select and annotate another 20K images from MIRFLICKR-1M dataset (the 81-82 subdirectories) as our test set. The positive and negative numbers with regard to each concept are listed in Table 2.

Table 2. The positive and negative numbers of each concept in the test set

Concept        Positive   Negative    Total
animal            1,616     18,384   20,000
building          1,422     18,578   20,000
car                 330     19,670   20,000
cityscape         1,749     18,251   20,000
flower            1,209     18,791   20,000
plant             3,242     16,758   20,000
landscape         1,870     18,130   20,000
waterscape        1,104     18,896   20,000

4. Feature extraction

We use the bag of visual words (BoW) as the image feature. To achieve high efficiency for realtime applications, we use SiftGPU [3], a GPU implementation of SIFT, for fast extraction of SIFT features. Rather than traditional vector quantization (VQ), we employ sparse coding to build BoW image representations. Because its constraint is less restrictive, sparse coding achieves a lower reconstruction error, and hence less information loss in quantization [8]. Specifically, we use sparse NMF: its non-negativity constraints allow only additive, not subtractive, combinations, which has shown advantages in learning parts-based representations and semantic features of text [9]. Instead of applying the max pooling function to the sparse codes, we propose to compute the BoW histogram through a soft-weighting scheme. Suppose x_{i} \in R^{l} is the i-th local descriptor from the observed feature matrix X=[x_{1},x_{2},...,x_{N}] \in R^{l \times N}, d_{j} is the j-th atom of a learned dictionary D=[d_{1},d_{2},...,d_{k}] \in R^{l \times k}, and the descriptor dimensionality is l = 128 for SIFT. We compute the BoW histogram as a weighted sum of similarities between descriptors and dictionary atoms, using the l_{1}-normalized sparse code as weights:

z=\sum_{i=1}^{N}\left( \frac{1}{\sum_{j=1}^{M}\alpha_{ij}}\sum_{j=1}^{M}\alpha_{ij}sim(x_{i},d_{j}) \right)

where the sparse code \alpha_{ij} denotes the non-zero response of local descriptor x_{i} to the corresponding atom d_{j}. The measure sim(x_{i},d_{j}) represents the similarity between x_{i} and d_{j}. In this demonstration, we use the cosine similarity and restrict the maximum number of non-zero responses (sparsity) to M = 4 during sparse coding.
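A minimal NumPy sketch of the soft-weighting scheme is given below. It assumes the sparse codes have already been computed and are non-negative, and it assumes a per-atom reading of the formula in which the contribution of descriptor x_{i} to atom d_{j} is accumulated in bin j of a k-dimensional histogram; summing the bins gives the scalar expression as written above. The function name `soft_weighted_bow` is hypothetical.

```python
import numpy as np

def soft_weighted_bow(X, D, alpha):
    """Soft-weighted BoW histogram (illustrative sketch).

    X     : (l, N) local descriptors, one per column
    D     : (l, k) dictionary atoms, one per column
    alpha : (N, k) non-negative sparse codes (at most M non-zeros per row)
    Returns a k-dimensional histogram (per-bin reading is an assumption).
    """
    alpha = np.asarray(alpha, dtype=float)
    # Cosine similarity between every descriptor and every atom.
    Xn = X / np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1e-12)
    Dn = D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    sim = Xn.T @ Dn                                   # shape (N, k)
    # l1-normalize each descriptor's sparse code; all-zero rows stay zero.
    row_sum = alpha.sum(axis=1, keepdims=True)
    weights = np.divide(alpha, row_sum, out=np.zeros_like(alpha), where=row_sum > 0)
    # Accumulate the weighted similarities per atom.
    return (weights * sim).sum(axis=0)

# Toy example with l = 2, k = 2 atoms and N = 2 descriptors.
D = np.array([[1.0, 0.0],
              [0.0, 1.0]])
X = np.array([[1.0, 1.0],
              [0.0, 1.0]])
alpha = np.array([[1.0, 0.0],     # descriptor 1 responds to atom 1 only
                  [1.0, 1.0]])    # descriptor 2 responds equally to both atoms
z = soft_weighted_bow(X, D, alpha)
print(z)
```

Zero entries of `alpha` contribute nothing, so with M = 4 each descriptor touches at most four bins regardless of the vocabulary size.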

In our implementation, we extract SIFT features on an NVIDIA GeForce GTX 580 GPU. After randomly selecting 10M SIFT descriptors, we use the online learning method proposed in [10] to train the dictionary. Learning the dictionary from the 10M SIFT descriptors takes only 1,714 seconds (less than half an hour), which is very efficient compared with fast k-means (7.6 hours with Cluto [11]). For sparse decomposition, we use the efficient and robust LARS-Lasso algorithm proposed in [12]. Since M is far smaller than the vocabulary size k (= 512 in our system), extraction of BoW takes only 142 ms, about 12 times faster than traditional BoW extraction through k-means VQ of standard SIFT (1,740 ms in total: 912 ms for SIFT extraction and 828 ms for BoW representation).

Besides GPU-enabled BoW (512-D) via sparse NMF, we also extract color moments (225-D) and edge histogram (80-D) for concept detection.

5. Our baseline and demonstration

Figure 1: Framework of Sparse Ensemble Learning

Large-scale training datasets pose a great challenge to the efficiency of both training and testing. To overcome the low efficiency and poor scalability of classical methods based on global classification, we proposed sparse ensemble learning (SEL) for visual concept detection in [1]. As shown in Figure 1, SEL leverages the sparse codes of image features both to partition the large-scale training set into small localities for efficient training and to coordinate the individual classifiers of each locality for final classification. The sparsity ensures that only a small number of classifiers in the ensemble fire on a test sample, which not only gives better average precision (AP) than other fusion methods such as average fusion and AdaBoost, but also makes online detection highly efficient.
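The partition-then-coordinate idea can be illustrated with the toy sketch below. It is not the SEL implementation from [1]: the ridge-regression local classifiers, the Gaussian affinities that stand in for real sparse codes, and all function names are assumptions made only to keep the example short and self-contained.

```python
import numpy as np

def top_l_localities(codes, L):
    """Indices of the L largest code entries of each sample, shape (n, L)."""
    return np.argsort(-codes, axis=1)[:, :L]

def train_sel(X, y, codes, L=3, ridge=1e-3):
    """Train one linear model per locality (ridge regression is an assumption)."""
    n, d = X.shape
    K = codes.shape[1]
    Xb = np.hstack([X, np.ones((n, 1))])           # append a bias term
    assign = top_l_localities(codes, L)            # each sample joins L localities
    models = np.zeros((K, d + 1))
    for k in range(K):
        members = np.flatnonzero((assign == k).any(axis=1))
        if members.size == 0:
            continue                               # empty locality keeps a zero model
        A = Xb[members]
        models[k] = np.linalg.solve(A.T @ A + ridge * np.eye(d + 1), A.T @ y[members])
    return models

def predict_sel(X, codes, models, L=3):
    """Combine only the L local models selected by each sample's code."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    assign = top_l_localities(codes, L)
    w = np.take_along_axis(codes, assign, axis=1)
    w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)
    local_scores = np.einsum('nd,nld->nl', Xb, models[assign])
    return (w * local_scores).sum(axis=1)

# Toy data: 2-D points, 4 localities, and affinities used in place of sparse codes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
anchors = rng.normal(size=(4, 2))
codes = np.exp(-((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=-1))

models = train_sel(X, y, codes, L=2)
scores = predict_sel(X, codes, models, L=2)
print("training accuracy:", np.mean(np.sign(scores) == y))
```

At test time only L of the K local models are evaluated per sample, which is the property that keeps online detection cheap as the number of localities grows.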

Figure 2: System Interface

In this demonstration, we will show:
(1) The system can build detection models on millions of training images very efficiently.
(2) The system can perform realtime concept detection on unseen images with state-of-the-art accuracy.

Figure 2 shows the interface of the concept detection system. The system integrates our efficient GPU-enabled BoW feature extraction (with prior fusion of edge histogram and color moments) and the SEL framework (the number of localities is 1024, and the replication parameter is L = 3). Training a concept detector (not including feature extraction) takes around 2 hours on 0.54 million images (actually 1.6M samples, since the SEL framework assigns each sample to L localities). To the best of our knowledge, this efficiency outperforms all existing systems. The demonstration will also show that concept detection on unseen images achieves realtime efficiency. It takes about 320 ms for all 8 concept detectors to examine one image. Specifically, feature extraction takes 142 ms for BoW, 20 ms for the edge histogram and 70 ms for color moments. The classification process costs only about 11 ms per concept despite the large-scale training set.
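The per-image time budget and the replicated sample count follow from the figures quoted above; the short check below only redoes that arithmetic. The variable names are illustrative, and the image count is taken from the Total column of Table 1.

```python
# Per-image detection budget, using the timings stated in the text (milliseconds).
feature_ms = {"bow": 142, "edge_histogram": 20, "color_moments": 70}
classify_ms_per_concept = 11
num_concepts = 8

total_ms = sum(feature_ms.values()) + classify_ms_per_concept * num_concepts
print(total_ms)            # 232 ms of features + 88 ms of classification

# Training samples after each image is assigned to L localities.
num_images = 540_920       # Total column of Table 1
L = 3
replicated_samples = num_images * L
print(replicated_samples)  # roughly 1.6M
```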

The preliminary results of our demonstration system are shown in Table 3. Our MAP (Mean Average Precision) over the 8 concepts is 0.6030. The table shows that more than 90% of the top 100 results are correct for 6 concepts (building, cityscape, flower, plant, landscape and waterscape), which demonstrates the effectiveness of our SEL method.

Table 3. Average Precision and Number of Hits at different depths in our result set

Concept        Depth 100   Depth 200   Depth 500   Depth 1000   Depth 2000   Total Positive   Average Precision
animal                48         134         340          577          875            1,616              0.4394
building              91         174         380          554          839            1,422              0.5254
car                   73         114         191          225          277              330              0.4502
cityscape             96         191         458          853        1,498            1,749              0.8327
flower                92         172         348          535          770            1,209              0.5267
plant                 99         196         467          886        1,579            3,242              0.7446
landscape             97         187         446          759        1,236            1,870              0.6906
waterscape            97         180         401          605          805            1,104              0.6145
MAP                                                                                                      0.6030
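For readers who want to reproduce the summary figure, the sketch below computes non-interpolated average precision for a ranked list and then averages the per-concept AP values of Table 3. The exact AP variant used in our evaluation is not stated here, so the `average_precision` function is an assumption; the AP values themselves are copied from the table.

```python
import numpy as np

def average_precision(relevance, num_positive=None):
    """Non-interpolated AP of a ranked list (assumed variant).

    relevance    : 0/1 flags in rank order (1 = correct hit)
    num_positive : total positives in the test set; defaults to the hits in the list
    """
    rel = np.asarray(relevance, dtype=float)
    hits = np.cumsum(rel)                      # number of hits at each depth
    ranks = np.arange(1, rel.size + 1)
    precision_at_hits = (hits / ranks)[rel > 0]
    total = rel.sum() if num_positive is None else num_positive
    return float(precision_at_hits.sum() / total) if total > 0 else 0.0

# Toy ranked list: hits at ranks 1, 3 and 4.
toy_ap = average_precision([1, 0, 1, 1, 0])
print(toy_ap)

# Per-concept AP values copied from Table 3.
ap_table = {
    "animal": 0.4394, "building": 0.5254, "car": 0.4502, "cityscape": 0.8327,
    "flower": 0.5267, "plant": 0.7446, "landscape": 0.6906, "waterscape": 0.6145,
}
mean_ap = sum(ap_table.values()) / len(ap_table)
print(round(mean_ap, 4))
```

The "Depth" columns of the table are hit counts at a cut-off, so precision at depth 100 for plant is simply 99 / 100.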

6. Download

To encourage research on visual concept detection (or semantic indexing), automatic annotation, image classification, transfer learning (for example, between our dataset and the TRECVID dataset) and related topics, we will provide the following information about the dataset upon request:

(1) The image name lists for both training and test dataset.
(2) The images of training set downloaded from Flickr [6]. According to the image name lists, all other images can be downloaded from ImageNet [7], and the MIRFLICKR-1M Dataset [4].
(3) The features (BoW, Color moments, Edge histogram) of training set and test set.
(4) The annotation of the training set and ground-truth of the test set.
(5) The demo video and paper submitted to ACM Multimedia 2013 conference can be downloaded here: [Demo video] [Demo paper].

To obtain all of this information, please email your full name and affiliation to the contact person listed in Section 7. Once your information has been verified, we will send you links to download the above zipped files. We ask for your information only to make sure the dataset is used for non-commercial research purposes; we will not give it to any third parties or publish it anywhere.

7. Contact Person

If you have any questions about the MCG-LSDVCD dataset, feel free to contact:

Name: Tang Sheng
Detailed contact information can be found on his homepage:

We are continuously striving to improve the dataset, and will greatly appreciate any comments and suggestions from you.

8. Acknowledgments

We would like to express our sincere thanks to our team members: Yang Cao, Ya-Lin Zhang, You-Cheng Jiang, Wu Liu, Ji Wan, Yu Wang, Fei-Die Liang, Lei Huang, Xue-Ming Wang, Qi Han, Zuo-Xin Xu, Hui Chen, Ming Lin, Bo Ning and Cai-Hong Sun, for their laborious manual annotation of the training set and the groundtruth of the test set.

9. References

  1. Sheng Tang, Yan-Tao Zheng, Yu Wang, Tat-Seng Chua, "Sparse Ensemble Learning for Concept Detection", IEEE Transactions on Multimedia, Volume 14, Issue 1, pages 43-54, February 2012.
  2. Sheng Tang, Yong-Dong Zhang, Zuo-Xin Xu, et al., "An Efficient Concept Detection System Via Sparse Ensemble Learning", Neurocomputing, Volume 169, pages 124-133, December 2015.
  3. Sheng Tang, Hui Chen, Yu Li, Jun-Bin Xiao and Jin-Tao Li, "A Sparse Ensemble Learning System For Efficient Semantic Indexing", ACM International Conference on Multimedia Retrieval (ICMR), Shanghai, China, June 23-26, 2015.
  4. C. Wu, "SiftGPU: A GPU implementation of scale invariant feature transform (SIFT)", 2007.
  5. MIRFLICKR-1M Dataset:
  6. M. J. Huiskes, B. Thomee, and M. S. Lew, "New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative", in MIR '10, 2010.
  7. Flickr:
  8. ImageNet:
  9. J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification", in CVPR '09, 2009.
  10. D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization", Nature, 401(6755):788-791, October 1999.
  11. J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding", J. Mach. Learn. Res., 11:19-60, March 2010.
  12. G. Karypis, "CLUTO - a clustering toolkit", DTIC Document, 2002.
  13. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression", Annals of Statistics, 32(2):407-499, 2004.

Our other related publications on Ensemble Learning for Visual Concept Detection:

  1. Yan Song, Yan-Tao Zheng, Sheng Tang, Xiangdong Zhou, Yongdong Zhang, Shouxun Lin, Tat-Seng Chua, "Localized Multiple Kernel Learning for Realistic Human Action Recognition in Videos", IEEE Transactions on Circuits and Systems for Video Technology, Volume 21, Issue 9, pages 1193-1202, September 2011.
  2. Sheng Tang, Yan-Tao Zheng, Gang Cao, Yong-Dong Zhang and Jin-Tao Li, "Ensemble Learning with LDA Topic Models for Visual Concept Detection", in Multimedia - A Multidisciplinary Approach to Complex Issues (ISBN: 978-953-51-0216-8), InTech, chapter 9, pages 175-200, March 2012.
  3. Sheng Tang, Jin-Tao Li, Yong-Dong Zhang, et al., "PornProbe: an LDA-SVM based Pornography Detection System", ACM Multimedia 2009, Beijing, China, October 19-24, 2009.
  4. Sheng Tang, Jin-Tao Li, Ming Li, Cheng Xie, Yi-Zhi Liu, Kun Tao, Shao-Xi Xu, "TRECVID 2008 High-Level Feature Extraction By MCG-ICT-CAS", Proc. TRECVID 2008 Workshop, Gaithersburg, USA, November 2008.