infolab

WEB AND DISTRIBUTED INFORMATION MANAGEMENT

We have released the following datasets for non-commercial research purposes under a Creative Commons License (see below). The datasets were collected through crawlers based on the respective APIs or through HTML parsers. You’re welcome to use our datasets in your research. If you use a dataset, please cite the corresponding paper. Let us know if you have any questions.

Geo-Tag Hashtag Dataset

Download, readme, bibtex, contact

We provide geo-tagged hashtags collected on Twitter. The dataset was used for experiments in the paper “Spatio-Temporal Dynamics of Online Memes: A Study of Geo-Tagged Tweets” in WWW 2013. The dataset contains 99,015 hashtags, and 20,949,293 occurrences of those hashtags with locations and timestamps of where and when the hashtags occurred.

CITATION: K. Kamath, J. Caverlee, K. Lee, and Z. Cheng. Spatio-Temporal Dynamics of Online Memes: A Study of Geo-Tagged Tweets. In Proceedings of the 22nd Annual ACM World Wide Web (WWW) Conference 2013, Rio de Janeiro, May 2013.

Social Honeypot Dataset

Download, readme, bibtex, contact

We provide a social honeypot dataset collected on Twitter from December 30, 2009 to August 2, 2010. The dataset was used for experiments in the paper “Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter” in ICWSM 2011. The dataset contains 22,223 content polluters, their number of followings over time, and their 2,353,473 tweets, as well as 19,276 legitimate users, their number of followings over time, and their 3,259,693 tweets.

CITATION: K. Lee, B. Eoff, and J. Caverlee. Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (ICWSM), Barcelona, July 2011.

Location Sharing Services Dataset

Download, readme, bibtex, contact

We provide both the check-in and user data (collected from September 2010 to January 2011) used in the paper “Exploring Millions of Footprints in Location Sharing Services” in ICWSM 2011.

CITATION: Z. Cheng, J. Caverlee, K. Lee, and D. Z. Sui. Exploring Millions of Footprints in Location Sharing Services. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (ICWSM), Barcelona, July 2011.

Microblog Location Dataset

Download, readme, bibtex, contact

We provide both the training and test sets used in our paper “You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users” in CIKM 2010. The training set contains 130,689 Twitter users and 4,124,960 updates from those users. All user locations in the training set are self-labeled at city-level granularity within the United States. The test set contains 5,136 Twitter users and 5,156,047 tweets from those users. All user locations in the test set were uploaded from the users’ smartphones in the form “UT: latitude,longitude”.
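As a rough illustration of the “UT: latitude,longitude” form described above, the following Python sketch extracts a coordinate pair from a location string. It is not part of the dataset or the paper: the function name `parse_ut_location`, the exact pattern (optional whitespace, signed decimal numbers), and the sample coordinates are assumptions made for this example, so please check the readme for the actual file layout.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: "UT:" followed by two signed decimal numbers
# separated by a comma, with optional whitespace around them.
_UT_PATTERN = re.compile(r"UT:\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def parse_ut_location(text: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) if text contains a UT-style location."""
    match = _UT_PATTERN.search(text)
    if match is None:
        return None  # e.g. a self-labeled city name rather than coordinates
    lat, lon = float(match.group(1)), float(match.group(2))
    # Reject values outside the valid latitude/longitude ranges.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the dataset.
    print(parse_ut_location("UT: 30.6280,-96.3344"))
    print(parse_ut_location("College Station, TX"))
```

Returning `None` for strings that do not match lets the same function be used to separate coordinate-based locations from free-text ones.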

CITATION: Z. Cheng, J. Caverlee, and K. Lee. You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, Oct 2010.

Social Datasets by Infolab is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License