infolab

WEB AND DISTRIBUTED INFORMATION MANAGEMENT

The resources here are either provided by the infolab or are highly recommended. We hope you find them useful.


Twitter small corpus

Download sample.tgz, a sample of 1000 randomly selected Twitter users and their tweets. Each user’s tweets reside within a file named after that user.

For example, a tweet sent by Joe would be in joe.tweets and be formatted as follows:

2008-11-14T16:01:38+00:00 i had catfood for dinner

NOTE: The contents of the tweets have not been censored in any way. All content is publicly available, so no anonymization has been performed.
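As a rough illustration of reading the corpus, the sketch below parses a user file such as joe.tweets. It assumes that each non-empty line holds one tweet, with an ISO 8601 timestamp, a single space, and then the tweet text, as in the example above. The names Tweet, parse_tweet_line, and read_tweets are hypothetical helpers, not part of the download.

```python
from datetime import datetime
from typing import Iterator, NamedTuple


class Tweet(NamedTuple):
    timestamp: datetime
    text: str


def parse_tweet_line(line: str) -> Tweet:
    """Split one line into timestamp and text at the first space (assumed layout)."""
    stamp, _, text = line.rstrip("\n").partition(" ")
    return Tweet(datetime.fromisoformat(stamp), text)


def read_tweets(path: str) -> Iterator[Tweet]:
    """Yield every tweet in a user file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_tweet_line(line)
```

For instance, parse_tweet_line("2008-11-14T16:01:38+00:00 i had catfood for dinner") yields a timezone-aware timestamp and the text "i had catfood for dinner". If the real files contain tweets that span several lines, this line-based reader would need adjusting.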

Software Toolkits and APIs

This is a collection of software that the infolab members have found particularly useful at one time or another. We are incredibly grateful to the authors of these assets.

Twitter NLP Tools

Part-of-speech

  1. Tool: https://nlp.stanford.edu/software/tagger.shtml
  2. Tool: http://www.cs.cmu.edu/~ark/TweetNLP
  3. Paper: http://www.cs.cmu.edu/~ark/TweetNLP/gimpel+etal.acl11.pdf

Named entity taggers

  1. Tool: https://github.com/aritter/twitter_nlp
  2. Paper: https://homes.cs.washington.edu/~mausam/papers/emnlp11.pdf
  3. Paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.481.6809&rep=rep1&type=pdf

Syntactic parsers

  1. Tool: http://www.cs.cmu.edu/~ark/TweetNLP
  2. Paper: https://www.cs.cmu.edu/~nschneid/twparser.pdf

Negation detection

  1. Tool: http://blulab.chpc.utah.edu/content/contextnegex
  2. Tool: https://sourceforge.net/projects/scopefinder

Sentiment Polarity Lexicons

  1. LIWC 2300 words: https://liwc.wpengine.com
  2. General Inquirer 4206 words: http://www.wjh.harvard.edu/~inquirer
  3. NRC-Canada: http://www.saifmohammad.com/WebPages/Abstracts/NRC-SentimentAnalysis.htm
  4. Bing Liu’s lexicon 6786 words: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
  5. MPQA 8000 words: http://mpqa.cs.pitt.edu
  6. Bootstrapping sentiment polarity lexicons from a small set of seed terms (positive and negative words and phrases). The dominant approach is that of Turney. Paper: http://www.aclweb.org/anthology/P02-1053.pdf
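
To make the bootstrapping idea in the last item concrete, here is a minimal sketch of a semantic orientation score in the spirit of Turney's PMI-based method. A phrase counts as positive when it co-occurs with a positive seed word more often than with a negative one, relative to how frequent the seeds themselves are. The function name, its parameters, and the smoothing constant are assumptions made for illustration, and the counts would have to come from your own corpus or search index.

```python
import math


def semantic_orientation(near_pos: float, near_neg: float,
                         hits_pos: float, hits_neg: float,
                         smoothing: float = 0.01) -> float:
    """Score a phrase from co-occurrence counts with two seed words.

    near_pos / near_neg: how often the phrase occurs near the positive /
    negative seed. hits_pos / hits_neg: how often each seed occurs overall.
    The smoothing term (an assumed value) avoids division by zero.
    """
    numerator = (near_pos + smoothing) * hits_neg
    denominator = (near_neg + smoothing) * hits_pos
    return math.log2(numerator / denominator)
```

A score above zero suggests positive polarity and a score below zero suggests negative polarity. Phrases with very low counts for both seeds give unreliable scores and are usually better skipped.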

Opinion finding

  1. Tool: http://mpqa.cs.pitt.edu/opinionfinder

Features for supervised learning

  1. NRC-Canada: http://www.saifmohammad.com/WebPages/Abstracts/NRC-SentimentAnalysis.htm

Getting Twitter data

  1. NLTK http://www.nltk.org/howto/twitter.html
  2. Twitter Scraper https://github.com/taspinar/twitterscraper

Benchmarks and other resources

  1. SemEval: http://alt.qcri.org/semeval2018