
Twitter’s Language Mix: Assignment 1 (10 points)

(Due date TBA; 20% penalty per day for late submission.)

This assignment involves using Twitter’s streaming API, an off-the-shelf language identification tool, and data visualization. Some of the questions are open-ended: there is no single best answer, so do the best you can. Grading will not be strict.

1. Collect at least 10,000 tweets from the Twitter Streaming API, following the instructions in the Twitter API tutorial.
2. Then check:
  • Are all tweets tagged with a language ID by Twitter? If not, what percentage are?
  • How many different language tags does Twitter provide?
  • What percentage of tweets does each language account for?
3. Then install and run langid.py and check:
  • How many different languages does it tag?
  • In what percentage of tweets do langid.py and Twitter’s API agree, and in what percentage do they disagree?
  • On what kinds of tweets or languages do they disagree?
4. What about tweets from the US?
  • What percentage of tweets does each language account for?
  • What percentage of tweets are geotagged?
5. Draw some fancy plots:
  • For example, plot the language mix on Twitter, as in Figures 5 and 7 of this paper.
  • Matplotlib is a Python package for plotting; here is a simple guide to Matplotlib.
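
The counting in steps 2 to 4 could be sketched roughly as follows. This is only a starting point, not a required solution. It assumes the collected tweets are stored one JSON object per line, and that each tweet carries `text`, `lang`, and `place` fields (with `place` holding a `country_code`), as in Twitter's v1.1 tweet format; check your own data before relying on this. The function names and the `classify` parameter are hypothetical helpers introduced for illustration.

```python
import json
from collections import Counter


def load_tweets(lines):
    """Parse one JSON object per line, keeping only objects with a 'text' field.

    Assumption: the stream may also contain non-tweet messages (for example
    delete notices) and malformed lines, which are skipped here.
    """
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if isinstance(obj, dict) and "text" in obj:
            tweets.append(obj)
    return tweets


def language_shares(tweets, key="lang"):
    """Return {tag: percent of tweets}; tweets without the key are counted under None."""
    counts = Counter(t.get(key) for t in tweets)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100.0 * n / total for tag, n in counts.items()}


def agreement(tweets, classify):
    """Compare Twitter's tag with classify(text) for every tweet.

    Returns (percent agreeing, list of (twitter_tag, predicted_tag, text)
    for the tweets where the two tags differ).
    """
    agree = 0
    disagreements = []
    for t in tweets:
        predicted = classify(t["text"])
        if predicted == t.get("lang"):
            agree += 1
        else:
            disagreements.append((t.get("lang"), predicted, t["text"]))
    pct = 100.0 * agree / len(tweets) if tweets else 0.0
    return pct, disagreements


def us_tweets(tweets):
    """Keep tweets whose 'place' object has country_code 'US' (assumed field layout)."""
    return [t for t in tweets if (t.get("place") or {}).get("country_code") == "US"]
```

With langid.py installed, one way to supply `classify` is `lambda text: langid.classify(text)[0]`, since `langid.classify` returns a (language, score) pair. The shares returned by `language_shares` can then be passed to a Matplotlib bar chart for step 5.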

Pack your data and code into a zip file with a name of the form hw1_yourdotid.zip, and submit it through OSU’s Carmen system.