Twitter Mining

Why Twitter?

  1. Twitter’s openness to public consumption, broad popularity and well-documented API make it an ideal starting point for social web mining.
  2. Tweets are easy to write, so many of them reflect events in near real time, which lets us quickly glean useful information about important events as they happen.
  3. Whenever we collect data for analysis, it is important to avoid a biased sample, and Twitter does well in this respect. Its following mechanism links people in a variety of ways, ranging from short conversational dialogues to interest graphs that connect people with the topics they care about, and together these represent a broad cross-section of society at an international level.
  4. Sites like Facebook and LinkedIn require both users to accept a connection, while Twitter’s following mechanism lets you keep up with any user or topic without their approval. In graph terms, Facebook and LinkedIn form undirected social graphs, while Twitter forms a directed one.
  5. The 140-character limit means each tweet carries concise information, which makes it easier to infer the context of words during tasks such as sentiment analysis than it is with more verbose forms of social media like blogs.
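
To make the directed vs undirected distinction in point 4 concrete, here is a minimal sketch in Python. The user names and the is_mutual helper are made up for illustration and are not part of any Twitter API.

```python
from collections import defaultdict

# Hypothetical follow relationships as (follower, followed) pairs.
follows = [("alice", "bob"), ("bob", "alice"), ("alice", "nasa")]

following = defaultdict(set)  # outgoing edges: who each user follows
followers = defaultdict(set)  # incoming edges: who follows each user
for src, dst in follows:
    following[src].add(dst)
    followers[dst].add(src)


def is_mutual(a, b):
    """Return True only when both users follow each other."""
    return b in following[a] and a in following[b]
```

In a directed graph, alice can follow nasa without nasa following back, so the edge exists in one direction only. A Facebook-style friendship would correspond to the case where is_mutual is always true.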

Interest Graph

The interest graph is a way of modelling connections between people and their interests. It lets us measure correlations between things in order to make intelligent recommendations, such as who to follow or what to purchase.
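
As a rough illustration of the idea, the sketch below stores each person's interests as a set and recommends the interests of the most similar other person. The data, the choice of Jaccard similarity and the recommend function are all my own assumptions for the example, not how Twitter actually builds recommendations.

```python
# Hypothetical people and interests, made up for illustration.
interests = {
    "alice": {"python", "nba", "coffee"},
    "bob": {"python", "coffee", "jazz"},
    "carol": {"nba", "gardening"},
}


def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, interests):
    """Suggest interests held by the most similar other user but not by `user`."""
    mine = interests[user]
    others = (u for u in interests if u != user)
    best = max(others, key=lambda u: jaccard(mine, interests[u]))
    return sorted(interests[best] - mine)
```

Calling recommend("alice", interests) picks bob as the closest match, because they share two of four distinct interests, and suggests the interest alice does not have yet.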

What exactly is Twitter and what is a tweet?

Twitter is essentially a real-time, highly social microblogging service that lets users post short status updates, called tweets, that appear on timelines.

Tweets are limited to 140 characters of content. A tweet may also contain one or more entities and reference one or more places that map to locations in the real world.
To be more specific, tweets come bundled with two notable pieces of metadata: entities and places.

Entities: user mentions, hashtags, URLs and various media

Places: Locations in the real world that may be attached to a tweet. A place can be the actual location from which the tweet was sent or a reference to a place described in the tweet.
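
Here is a small sketch of how the two pieces of metadata might be read from a tweet once it has been loaded into a Python dictionary. The tweet below is hand-written, and the field names (entities, hashtags, urls, expanded_url, place, full_name) are assumed to follow the layout of Twitter's classic JSON responses, so check them against the API documentation before relying on them.

```python
# Hand-written example tweet; the field names are assumptions, not real API output.
tweet = {
    "text": "Learning #python at NUS https://example.com",
    "entities": {
        "hashtags": [{"text": "python"}],
        "urls": [{"expanded_url": "https://example.com"}],
    },
    "place": {"full_name": "Singapore"},
}


def summarize(tweet):
    """Pull hashtags, URLs and the place name out of a tweet dictionary."""
    entities = tweet.get("entities", {})
    hashtags = [h["text"] for h in entities.get("hashtags", [])]
    urls = [u["expanded_url"] for u in entities.get("urls", [])]
    # A tweet without a location may carry None or omit the key entirely.
    place = (tweet.get("place") or {}).get("full_name")
    return {"hashtags": hashtags, "urls": urls, "place": place}
```

The function tolerates tweets with no entities or no place, since neither piece of metadata is guaranteed to be present.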

Timelines are chronologically sorted collections of tweets, and they display only tweets deemed noteworthy according to certain metrics (number of retweets, etc.).

When you log into Twitter, the home timeline is what you see; it displays tweets from the users you follow.

A user timeline, however, displays only the tweets from that particular user.
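
The difference between the two timelines can be sketched with a few in-memory tweets. Everything here (the tweets, the following table and both function names) is hypothetical, and the sketch ignores any ranking by noteworthiness and simply sorts newest first.

```python
# Hypothetical tweets; "time" is a simple increasing counter.
tweets = [
    {"user": "bob", "time": 1, "text": "hello"},
    {"user": "nasa", "time": 2, "text": "launch day"},
    {"user": "carol", "time": 3, "text": "gardening"},
    {"user": "bob", "time": 4, "text": "coffee"},
]
# Hypothetical follow table: alice follows bob and nasa.
following = {"alice": {"bob", "nasa"}}


def newest_first(items):
    """Sort tweets so the most recent one comes first."""
    return sorted(items, key=lambda t: t["time"], reverse=True)


def user_timeline(user):
    """Tweets written by one particular user."""
    return newest_first(t for t in tweets if t["user"] == user)


def home_timeline(user):
    """Tweets written by everyone the given user follows."""
    followed = following.get(user, set())
    return newest_first(t for t in tweets if t["user"] in followed)
```

Here home_timeline("alice") mixes tweets from bob and nasa, while user_timeline("bob") contains bob's tweets only.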

To find out what a particular user’s home timeline looks like, we can simply add a /following suffix to the URL!

Timelines are essentially collections of tweets with relatively low velocity, while streams are samples of public tweets flowing through Twitter in real time. The public firehose of all tweets has been known to peak at hundreds of thousands of tweets per minute during events of particularly wide interest.

Twitter provides access to a small random sample of the public timeline, with filtering, which gives API developers enough public data to build useful and powerful applications.
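
To show what sampling and filtering a stream means in practice, here is a sketch that uses Python generators over a plain list of strings. It does not talk to Twitter at all; sample_stream, track, the rate parameter and the seed are my own stand-ins for the idea.

```python
import random


def sample_stream(stream, rate, seed=0):
    """Yield each tweet with probability `rate`, mimicking a random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    for tweet in stream:
        if rng.random() < rate:
            yield tweet


def track(stream, keyword):
    """Yield only the tweets whose text mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    for tweet in stream:
        if needle in tweet.lower():
            yield tweet
```

Because both functions are generators, they can be chained, for example track(sample_stream(firehose, 0.01), "python"), and they process one tweet at a time without holding the whole stream in memory.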


Social Network Analysis

I have always been interested in SNA projects like Twitter sentiment analysis for customer profiling, or even the mapping of terrorist organizations for national security!

I am a big fan of O’Reilly books, and recently I came across Social Network Analysis for Startups by Maksim Tsvetovat & Alexander Kouznetsov. It is a short but rather information-dense book on SNA that uses Python and its various libraries to demonstrate how to gather, visualize and analyze social data.

The content is written in a rather accessible way, and if you have some familiarity with Python you will find it easy to read.


Getting Started

I am an engineering graduate who will be commencing his Master of Computing degree at NUS in Aug 2014.

I am a data science enthusiast and I will be chronicling my journey into data science!

Exposure to topics like neural networks, expert systems and artificial intelligence from my Chemical Engineering degree got me deeply interested in the field of data science and machine learning.

I have taken it upon myself to study up on these topics and to date I have covered the following courses:

  • Data Structures and Algorithms – UC Berkeley’s CS61B
  • Databases – Stanford Online
  • Machine Learning by Andrew Ng on Coursera
  • Several courses from the Johns Hopkins Data Science Specialization on Coursera
  • Introduction to Data Science by the University of Washington

and textbooks:

  • An Introduction to Statistical Learning
  • Programming Collective Intelligence
  • Machine Learning for Hackers
  • Data Mining with R

Things on my to-read list are:

  • Bayesian Reasoning and Machine Learning
  • The Elements of Statistical Learning
  • Natural Language Processing with Python
  • Pattern Recognition and Machine Learning
  • Convex Optimization

I have recently started working on Kaggle datasets too, so I will also be describing my approaches to these problems.