Twitter Big Data Mining
In this project we mined Twitter data in order to compare topic popularity on Twitter against Google Trends over the same timelines. The assumption is that people tweet about what they search for, so mining tweets over a given period should show behavior similar to Google search volume by web users for the same topic. In this project we learned how to work with a large Twitter database (~1000 GB of tweets) and focused on the first half of 2015.
Code and Dataset:
We built several python modules:
We built a Twitter parser that extracts the body of each tweet while removing hashtags and links. We cleaned the body of English stop words and punctuation using the well-known text-mining library NLTK.
We built a module that creates, from all the tweet bodies, text files of words separated by a space delimiter.
We wrote a module that builds a CSV database from these files for each half month, serving as a wordcount map for every word in the parsed files.
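The parsing and wordcount steps above can be sketched roughly as follows. This is a minimal stand-in, not the project's actual module: the project used NLTK's English stopword corpus, which is replaced here by a small inline set so the snippet is self-contained, and the regular expressions for links and hashtags are assumptions.

```python
import re
import string
from collections import Counter

# Small stand-in stopword set; the project used NLTK's English stopword corpus
# (nltk.corpus.stopwords.words("english")).
STOPWORDS = {"the", "a", "an", "and", "or", "in", "on", "to", "of", "is", "it", "i"}

def clean_tweet(body):
    """Strip links and hashtags from a tweet body, then remove punctuation
    and stop words, returning the remaining lowercase words."""
    body = re.sub(r"https?://\S+", " ", body)   # drop links
    body = re.sub(r"#\w+", " ", body)           # drop hashtags
    body = body.translate(str.maketrans("", "", string.punctuation))
    return [w for w in body.lower().split() if w not in STOPWORDS]

def wordcount(tweets):
    """Map each word to its count across all tweet bodies, like the
    per-half-month wordcount CSVs described above."""
    counts = Counter()
    for tweet in tweets:
        counts.update(clean_tweet(tweet))
    return counts

counts = wordcount(["Check this out http://t.co/xyz #wow",
                    "this is the best deal"])
```

Writing `counts` out as `word,count` rows then gives one CSV file per half month.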
At first, we wrote a JavaScript Node.js server that loaded the CSV files into memory and gave HTML web clients, via a REST API, the ability to interactively query the CSV files for the wordcounts of a given word. The server suffered from memory issues and could not run on a normal PC, so we turned to MongoDB.
We built a MongoDB database out of the mapped CSV files and wrote a Node.js server that communicates with it.
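The step of turning a half-month wordcount CSV into MongoDB documents might look like the sketch below. The field names (word, count, period) and the one-row-per-word CSV layout are assumptions for illustration, not the project's actual schema; the commented pymongo calls show how such documents would then be inserted and queried.

```python
import csv
import io

# Assumed CSV layout: one "word,count" row per line, one file per half month.
sample_csv = "obama,1520\nworldcup,87\nbitcoin,43\n"

def csv_to_documents(csv_text, period):
    """Turn one half-month wordcount CSV into MongoDB-style documents.
    Field names here are illustrative, not the project's real schema."""
    docs = []
    for word, count in csv.reader(io.StringIO(csv_text)):
        docs.append({"word": word, "count": int(count), "period": period})
    return docs

docs = csv_to_documents(sample_csv, "2015-01-a")

# With pymongo and a running MongoDB instance, one would then do e.g.:
#   db.wordcounts.insert_many(docs)
#   db.wordcounts.find({"word": "obama"}).sort("period")
```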
We researched top trending topics from different fields (politics, merchandise, sport, and more) and searched for topics that had data 'spikes' in the first half of 2015. We extracted the relevant data from Google Trends (exported to CSV) and queried MongoDB for that topic's data over the same period.
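One simple way to compare the two timelines is to rescale both series to the 0-100 range Google Trends uses and measure their correlation. The numbers below are illustrative only, and Pearson correlation is one possible choice of similarity measure, not necessarily the one used in the project's analysis.

```python
def normalize(series):
    """Scale a series so its peak is 100, like Google Trends' relative values."""
    top = max(series)
    return [100.0 * v / top for v in series]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative numbers: weekly tweet wordcounts vs Trends values for one topic,
# both spiking in the same week.
tweet_counts = [120, 340, 990, 400, 150]
trends = [10, 30, 100, 45, 12]
r = pearson(normalize(tweet_counts), normalize(trends))
```

A value of r near 1 indicates the tweet-volume spike lines up with the search-interest spike for that topic.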
Research results can be found in the following document.