Twitter Analytics 1 – Creating a framework to fetch Twitter data

With this article, I am starting a small blog series called Twitter Analytics. My goal is to see whether it is possible to discover meaningful social media insights by applying data analytics to the social network Twitter.

This is the first part of the series where I talk about how fetching and storing data from Twitter works and why it is way harder than I thought.

Articles: Part 1 | Part 2 | Part 3

Twitter as a data source

How do you get a good data set of tweets? This was the first question that I had to ask myself at the beginning of this project. The solution seemed quite simple: use an already existing data set to dodge the hassle of fetching and cleaning up the data myself.

That was easier said than done. I spent a few days looking through different data sets and sources, and not a single one fit my needs. They were either outdated, only usable for one special kind of analysis, or missing the features I needed. So I decided to get my data right from the source by asking Twitter itself.

Creating a tweet downloader

Getting data from the Twitter API is not very straightforward. First of all, there are rate limits, which means that data can only be fetched in chunks. Furthermore, there is a distinction between retrieving recent tweets (from a week ago until today) and retrieving archived tweets (reaching back all the way to 2006). It was pretty clear to me that I did not want to stitch the different requests together manually, which is why I spent a week building a framework for downloading Twitter data.
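To give a feel for what this stitching involves, here is a minimal sketch of a paginated fetch against the v2 recent search endpoint (the full-archive endpoint at /2/tweets/search/all works the same way, with its own rate limits). The helper name fetch_pages and all parameter choices here are my own illustration, not the actual code of my framework:

```python
import time
import requests

RECENT_SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def fetch_pages(query: str, bearer_token: str, max_pages: int = 10):
    """Yield one page of tweets at a time, following pagination tokens."""
    headers = {"Authorization": f"Bearer {bearer_token}"}
    params = {
        "query": query,
        "max_results": 100,  # per-request maximum of the search endpoint
        "tweet.fields": "created_at,public_metrics",
    }
    pages = 0
    while pages < max_pages:
        response = requests.get(RECENT_SEARCH_URL, headers=headers, params=params)
        if response.status_code == 429:
            # Rate limit hit: wait out the 15-minute window, then retry.
            time.sleep(15 * 60)
            continue
        response.raise_for_status()
        payload = response.json()
        yield payload.get("data", [])
        pages += 1
        next_token = payload.get("meta", {}).get("next_token")
        if next_token is None:
            break  # no more pages for this query
        params["next_token"] = next_token
```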

The downloader automatically splits my queries into a set of requests and then fetches the data from Twitter step by step. I made sure to append the results of every single request to a CSV file immediately. That way, I could work with the already-fetched tweets even if the download ran into an error halfway through, and I could run my analytics code on a data set while it was still downloading. I would often check on a download at 5000 tweets to see whether the data set fit my needs and then decide whether to continue or stop it.
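Appending page by page is simple with Python's standard csv module. Here is a simplified sketch of the storage step, assuming the page format from the snippet above; the column selection is illustrative:

```python
import csv
import os

CSV_COLUMNS = ["id", "created_at", "text",
               "like_count", "retweet_count", "reply_count"]

def append_page_to_csv(page: list[dict], path: str) -> None:
    """Flatten one page of tweets and append it to the CSV right away,
    so an error halfway through never loses already-fetched data."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS)
        if write_header:
            writer.writeheader()
        for tweet in page:
            metrics = tweet.get("public_metrics", {})
            writer.writerow({
                "id": tweet["id"],
                "created_at": tweet.get("created_at", ""),
                "text": tweet["text"],
                "like_count": metrics.get("like_count", 0),
                "retweet_count": metrics.get("retweet_count", 0),
                "reply_count": metrics.get("reply_count", 0),
            })
```

This is the frontend that I created for the downloader: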

The big table at the bottom gives an overview of the data sets that I previously downloaded. This is quite handy since I can easily see the metadata of every data set and compare them. The keyword, the number of tweets in the data set, the timeframe, and the minimum engagement rate are the key metrics that determine the content and quality of a data set. I will talk more about them in the next part.

On the top left, you can see the UI for creating a new data set. It allows me to set the parameters “keyword”, “minimum engagement”, and “amount of tweets”.

  • The “keyword” parameter is pretty simple: it specifies that only tweets containing the keyword should be fetched. I use this parameter to set the topic that the tweets should be about.
  • The “minimum engagement” parameter sets the minimum combined number of likes, retweets, and replies a tweet must have to be stored in the data set. This is a key metric because it allows me to cut out the huge pool of tweets that nearly no one interacted with. Tweets with no or only very few interactions are not very interesting for my analysis, so fetching them is just unnecessary API and memory usage. Furthermore, by setting a high minimum engagement number, I can capture longer timeframes with fewer API calls.
  • Finally, the “amount of tweets” parameter tells my downloader when the data set is complete and it can stop issuing new requests (the sketch after this list shows how the three parameters fit together).
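Putting the three parameters together, the download loop looks roughly like the following simplified sketch, reusing fetch_pages and append_page_to_csv from above. One note: to actually save API calls, the engagement filter has to go into the search query itself (Twitter's search syntax offers operators like min_faves: and min_retweets: on some access tiers); the client-side filter shown here just illustrates the logic:

```python
def download_dataset(keyword: str, min_engagement: int, amount: int,
                     bearer_token: str, path: str) -> int:
    """Fetch tweets matching `keyword` until `amount` tweets that meet
    the engagement threshold have been stored in the CSV at `path`."""
    stored = 0
    for page in fetch_pages(keyword, bearer_token, max_pages=1000):
        kept = []
        for tweet in page:
            m = tweet.get("public_metrics", {})
            engagement = (m.get("like_count", 0)
                          + m.get("retweet_count", 0)
                          + m.get("reply_count", 0))
            if engagement >= min_engagement:
                kept.append(tweet)
        append_page_to_csv(kept, path)
        stored += len(kept)
        if stored >= amount:
            break  # the data set is complete, stop issuing requests
    return stored
```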

Furthermore, I implemented an analytics controller so that, once the data is downloaded, I can run different analyses to get further insights into the quality of the data that I fetched.
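As an example of such a check, the following simplified sketch (using pandas and the CSV columns from above) summarizes the engagement distribution of a data set, even while it is still downloading:

```python
import pandas as pd

def summarize_dataset(path: str) -> None:
    """Print quick quality metrics for a (possibly still growing) CSV."""
    df = pd.read_csv(path)
    engagement = df["like_count"] + df["retweet_count"] + df["reply_count"]
    print(f"tweets: {len(df)}")
    print(f"timeframe: {df['created_at'].min()} to {df['created_at'].max()}")
    print(engagement.describe())  # count, mean, quartiles of engagement
```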

Technical overview of my framework

Last but not least, here is a full overview of the framework that I have created. I am very much looking forward to using this framework in the future for some interesting Twitter analytics – so stay tuned for part 2!