Reddit, in case you’re one of the few people who haven’t been there, is a link-sharing site with a very successful voting system for both links and comments. According to Alexa, Reddit is the 50th most popular website worldwide and is in the top 20 in the USA.
To abuse an Obi-Wan Kenobi quote: “You will never find a more wretched hive of scum and villainy.” So obviously Reddit comments are very interesting, to say the least.
I’ve been interested in using NLP to get a high-level view of what’s going on in Reddit comments. This post will introduce you to the Reddit API and PRAW (the Python Reddit API Wrapper), and show how fast you can analyze Reddit comments using just Python, Semantria, and the Excel plugin.
The Reddit API allows programmatic access to all the content, comments, administrative tasks, and voting results. They do obfuscate the actual voting numbers to minimize the effect of voting spam, but we’re not going to mess around with voting in this post anyway, so I mention this only to confuse you.
Since I like to work in Python, I’m using the Python Reddit API Wrapper. The Reddit API terms of service require certain limits on how often you call the service and such, and PRAW simply deals with all of that for me, letting me just focus on the content.
I want to do my analysis inside of Excel, so I put together a quick script to pull the top-level comments for a given submission.
Here’s the Pastebin for the script. You’ll need to install praw, and replace a few things inside of the script, but once you do that, it’s super simple to snag and analyze Reddit comments.
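If the Pastebin isn’t handy, here is a minimal sketch of what a script like this can look like. It is not the original script: it assumes a current PRAW release and its OAuth credentials (client_id, client_secret, user_agent), and the function names and CSV columns are made up for illustration.

```python
import csv


def fetch_top_level_comments(submission_url, client_id, client_secret, user_agent):
    """Return the top-level comments of one submission (needs network access).

    Assumes you have registered a Reddit app to get client_id and client_secret.
    """
    import praw  # third-party: pip install praw

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    submission = reddit.submission(url=submission_url)
    # Drop the "load more comments" placeholders rather than expanding them.
    # limit=None would expand them all, at the cost of many more API requests.
    submission.comments.replace_more(limit=0)
    # Iterating the comment forest directly yields only top-level comments.
    return list(submission.comments)


def write_comments_csv(comments, path="comments.csv"):
    """Write one row per comment and return the number of rows written.

    Works with any objects exposing .id, .score and .body attributes.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "score", "body"])  # illustrative column choice
        for comment in comments:
            writer.writerow([comment.id, comment.score, comment.body])
            count += 1
    return count
```

Call fetch_top_level_comments with your own credentials and the submission URL, then hand the result to write_comments_csv to get a file that Excel can open directly.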
The script dumps comments into a file called “comments.csv”. The Reddit comment thread I was most interested in grokking was this one over on AskReddit: “What do you hope happens in the next 5 years?”
So, to get the quickest view of “what’s going on” in the comments, I took the then 3k top-level comment replies and ran them through the Excel plugin in “Discovery Mode”, and here’s what I found out…
Yeah, Comcast is pretty unpopular right now. A lot of negativity towards the USA. Canada (unsurprisingly) is the only country with just positive and neutral content.
It’s interesting how many of these themes are directly related to environmental issues. I also found the theme “sex marriage” to be funny (obviously coming from same-sex marriage).
Facets and attributes give a different view from themes, showing what people are saying about common subjects. So, from the chart above we can see that people want to get a good job, switch to renewable energy, and have a driverless car toting them around to meet with alien life.
I can certainly extract a lot more detail if I care to, by using other tools inside of Excel or by pulling the content into a BI tool like Tableau and doing more analysis there. But the entire point of this post is to show how easy it is to analyze Reddit comments. In just a step or two you can go from a ream of comments on a board to a really useful summary of what’s happening.
Now if you’ll excuse me, I have to go re-install SETI@Home. Gotta find those aliens.