Monday, November 14, 2011

Visualization of Social Network behind #OccupyWallStreet Twitter Hashtag

The following visualization was created using Microsoft NodeXL and the 'Group in a Box' method to show clusters within the #OccupyWallStreet hashtag social network.

Nodes are sized according to their in-degree, the number of times other users have mentioned that user in a tweet. The image for each node is that user's actual Twitter profile image.
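
In-degree is easy to compute directly from the edge list produced by the script in the post below. Here is a minimal sketch, assuming the edge list has been saved to a file of 'tweeter mentioned-user' pairs (the filename edges.txt is hypothetical):

# Count in-degree from an edge list of "tweeter mentioned_user" pairs.
# 'edges.txt' is a hypothetical filename; note that the self-loops
# recorded for mention-less tweets are counted here as well.
from collections import Counter

in_degree = Counter()
for line in open('edges.txt'):
    source, target = line.split()
    in_degree[target] += 1  # each mention increments the target's in-degree

for user, count in in_degree.most_common(10):
    print user, count  # the ten most-mentioned users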

This image contrasts with a similar one created by Marc Smith. I'm unsure whether the difference is because his data is from 10/8/2011 or because I'm still a bit new to the Group in a Box feature of NodeXL. I don't see many other ways to cluster the nodes, so at this moment I'm inclined to say the difference in network structure reflects a shift in the network's composition, driven by the many events surrounding the 'Occupy' movement over the past month.

The data for this visualization was captured over a 22-hour period, from the evening of 11/12/2011 to the afternoon of 11/13/2011. Previous blog posts show how to use Python and MongoDB to store and parse this data.

For comparison purposes, the image below is the same network without the Group in a Box clustering method applied:

What to do with my Twitter data once it is in my MongoDB?

My previous blog post showed how to use Python, pycurl, pymongo, MongoDB and the Twitter Streaming API to import all tweets containing a certain hashtag into our database. Once we have all of that data, how can we parse it so we can use it effectively? My last example collected the entire tweet.

Tweets, though limited to only 140 characters, are actually large when you observe the entire JSON object. (Recall the API returns each tweet as a JSON object.) An example tweet shows the large JSON structure. There is a lot of information in a tweet, so capturing the entire thing is worthwhile, especially since it costs only a few kilobytes of storage per tweet. However, we'll need to parse each tweet to analyze the structure of our dataset. I won't get into the specifics of JSON or the entire Twitter JSON object, but you will need a general understanding of JSON to fully follow the example shown below.
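
To make this concrete, here is a heavily trimmed sketch of a tweet object showing only the fields used below; the values are invented and the vast majority of the real fields are omitted:

{
    "text": "We are the 99% #occupywallstreet",
    "user": {
        "screen_name": "some_tweeter",
        "profile_image_url_https": "https://example.com/avatar.png"
    },
    "entities": {
        "user_mentions": [
            {"screen_name": "some_mentioned_user"}
        ]
    }
}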

So, let's say we want to map the social network at play in a Twitter database. We would want to extract the screen name of the tweeter and the screen names of whatever other users they mention. We can query the database for certain fields of each tweet: the entities.user_mentions array (each element of which has a screen_name) and the user.screen_name string. We'll loop through all of our tweets and print out a list of edges that together form a social graph. In this example, if a user does not mention anyone, I still capture the tweet and record the link as a self-loop, in order to capture the out-degree (for network analysis reasons) of each tweeter.

So, the sample code would be:
from pymongo import Connection

# Connect to the local MongoDB instance and select the occupywallstreet database.
connection = Connection()
db = connection.occupywallstreet
print db.posts.count()

# Print one edge per mention; a tweet with no mentions becomes a
# self-loop so the tweeter's out-degree is still captured.
for post in db.posts.find({}, {'entities.user_mentions.screen_name':1, 'user.screen_name':1}).sort('user.screen_name', 1):
    if len(post['entities']['user_mentions']) == 0:
        print post['user']['screen_name'], post['user']['screen_name']
    else:
        for sname in post['entities']['user_mentions']:
            print post['user']['screen_name'], sname['screen_name']

# Print each user's profile image URL once; the sort groups each user's
# tweets together, so remembering the last screen name seen is enough to dedupe.
last_seen = ""
for post in db.posts.find({}, {'user.profile_image_url_https':1, 'user.screen_name':1}).sort('user.screen_name', 1):
    if last_seen == post['user']['screen_name']:
        continue
    print post['user']['screen_name'], post['user']['profile_image_url_https']
    last_seen = post['user']['screen_name']

It's pretty straightforward. We connect to the database and perform a query that returns only the screen names of the users a tweeter mentions, plus the screen name of the tweeter himself. This is accomplished with the following line:


for post in db.posts.find({}, {'entities.user_mentions.screen_name':1, 'user.screen_name':1}).sort('user.screen_name', 1):

The .sort('user.screen_name', 1) call sorts the output so that each user's activity is grouped together in order.

The last loop gives me the profile image of each Twitter user. My end goal is to visualize this network in NodeXL, and I want to use each user's profile image as the shape of their node. So I iterate over the sorted tweets and capture the profile_image_url_https value once per user with the following block of code:

last_seen = ""
for post in db.posts.find({}, {'user.profile_image_url_https':1, 'user.screen_name':1}).sort('user.screen_name', 1):
    if last_seen == post['user']['screen_name']:
        continue
    print post['user']['screen_name'], post['user']['profile_image_url_https']
    last_seen = post['user']['screen_name']

When it is all said and done, I have all the edges of my network along with the profile image URL for every user in the database who tweets.
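
Rather than copying the printed output by hand, the same loops can write a CSV file that NodeXL can import. A minimal sketch, assuming a hypothetical output file edges.csv and NodeXL's usual 'Vertex 1'/'Vertex 2' edge columns (check your NodeXL version for the exact format it expects):

import csv
from pymongo import Connection

db = Connection().occupywallstreet

# Write one "Vertex 1, Vertex 2" row per mention; tweets with no
# mentions become self-loops, matching the printed edge list above.
with open('edges.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['Vertex 1', 'Vertex 2'])
    query = db.posts.find({}, {'entities.user_mentions.screen_name':1, 'user.screen_name':1})
    for post in query:
        tweeter = post['user']['screen_name']
        mentions = post['entities']['user_mentions']
        if not mentions:
            writer.writerow([tweeter, tweeter])
        else:
            for m in mentions:
                writer.writerow([tweeter, m['screen_name']])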

Up next I'll share some visualizations I created with data I gathered using these methods.

Saturday, November 12, 2011

How to use Twitter's Filtered Streaming API, Python and MongoDB

When I started doing this, I couldn't find anywhere on the Internet the entire solution to the following problem:

"Track a hashtag following in Twitter and place into a MongoDB via Python".

For example, I needed to grab all Tweets that had the #occupywallstreet hashtag in them and place them in a Mongo Database using Python.

Why MongoDB?  It's easy, efficient and perfect for storing and querying a large number of documents.  When the documents are tweets encoded as JSON, it's even easier.

Why Python?  I had never used Python before but found nice, simple Twitter and MongoDB libraries that make this EASY.

So, to get to the meat of the problem, here is the code:

import pycurl, json
from pymongo import Connection

STREAM_URL = "https://stream.twitter.com/1/statuses/filter.json"
WORDS = "track=#occupywallstreet"
USER = "myuser"
PASS = "mypass"

# Callback invoked by pycurl for each chunk the stream sends; non-JSON
# noise (blank keep-alive lines, partial data) is simply skipped.
def on_tweet(data):
    try:
        tweet = json.loads(data)
        db.posts.insert(tweet)
        print tweet
    except:
        return

# Connect to the local MongoDB; the occupywallstreet database is
# created automatically on the first insert if it doesn't exist.
connection = Connection()
db = connection.occupywallstreet

# Configure pycurl to POST the track filter and stream the results,
# handing each chunk of the response to on_tweet.
conn = pycurl.Curl()
conn.setopt(pycurl.POST, 1)
conn.setopt(pycurl.POSTFIELDS, WORDS)
conn.setopt(pycurl.HTTPHEADER, ["Connection: keep-alive", "Keep-Alive: 3000"])
conn.setopt(pycurl.USERPWD, "%s:%s" % (USER, PASS))
conn.setopt(pycurl.URL, STREAM_URL)
conn.setopt(pycurl.WRITEFUNCTION, on_tweet)
conn.perform()

We're relying on Twitter's Streaming API to return our tweets.  The options we are passing to pycurl produce the same effect as running the following command at the command prompt:

"curl -d track=#occupywallstreet http://stream.twitter.com/1/statuses/filter.json -umyuser:mypass"

The line:
db = connection.occupywallstreet
is where we select the Mongo database. This requires that MongoDB is up and running; the occupywallstreet database itself will be created automatically on the first insert. The command:
db.posts.insert(tweet)
places the JSON object into the database. You can then query and search for tweets using MongoDB queries. Please see Querying - MongoDB for more information on how to query the database, and MongoDB for general MongoDB information.
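
For example, once tweets are flowing in, a few lines in the Python shell give a quick feel for the data. A small sketch (the screen name is made up):

from pymongo import Connection
db = Connection().occupywallstreet

# Total number of tweets captured so far.
print db.posts.count()

# The text of every tweet from one (hypothetical) user.
for post in db.posts.find({'user.screen_name': 'some_user'}):
    print post['text']

# How many tweets mention that user.
print db.posts.find({'entities.user_mentions.screen_name': 'some_user'}).count()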

You have to install the pycurl and pymongo packages for Python. There are various ways to do this; I used 'easy_install' (e.g., 'easy_install pycurl' and 'easy_install pymongo') to download and install them with essentially no effort.

A key point to making this code run without fault is found in the function on_tweet. Looking at the callback function, we have to make our code resilient to the possible noise that can come back from Twitter. If you've ever run 'curl' from the command line against this endpoint, you will occasionally see the API return blank lines. We need to account for these blank lines and any other non-JSON values the API might return.
def on_tweet(data):
    try:
        tweet = json.loads(data)
        db.posts.insert(tweet)
        print tweet
    except:
        return

I print out all tweets just so I can verify the program continues to run. I don't read them all, but if I stop seeing tweets streaming across my terminal, I know something went wrong.
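
One caveat worth noting: conn.perform() returns (or raises) if Twitter drops the connection, at which point the script stops collecting. A simple way to keep it alive, sketched here rather than taken from the original code, is to wrap the call in a retry loop:

import time

# Keep the stream alive: if the connection drops or errors out,
# back off briefly and reconnect instead of exiting.
while True:
    try:
        conn.perform()
    except pycurl.error:
        pass  # network hiccup; fall through and retry
    time.sleep(10)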

And thus, in just a couple dozen lines of Python, we have a nice program that stores every tweet containing the #occupywallstreet hashtag in a Mongo database.