Monday, November 14, 2011

What to do with my Twitter data once it is in my MongoDB?





My previous post showed how to use Python, pycurl, pymongo, MongoDB and the Twitter Streaming API to import all tweets for a certain hashtag into a database. Once we have all of that data, how can we parse it so we can use it effectively? My last example stored each tweet in its entirety.

Tweets, though limited to 140 characters, are actually large when you look at the entire JSON object. (Recall that the API returns each tweet as a JSON object.) An example tweet shows the large JSON structure. There is a lot of information in a tweet, so capturing the whole thing is worthwhile, especially since it costs only a few kilobytes of storage per tweet. However, we'll need to parse each tweet to analyze the structure of our dataset. I won't get into the specifics of JSON or the entire Twitter JSON object, but you will need a general understanding of JSON to follow the example shown below.
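To make the nesting concrete, here is a heavily trimmed, made-up tweet containing only the fields this post relies on. The values are invented and a real tweet carries many more keys, so treat it as a sketch of the shape rather than a real API response.

```python
import json

# Invented, heavily trimmed tweet; real objects contain many more fields.
raw = """
{
  "text": "Marching with @bob and @carol today #ows",
  "user": {
    "screen_name": "alice",
    "profile_image_url_https": "https://example.com/alice.png"
  },
  "entities": {
    "user_mentions": [
      {"screen_name": "bob"},
      {"screen_name": "carol"}
    ]
  }
}
"""

tweet = json.loads(raw)

# Nested fields are reached by chaining keys.
author = tweet["user"]["screen_name"]
mentioned = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]

print(author, mentioned)
```

Documents read back through pymongo are already Python dictionaries, so the same key chaining applies without the json.loads step.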

So, let's say we want to map the social network at play in a Twitter database. We would want to extract the screen name of the tweeter and of whatever other users they mention. We can query the database for just those fields of each tweet: the entities.user_mentions array (each entry has a screen_name) and the user.screen_name string. We'll loop through all of our tweets and print out a list of edges that together form a social graph. In this example, if a user does not mention anyone, I still capture the tweet and show the link as a self-loop so that it counts toward the out-degree (for network analysis reasons) of each tweeter.
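The edge rule itself can be sketched on plain dictionaries, without a database. The function name tweet_edges and the sample documents are my own inventions for illustration. The sketch also skips documents that have no user field, since a stream capture can include non-tweet messages (deletion notices, for example), and indexing ['user'] on one of those raises a KeyError.

```python
def tweet_edges(tweet):
    """Return (tweeter, mentioned_user) pairs for one tweet document."""
    user = tweet.get("user") or {}
    source = user.get("screen_name")
    if source is None:
        return []  # not a tweet (e.g. a deletion notice), so no edges

    mentions = (tweet.get("entities") or {}).get("user_mentions") or []
    if not mentions:
        # Self-loop: the tweet still counts toward the tweeter's out-degree.
        return [(source, source)]
    return [(source, m["screen_name"]) for m in mentions]


# Invented sample documents, trimmed to the fields that matter here.
sample = [
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"},
                                    {"screen_name": "carol"}]}},
    {"user": {"screen_name": "bob"},
     "entities": {"user_mentions": []}},
    {"delete": {"status": {"id": 1}}},  # non-tweet message, skipped
]

edges = [edge for tweet in sample for edge in tweet_edges(tweet)]
for source, target in edges:
    print(source, target)
```

Returning a list per tweet keeps the rule easy to test on its own; the same function could be applied to each document coming back from a pymongo cursor.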

So, the sample code would be:
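# Written for Python 2 and pymongo 2.x; newer pymongo releases
# replace Connection with MongoClient.
from pymongo import Connection

# Connect to the local MongoDB server and select the tweet database.
connection = Connection()
db = connection.occupywallstreet
print db.posts.count()

# Edge list: one "tweeter mentioned_user" line per mention, sorted by tweeter.
for post in db.posts.find({}, {'entities.user_mentions.screen_name':1, 'user.screen_name':1}).sort('user.screen_name', 1):
    if len(post['entities']['user_mentions']) == 0:
        # No mentions: print a self-loop so the tweet still counts.
        print post['user']['screen_name'], post['user']['screen_name']
    else:
        for sname in post['entities']['user_mentions']:
            print post['user']['screen_name'], sname['screen_name']

# Profile images: one "tweeter image_url" line per distinct tweeter.
buffer = ""
for post in db.posts.find({}, {'user.profile_image_url_https':1, 'user.screen_name':1}).sort('user.screen_name', 1):
    if buffer == post['user']['screen_name']:
        continue  # same user as the previous row, already printed
    print post['user']['screen_name'], post['user']['profile_image_url_https']
    buffer = post['user']['screen_name']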

It's pretty straightforward. We connect to the database and perform a query that returns only the screen_names of the users a tweeter mentions and the screen_name of the tweeter themselves. This is accomplished with the following line:


for post in db.posts.find({}, {'entities.user_mentions.screen_name':1, 'user.screen_name':1}).sort('user.screen_name', 1):

The .sort('user.screen_name', 1) call sorts the output by screen name in ascending order, so all of the activity for each user appears together.
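As a rough in-memory picture of what that query hands back, the sketch below mimics the projection and the sort on a few invented documents. It does not call MongoDB at all, and project is a hypothetical helper of mine, not a pymongo function. The one MongoDB detail it reflects is that _id comes back by default alongside the requested fields.

```python
def project(doc):
    """Mimic the projection: keep _id plus the two requested fields."""
    return {
        "_id": doc["_id"],
        "user": {"screen_name": doc["user"]["screen_name"]},
        "entities": {
            "user_mentions": [
                {"screen_name": m["screen_name"]}
                for m in doc["entities"]["user_mentions"]
            ]
        },
    }


# Invented stored documents with extra fields that the projection drops.
docs = [
    {"_id": 1, "text": "hello", "user": {"screen_name": "carol", "lang": "en"},
     "entities": {"user_mentions": [], "hashtags": []}},
    {"_id": 2, "text": "hi @bob", "user": {"screen_name": "alice", "lang": "en"},
     "entities": {"user_mentions": [{"screen_name": "bob", "id": 7}],
                  "hashtags": []}},
    {"_id": 3, "text": "again", "user": {"screen_name": "alice", "lang": "en"},
     "entities": {"user_mentions": [], "hashtags": []}},
]

# Ascending sort on the tweeter's screen name, like .sort('user.screen_name', 1).
result = sorted((project(d) for d in docs),
                key=lambda d: d["user"]["screen_name"])

for doc in result:
    print(doc)
```

Both of alice's tweets end up next to each other, which is the property the rest of the code depends on.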

The last loop gives me the profile image of each Twitter user. My end goal is to visualize this network in NodeXL, and I will want to use each user's profile_image as the shape of their node. Thus I iterate over all tweets, skip repeats of a user I have already printed (the sort keeps each user's tweets adjacent), and capture the profile_image_url_https value for each user with the following block of code:

for post in db.posts.find({}, {'user.profile_image_url_https':1, 'user.screen_name':1}).sort('user.screen_name', 1):
    if buffer == post['user']['screen_name']:
        continue
    print post['user']['screen_name'], post['user']['profile_image_url_https']
    buffer = post['user']['screen_name']
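The buffer comparison only works because the results are sorted. As an alternative sketch, a dictionary keyed by screen name gives the same one-row-per-user result without depending on the order. The function name unique_profile_images and the sample posts are invented for illustration.

```python
def unique_profile_images(posts):
    """Map each screen name to the first profile image URL seen for it."""
    images = {}
    for post in posts:
        user = post.get("user") or {}
        name = user.get("screen_name")
        if name is None:
            continue  # skip documents that are not tweets
        # setdefault keeps the first URL and ignores later repeats.
        images.setdefault(name, user.get("profile_image_url_https"))
    return images


# Invented, deliberately unsorted sample posts.
posts = [
    {"user": {"screen_name": "bob",
              "profile_image_url_https": "https://example.com/bob.png"}},
    {"user": {"screen_name": "alice",
              "profile_image_url_https": "https://example.com/alice.png"}},
    {"user": {"screen_name": "bob",
              "profile_image_url_https": "https://example.com/bob-new.png"}},
    {"delete": {"status": {"id": 1}}},
]

images = unique_profile_images(posts)
for name in sorted(images):
    print(name, images[name])
```

The trade-off is memory: the dictionary holds every distinct user at once, while the buffer approach only remembers the previous row.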

When it is all said and done, I have all the edges of my network along with the URLs of the profile_image for each user in the database who tweets.
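As a small follow-on sketch, the printed edges can be collected into CSV text and the out-degree of each tweeter counted. The edges below are invented, and the source and target header names are placeholders I made up rather than anything NodeXL requires, so adjust them to whatever your import step expects.

```python
import csv
import io
from collections import Counter

# Invented edge list in the same "tweeter, mentioned user" form as above.
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "bob")]

# Out-degree: number of edges leaving each tweeter (self-loops included).
out_degree = Counter(source for source, _ in edges)

# Write the edges as CSV into an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["source", "target"])  # placeholder header names
writer.writerows(edges)
csv_text = buf.getvalue()

print(csv_text)
print(dict(out_degree))
```

Swapping io.StringIO for a real file opened with newline="" would put the same rows on disk.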

Up next I'll share some visualizations I created with data I gathered using these methods.

1 comment:

  1. Thank you for posting this sample code. It helped me a lot. After printing the screen_names stored in MongoDB I am getting a KeyError: 'user'. Can you please help me with this? I have to store the screen names separately in a list or dictionary, but this KeyError is not allowing me to do so.
    Note: I am new to the Python language.
    Any help would be appreciated!
