Saturday, November 12, 2011

How to use Twitter's Filtered Streaming API, Python and MongoDB

When I started on this, I couldn't find a complete solution anywhere on the Internet to the following problem:

"Track a hashtag on Twitter and store the matching Tweets in MongoDB using Python".

For example, I needed to grab all Tweets that had the #occupywallstreet hashtag in them and place them in a Mongo Database using Python.

Why MongoDB?  It's easy, efficient and perfect for storing/performing queries on a large number of documents.  When the documents are Tweets encoded as JSON documents, it's even easier.

Why Python?  I had never used Python before, but I found nice and simple Twitter and MongoDB plugins that make this EASY.

So, to get to the meat of the problem, here is the code:

import json
import pycurl
from pymongo import Connection

STREAM_URL = "https://stream.twitter.com/1/statuses/filter.json"
WORDS = "track=#occupywallstreet"  # POST body: the hashtag to track
USER = "myuser"
PASS = "mypass"

def on_tweet(data):
    try:
        tweet = json.loads(data)
        db.posts.insert(tweet)
        print tweet
    except:
        return

connection = Connection()  # connects to MongoDB on localhost
db = connection.occupywallstreet
conn = pycurl.Curl()
conn.setopt(pycurl.POST, 1)
conn.setopt(pycurl.POSTFIELDS, WORDS)
conn.setopt(pycurl.HTTPHEADER, ["Connection: keep-alive", "Keep-Alive: 3000"])
conn.setopt(pycurl.USERPWD, "%s:%s" % (USER, PASS))
conn.setopt(pycurl.URL, STREAM_URL)
conn.setopt(pycurl.WRITEFUNCTION, on_tweet)  # called with each chunk received
conn.perform()  # blocks while the stream stays open

We're relying on Twitter's Streaming API to return our Tweets.  The options we pass to pycurl have the same effect as running the following command at the command prompt:

"curl -d track=#occupywallstreet https://stream.twitter.com/1/statuses/filter.json -umyuser:mypass"
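
The POST body held in WORDS can also be built in code instead of typed by hand. The following is only a minimal sketch: it is written for Python 3 (the listing above is Python 2), the helper name build_track_body and the second hashtag #ows are illustrative assumptions rather than part of the original program, and it relies on the track parameter accepting a comma-separated list of terms. Form-encoding turns "#" into "%23" and "," into "%2C".

```python
from urllib.parse import urlencode


def build_track_body(terms):
    """Return a form-encoded POST body for the track parameter.

    build_track_body is a hypothetical helper, not part of the original post.
    """
    # Join the terms with commas, then percent-encode the whole value.
    return urlencode({"track": ",".join(terms)})


if __name__ == "__main__":
    # One hashtag, as in the post.
    print(build_track_body(["#occupywallstreet"]))
    # Two hashtags; "#ows" is an example value chosen for illustration.
    print(build_track_body(["#occupywallstreet", "#ows"]))
```

The returned string could then be passed to pycurl.POSTFIELDS in place of the hard-coded WORDS value.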

The line:
db = connection.occupywallstreet
selects the occupywallstreet database on the connection opened by Connection(). This requires that MongoDB is up and running; the database itself does not have to be created beforehand, because MongoDB creates it the first time a document is inserted. The command:
db.posts.insert(tweet)
places the JSON object into the posts collection of that database. You can then query and search for tweets using MongoDB queries. Please see "Querying" in the MongoDB documentation for more information on how to query the database, and the MongoDB site for general MongoDB information.
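
The listing uses the PyMongo API as it was at the time of writing. In current PyMongo releases Connection has been replaced by MongoClient and insert by insert_one, so a present-day version of the store-and-query step might look roughly like the sketch below. It is Python 3, it assumes a MongoDB server on localhost:27017, and the helper names (open_posts, store_tweet, text_filter, find_tweets) are made up for illustration. The query targets the text field, which holds the body of a tweet in Twitter's JSON.

```python
import re


def open_posts(host="localhost", port=27017):
    """Return the posts collection of the occupywallstreet database."""
    # Imported here so the helpers below can be used without PyMongo installed.
    from pymongo import MongoClient
    return MongoClient(host, port).occupywallstreet.posts


def store_tweet(posts, tweet):
    """Insert one decoded tweet (a dict) and return its generated _id."""
    return posts.insert_one(tweet).inserted_id


def text_filter(keyword):
    """Build a filter matching tweets whose text contains keyword, ignoring case."""
    return {"text": {"$regex": re.escape(keyword), "$options": "i"}}


def find_tweets(posts, keyword, limit=10):
    """Return up to `limit` stored tweets whose text mentions keyword."""
    return list(posts.find(text_filter(keyword)).limit(limit))
```

With a server running, posts = open_posts() followed by find_tweets(posts, "police") would return matching documents as plain dictionaries.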

You have to install the pycurl and pymongo packages for Python. There are various ways to do this. I used 'easy_install' to download and install them with essentially no effort.

A key to making this code run reliably is found in the function on_tweet. The callback has to be resilient to the noise that can come back from Twitter. If you've ever run 'curl' from the command line, you will have seen the API occasionally return blank lines. We need to account for these blank lines and any other non-JSON values the API might return.
def on_tweet(data):
    try:
        tweet = json.loads(data)
        db.posts.insert(tweet)
        print tweet
    except:
        return
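
One further source of noise is worth guarding against: pycurl hands the write callback whatever bytes have arrived so far, so a single call may contain half a tweet or several tweets. The sketch below shows one way to handle that by buffering the data and splitting it on the "\r\n" delimiter that separates messages in the stream. It is written for Python 3, and TweetBuffer and its handler argument are hypothetical names, not something from pycurl or the original program.

```python
import json


class TweetBuffer(object):
    """Collect raw chunks and pass each complete JSON message to handler."""

    def __init__(self, handler):
        self.handler = handler
        self.pending = b""  # bytes received that do not yet end in a delimiter

    def feed(self, data):
        """Accept one chunk of bytes, as a pycurl WRITEFUNCTION callback would."""
        self.pending += data
        lines = self.pending.split(b"\r\n")
        # The last piece is incomplete (or empty); keep it for the next chunk.
        self.pending = lines.pop()
        for line in lines:
            if not line.strip():
                continue  # keep-alive blank line
            try:
                message = json.loads(line.decode("utf-8"))
            except ValueError:
                continue  # not valid JSON, skip it
            if isinstance(message, dict):
                self.handler(message)
```

An instance's feed method could be given to pycurl.WRITEFUNCTION in place of on_tweet, with the handler doing the database insert.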

I print out all tweets just so I can verify the program continues to run. I don't follow the tweets but if I fail to see tweets streaming across my terminal I know something went wrong.
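
If a running total is enough to confirm that the program is alive, the print could be swapped for a counter. This is just a sketch of that idea in Python 3; TweetCounter and its report_every and out parameters are hypothetical names introduced here for illustration.

```python
import sys


class TweetCounter(object):
    """Count stored tweets and rewrite a single status line on the terminal."""

    def __init__(self, report_every=1, out=sys.stdout):
        self.count = 0
        self.report_every = report_every  # report once per this many tweets
        self.out = out

    def __call__(self, tweet):
        self.count += 1
        if self.count % self.report_every == 0:
            # "\r" moves back to the start of the line so the number updates in place.
            self.out.write("\rtweets stored: %d" % self.count)
            self.out.flush()
        return self.count
```

Calling an instance once after each successful insert keeps the terminal quiet while still showing that tweets are arriving.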

And thus, in just 27 lines of Python, we have a nice program that stores all tweets containing the #occupywallstreet hashtag in a Mongo Database.

5 comments:

  1. Thanks for sharing. This is very interesting and close to what I am trying to do.

    Do you think passing a regular expression via WORD would work? (after an "import re" in the header of course)

  2. What do you mean by WORD? Are you referring to my WORDS variable in the beginning? If so I'm not sure that will work. The 'track=#occupywallstreet' is telling the Twitter API to look for those hashtags. If you want just general words you could use regular expressions for all tweets that come across the spritzer feed, but that would limit the tweets AFTER you get them. You would also need to use the sample method instead of filter.

  3. Hi, thanks for the code above. I am using it to collect tweets for social network research. I'm quite new to Python and programming, so I can find it all a bit hard at times. Rather than printing the tweets, is there a way of counting them instead, so that the number goes up each time a new one is added? Thanks

  4. Hi Gramsky. Thanks for the code. It is very accessible, even for a total beginner like myself.

    A question: If, instead of tracking for WORDS, one wanted to track all tweets from a particular city (say, London) how would one go about it? What would you take out and what would you introduce in the existing bit of code?

    Thanks,
    FW

  5. This comment has been removed by the author.
