I was working on my final year project (BlueIgnis), which uses the Twitter Streaming API. I came away with the following understandings (on the free version):
- One Account can open only One Streaming Connection at any given time
- One IP may be associated with only one Account while streaming. Rotating streaming connections across multiple accounts is not allowed and may lead to an IP ban. (All the more reason to use EC2 Instances for Streaming :P)
- One Streaming connection may allow up to 400 tracks (different keywords) to filter on.
- Respond to 402 error codes with the proper HTTP status.
- Use non-aggressive re-connect policies; leave a substantial amount of time between subsequent connection attempts (a reconnect sketch follows this list).
- Periodically stop the Streaming connection, add more tracks (keywords) to the list, and restart the connection, rather than opening individual connections multiple times.
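Here is a rough sketch of the kind of non-aggressive reconnect loop I have in mind, assuming Python with the requests and requests_oauthlib libraries; the credentials, the `stream_with_backoff` name, and the backoff numbers are placeholders of my own, not values Twitter prescribes:

```python
import time
import requests
from requests_oauthlib import OAuth1  # OAuth 1.0a signing for Twitter endpoints

# Placeholder credentials -- substitute your own app/user tokens.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json"

def stream_with_backoff(tracks, max_backoff=320):
    """Open a single filtered stream and reconnect non-aggressively on failure."""
    backoff = 5  # seconds; doubled after each failed attempt, capped at max_backoff
    while True:
        try:
            resp = requests.post(
                STREAM_URL,
                auth=auth,
                data={"track": ",".join(tracks)},
                stream=True,
                timeout=90,  # treat prolonged silence on the connection as a stall
            )
            if resp.status_code == 200:
                backoff = 5  # healthy connection: reset the backoff window
                for line in resp.iter_lines():
                    if line:  # skip keep-alive newlines
                        yield line
            else:
                # Error / rate-limit responses: wait before trying again
                print("HTTP", resp.status_code, "- backing off", backoff, "s")
        except requests.RequestException as exc:
            print("Network error:", exc, "- backing off", backoff, "s")
        time.sleep(backoff)
        backoff = min(backoff * 2, max_backoff)
```

A consumer would simply iterate over `stream_with_backoff(["keyword1", "keyword2"])`; reconnection and waiting are hidden inside the generator, so there is only ever one connection per account at a time.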
Based on these understandings, I came up with my own architecture for Twitter Streaming. The diagram below represents the overall architecture of my application with respect to the Twitter Streaming component.
Hosting the Twitter Streaming component on an EC2 instance, we can achieve 400 tracks (keywords) per node, which can handle approximately 30-50 customers based on my use case. I periodically (~every 10 minutes) check whether there are any new tracks that need to be added to the node, until it reaches the 400-track limit. However, I also need to know which user requested each track, and that is not possible to get from the way the Streaming API currently works.
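As an illustration, the periodic check could look roughly like this; it is only a sketch, and `fetch_pending_tracks` and the `restart_stream` event are hypothetical helpers standing in for my actual scheduling code:

```python
import threading
import time

MAX_TRACKS_PER_NODE = 400   # per-connection limit on the free Streaming API
REFRESH_INTERVAL = 10 * 60  # ~10 minutes between checks for newly requested tracks

current_tracks = set()              # tracks this node is currently streaming
restart_stream = threading.Event()  # tells the streaming thread to reconnect

def fetch_pending_tracks():
    """Hypothetical helper: keywords customers have requested since the last check."""
    return []  # e.g. a SELECT against the application's datastore

def refresh_tracks_loop():
    """Merge new tracks into the running set until the node hits its 400-track cap."""
    while True:
        time.sleep(REFRESH_INTERVAL)
        pending = set(fetch_pending_tracks()) - current_tracks
        room = MAX_TRACKS_PER_NODE - len(current_tracks)
        if pending and room > 0:
            current_tracks.update(list(pending)[:room])
            restart_stream.set()  # one reconnect with the enlarged track list
```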
So I decided to build a local Firehose, wherein I stack all the tweets for all the tracks in a single location. Then I use the full-text search feature of MySQL (my datastore) to continuously search for the related tweets, so that I get the feel of streaming: a little delayed, yet close-to-real-time processing.
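To make the local Firehose idea concrete, here is a sketch of the kind of table and full-text query I mean, using Python with the pymysql driver; the `firehose` table name, its columns, and the connection details are illustrative placeholders, not my exact schema:

```python
import pymysql  # assumed MySQL driver; any DB-API connector would work

# Hypothetical "firehose" table: every tweet from every track lands here.
# MyISAM is used because FULLTEXT indexes on InnoDB require MySQL 5.6+.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS firehose (
    tweet_id    BIGINT PRIMARY KEY,
    tweet_text  TEXT NOT NULL,
    created_at  DATETIME NOT NULL,
    FULLTEXT KEY ft_text (tweet_text)
) ENGINE=MyISAM
"""

# Re-run a full-text match per customer track to get near-real-time results.
MATCH_QUERY = """
SELECT tweet_id, tweet_text, created_at
FROM firehose
WHERE MATCH(tweet_text) AGAINST (%s IN BOOLEAN MODE)
  AND created_at > %s
ORDER BY created_at
"""

def tweets_for_track(conn, track, since):
    """Return tweets stored after `since` that match a customer's track keyword."""
    with conn.cursor() as cur:
        cur.execute(MATCH_QUERY, (track, since))
        return cur.fetchall()

# Placeholder connection settings.
conn = pymysql.connect(host="localhost", user="blueignis", password="********",
                       database="blueignis", charset="utf8mb4")
with conn.cursor() as cur:
    cur.execute(CREATE_TABLE)
```

Each customer's tracks are just queries against this one table, which is what lets a single streaming connection serve many users.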
If you have any better ways to get things done, please let me know.