Building a Distributed Crawler using Akka
Rajesh Muppalla, Indix.
Building a Distributed Crawler using Akka
Rajesh Muppalla
- rajesh@indix.com
About Me
Developer @
Indix
Part of Platform Team
Working on Distributed Systems & Big Data
Ex-Thoughtworks
Worked on
Go
- A continuous delivery product
About Indix
Product Intelligence Platform
Founded - Jan 2012
35 Person team in Chennai & Seattle
Big Data Company
Some Stats
120 M Products
5 B Prices
1.5 TB data crawled daily
Data Pipeline @ Indix
Crawler - Requirements
Distributed
Polite
Focused
Efficient
Fault Tolerant
Extensible
Our Options
Existing Open Source
Nutch, Heritrix
Build Your Own
Work Distribution
Cluster Management
Storage
Algorithms
Our Choice - Akka
Why Akka?
Simpler
Concurrency
Program at a higher level
Don`t think
threads, shared state
Scalability
Distributed by Design
Fault Tolerance
So what`s the
secret sauce?
Actors
What is an Actor?
Unit of computation in Akka
Message passing concurrency
Purely reactive
a mailbox
behavior & state
scheduled to run when sent a message
Used since 1973
in Telecom with 9 nines availability
With a Diagram
Show me the code
Hello World
Parallelism
Scale by creating multiple actor instances
Order is not guaranteed
Supervision
Manage other Actor’’s failure
Termination messages sent to Supervisor
Separates processing & error handling
Multiple Strategies
One-For-One
All-For-One
Clustering
Released in Akka 2.2
Gossip based Cluster Membership
Failure Detector
Cluster DeathWatch
Adaptive Load Balancing
Back to the Crawler
Key Components
URL Frontiers
Site Queues
Priority Queues
URL Router ( 1 per frontier)
Http Fetchers
URL Scheduler
Patterns Used
Supervision Strategy
Work Pulling
Throttling
Periodic Message Scheduling
Lessons Learned
The Actor Model
Like OO - New way of thinking
Actor composition
Message design
Externalize business logic
No Blocking
Degraded performance due to starvation
Separate thread pool
Lessons Learned (Continued)
Running on EC2
Unreliable Network
Failure detector tuning
Message Tracing - Debugging
Unique identifier for all messages
Splunk to visualize the flow
Akka Survey
Future Akka Roadmap
Persistence
Journaling system
Replay messages on failure
Cluster
Optimization (for > 200 nodes)
Re-joining Unreachable Members
Resources
Akka Documentation
LetItCrash Blog
Coursera Course - Reactive Programming
Questions
Thanks
Extras
Map Reduce Example
Cluster - Advanced
Gossip Convergence
Leader Election
Fork me on Github