Scaling your Data Platform using Scala
Rajesh Muppalla, Indix.
Scaling your Data Platform using Scala
Rajesh Muppalla
- rajesh@indix.com
About Me
Co-Founder @
Indix
Data Platform Lead
Focus Areas
Big Data
Distributed Systems
Ex-ThoughtWorks
Worked on
Go
- A continuous delivery product
About Indix
Product Intelligence Platform
Founded - Jan 2012
50-person team in Chennai & Seattle
Big Data Company
Some Stats
250 M Products
10 B Prices
3 TB data crawled daily
Tech Stack @ Indix
Data Pipeline @ Indix
Scala @ Indix
Primary Language for Data Platform & Analytics team
Data Collection (Crawling)
Akka
Data Processing
Scalding
Spark
Metrics & Dashboard
Play
Data Collection (Crawling)
Crawler - Requirements
Distributed
Polite
Focused
Efficient
Fault Tolerant
Extensible
Our Options
Existing Open Source
Nutch, Heritrix
Build Your Own
Work Distribution
Cluster Management
Storage
Algorithms
Our Choice - Akka
Why Akka?
Simpler
Concurrency
Program at a higher level
Don't think
threads, shared state
Scalability
Distributed by Design
Fault Tolerance
So what's the
secret sauce?
Actors
What is an Actor?
Unit of computation in Akka
Message passing concurrency
Event Driven
a mailbox
behavior & state
scheduled to run when sent a message
Used since 1973
in Telecom with 9 nines availability
With a Diagram
Show me the code
Hello World
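A minimal sketch of the hello-world slide in the classic akka.actor API; the Greeter actor and Greet message are illustrative names, not taken from the original deck.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Messages are plain immutable values
case class Greet(name: String)

// An actor: a mailbox plus behaviour and state, scheduled to run when a message arrives
class Greeter extends Actor {
  def receive = {
    case Greet(name) => println(s"Hello, $name!")
  }
}

object HelloWorld extends App {
  val system = ActorSystem("hello")
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! Greet("Indix") // fire-and-forget message send
}
```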
Parallelism
Scale by creating multiple actor instances
Order is not guaranteed
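One way to scale an actor out, sketched with the classic RoundRobinPool router (the newer classic-router API; the Akka 2.2 era described in the deck used RoundRobinRouter). The Worker actor is hypothetical.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

class Worker extends Actor {
  def receive = {
    case url: String => println(s"${self.path.name} fetching $url")
  }
}

object Parallel extends App {
  val system = ActorSystem("crawler")
  // Five identical Worker instances behind one router; messages are
  // distributed round-robin, so processing order is not guaranteed
  val workers = system.actorOf(RoundRobinPool(5).props(Props[Worker]), "workers")
  (1 to 10).foreach(i => workers ! s"http://example.com/page/$i")
}
```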
Supervision
Manages other actors' failures
Termination messages sent to Supervisor
Separates processing & error handling
Multiple Strategies
One-For-One
All-For-One
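A sketch of a One-For-One strategy in classic Akka; the FetchAgent child, the exception types, and the retry limits are illustrative assumptions.

```scala
import akka.actor.{Actor, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.{Restart, Stop}
import scala.concurrent.duration._

class FetchSupervisor extends Actor {
  // One-For-One: only the failing child is acted on, not its siblings
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 1.minute) {
      case _: java.io.IOException      => Restart // transient fetch failure
      case _: IllegalArgumentException => Stop    // bad input, don't retry
    }

  val fetcher = context.actorOf(Props[FetchAgent], "fetch-agent")

  def receive = {
    case work => fetcher forward work
  }
}

class FetchAgent extends Actor {
  def receive = {
    case url: String => () // fetch the page here; failures bubble up to the supervisor
  }
}
```

Keeping the error handling in the supervisor is what the "separates processing & error handling" bullet refers to: the child only does the work, the parent decides what a failure means.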
Clustering
Released in Akka 2.2
Gossip-based Cluster Membership
Failure Detector
Cluster DeathWatch
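A sketch of listening to the gossip-driven membership and failure-detector events from inside an actor; the ClusterWatcher name is hypothetical.

```scala
import akka.actor.Actor
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

class ClusterWatcher extends Actor {
  val cluster = Cluster(context.system)

  // Membership changes and failure-detector signals arrive as ordinary messages
  override def preStart(): Unit =
    cluster.subscribe(self, classOf[MemberEvent], classOf[UnreachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(member)          => println(s"Node joined: ${member.address}")
    case UnreachableMember(member) => println(s"Failure detector flagged: ${member.address}")
    case MemberRemoved(member, _)  => println(s"Node removed: ${member.address}")
  }
}
```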
Our Setup
50 Nodes on EC2
40 Crawlers (Fetch Agents)
5 URL Frontiers
5 HDFS Data Nodes
Patterns Used
Supervision Strategy
Work Pulling
Throttling
Periodic Message Scheduling
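A sketch of the periodic message scheduling pattern from the last bullet, using the actor system's scheduler; the Tick message, the 30-second interval, and the FrontierPoller name are illustrative.

```scala
import akka.actor.{Actor, Cancellable}
import scala.concurrent.duration._

case object Tick

class FrontierPoller extends Actor {
  import context.dispatcher // execution context for the scheduler

  // Send ourselves a Tick every 30 seconds
  val ticker: Cancellable =
    context.system.scheduler.schedule(30.seconds, 30.seconds, self, Tick)

  override def postStop(): Unit = ticker.cancel()

  def receive = {
    case Tick => () // e.g. pull the next batch of URLs to crawl
  }
}
```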
Lessons Learned
The Actor Model
Like OO - New way of thinking
Actor composition
Message design
Externalize business logic
No Blocking
Degraded performance due to starvation
Separate thread pool
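A sketch of the "separate thread pool" fix: pin blocking work to its own dispatcher so it cannot starve the default one. The dispatcher name and its config are assumptions.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// In application.conf (assumed name for the dedicated pool):
//   blocking-io-dispatcher {
//     type = Dispatcher
//     executor = "thread-pool-executor"
//     thread-pool-executor { fixed-pool-size = 16 }
//   }

class BlockingFetcher extends Actor {
  def receive = {
    case url: String => () // blocking HTTP call lives here, off the default dispatcher
  }
}

object Wiring extends App {
  val system = ActorSystem("crawler")
  // The default dispatcher's threads stay free for non-blocking actors
  val fetcher = system.actorOf(
    Props[BlockingFetcher].withDispatcher("blocking-io-dispatcher"),
    "blocking-fetcher")
}
```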
Lessons Learned (Continued)
Running on EC2
Unreliable Network
Failure detector tuning
Message Tracing - Debugging
Unique identifier for all messages
Splunk to visualize the flow
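One way to carry that unique identifier, sketched with a hypothetical traceId field on every message so the flow can be reconstructed from logs in Splunk.

```scala
import java.util.UUID

// Every message carries a traceId that is logged at each hop,
// so a single crawl can be followed end to end in Splunk
case class FetchRequest(traceId: String, url: String)
case class FetchResult(traceId: String, url: String, status: Int)

object Trace {
  def newId(): String = UUID.randomUUID().toString
}
```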
Data Processing
Requirements
Batch Processing
Approx. 1M HTML pages / hour
200 M products
8 B prices
Long Running
Run in reasonable time
Fault Tolerant
Map Reduce
First Attempt - Java
Second Attempt - PIG
Third Attempt - Scalding
Why Scala?
Features to implement a DSL
Map/Reduce is within the functional paradigm
Collection API covers all use cases
API is very Scala-like
Scalding Model
Built on top of Cascading
Sources and Sinks
Read and write data
From HDFS, DBs, Memcache, etc.
Pipes represents the flows of data in the job
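A minimal Scalding job in the classic fields-based API, showing a source, a pipe, and a sink; the word-count example is illustrative, not one of the Indix jobs.

```scala
import com.twitter.scalding._

// Reads lines from HDFS (source), flows them through a pipe,
// and writes word counts back out (sink)
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```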
Where do we use Scalding?
Daily Batch Jobs
Latest products and prices to production
About 100 jobs run daily
Our Data Pipeline
Problems
Not suited for iterative algorithms
No real-time processing
Spark
What is Apache Spark?
In-memory analytics engine
Compatible with Hadoop Storage APIs
Up to 40x faster than Hadoop
For memory intensive operations
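A sketch of a Spark job over the same HDFS data, with an explicit cache() so repeated passes reuse the in-memory working set; the path and field layout are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PriceStats {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PriceStats"))

    // Reads from HDFS just like the Hadoop jobs, but keeps the RDD in memory
    val prices = sc.textFile("hdfs:///data/prices").cache()

    // First pass: total records
    val total = prices.count()

    // Second pass reuses the cached data instead of re-reading HDFS
    val nonEmpty = prices.filter(_.nonEmpty).count()

    println(s"total=$total nonEmpty=$nonEmpty")
    sc.stop()
  }
}
```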
Where do we use Spark?
Clustering for our Matching
Resources
Akka Documentation
LetItCrash Blog
Coursera Course - Reactive Programming
Questions
Thanks
Extras
Map Reduce Example
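As a stand-in for the original example slide, here is word count expressed with plain Scala collections; it follows the same map / group / reduce shape the Hadoop jobs use, with hypothetical input data.

```scala
object MapReduceSketch extends App {
  val lines = Seq("scala at indix", "scaling with scala")

  val counts = lines
    .flatMap(_.split("\\s+"))             // map phase: one record per word
    .groupBy(identity)                    // shuffle phase: group by key
    .map { case (w, ws) => w -> ws.size } // reduce phase: aggregate per key

  counts.foreach { case (w, n) => println(s"$w\t$n") }
}
```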
Cluster - Advanced
Gossip Convergence
Leader Election
Fork me on GitHub