Testing Challenges in Big Data Applications
Rajesh Muppalla, Indix.
- rajesh@indix.com
About Me
Developer @ Indix
Part of Platform Team
Working on Distributed Systems & Big Data
Ex-Thoughtworks
Worked on Go, a continuous delivery product
About Indix
Product Intelligence Platform
Founded - Jan 2012
45-person team in Chennai & Seattle
Data Company
Some Stats
150 M Products
6 B Prices
3 TB data crawled daily
Data Pipeline @ Indix
What is Big Data?
3Vs of Big Data
Volume
Terabytes +
Velocity
Real Time, Streaming
Variety
Heterogeneous Sources
Unstructured Data
Traditional Testing
You Test For
Happy Path Scenarios
Failures
Fault Tolerance/Concurrency/Performance
Practices
TDD
Continuous Integration
Mocks/Stubs
What do you test for in Data?
Accuracy
Capturing the real price
Coverage
All products within an e-commerce store
Consistency
Duplicate URLs should be aggregated as one
Completeness
All mandatory attributes of a product captured
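Two of the dimensions above, coverage and completeness, reduce to simple ratios that can be asserted on in a test. The following is a minimal sketch, not the checks Indix runs: the attribute names in MANDATORY and both function names are hypothetical.

```python
# Hypothetical mandatory attributes; the talk does not list the real ones.
MANDATORY = ("title", "price", "url")


def completeness(records, mandatory=MANDATORY):
    """Fraction of records in which every mandatory attribute is present and non-empty."""
    records = list(records)
    if not records:
        return 0.0
    complete = sum(
        1 for r in records
        if all(r.get(k) not in (None, "") for k in mandatory)
    )
    return complete / len(records)


def coverage(crawled_urls, known_urls):
    """Fraction of the known product URLs of a store that were actually crawled."""
    known = set(known_urls)
    if not known:
        return 0.0
    return len(known & set(crawled_urls)) / len(known)
```

Tracking these as ratios, rather than as pass/fail on single records, allows a threshold to be set per store or per category.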
Some Challenges
URL Canonicalization
Problem Statement
Normalize URLs
RFC 3986
Unit Testing
Gotchas
Sorting Query Params
Removing Double Slashes
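The two gotchas above go beyond the syntax-based normalization described in RFC 3986, so they are worth pinning down with unit tests. Below is a minimal sketch using Python's standard urllib.parse; the function name canonicalize and the exact rule set are assumptions, not the Indix implementation.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Return a normalized form of url (illustrative rule set only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: userinfo is dropped
    # Keep the port only when it is not the default for the scheme.
    if parts.port and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"
    # Collapse repeated slashes in the path. Assumption: the target
    # sites treat "//" and "/" as the same resource.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    # Sort query parameters so that ordering differences do not create
    # distinct URLs. parse_qsl decodes and urlencode re-encodes, so the
    # percent-encoding of the query may change form.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, host, path, query, ""))
```

A useful property to test is idempotence: canonicalizing an already canonical URL should return it unchanged. Both rules can be wrong for a particular site (parameter order or a double slash may be significant), so they are best verified against real URLs from each store.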
Product Tagger
Problem Statement
Unit Testing
Integration Testing
Gotchas
Server is Down/Under Maintenance
Sample page no longer available
Testing Output Data Semantics
Product Matching Accuracy
Problem Statement
Unit Testing
Gotchas
How to Verify Accuracy at Scale?
Verification is as complex as the implementation itself
Back to Square One
Changing Algorithms and Data
Problem Statement
Go back in time and re-process
Gotchas
Too much volatility
Testing Techniques
Data Sampling
Take random samples
Do Manual verification
Use Mechanical Turk To Scale
Automate this process
Guard against data bias
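One way to guard against bias when sampling is to draw the same number of records from each group (for example, each store) instead of sampling the whole dataset uniformly, which would let the largest stores dominate. This is a sketch under that assumption; stratified_sample and its parameters are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_group, seed=0):
    """Draw up to per_group random records from each group defined by key."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for name in sorted(groups):  # sorted for a deterministic order
        items = groups[name]
        sample.extend(rng.sample(items, min(per_group, len(items))))
    return sample
```

The resulting sample can then be sent for manual verification or to a crowdsourcing service, and the fixed seed makes it possible to re-verify the same records after an algorithm change.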
Metrics, Monitoring & Alerting
Metrics Funnel @ Indix
Don’t monitor absolute counts
Anomaly Detection & Correlation
Use Historical Stats
Remove outliers
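One reading of the points above: monitor the ratio between consecutive funnel stages rather than raw counts, and compare today's ratio against a history from which outliers have been trimmed. The sketch below uses Tukey fences and a mean and standard deviation band as stand-in techniques; the function names and thresholds are assumptions, not the method used at Indix.

```python
import statistics


def stage_ratio(out_count, in_count):
    """Ratio between two funnel stages, e.g. pages parsed / pages crawled."""
    return out_count / in_count if in_count else 0.0


def trim_outliers(values, k=1.5):
    """Drop values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        return list(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def is_anomalous(today, history, z=3.0):
    """True if today's value is more than z standard deviations from the trimmed mean."""
    clean = trim_outliers(history)
    if len(clean) < 2:
        return False  # not enough history to judge
    mean = statistics.fmean(clean)
    sd = statistics.stdev(clean)
    if sd == 0:
        return today != mean
    return abs(today - mean) > z * sd
```

Trimming matters because a past incident left in the history widens the band and can hide the next incident.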
Tiered Data Quality
Tier Your Data
Top E-commerce Sites
Top Categories
For Top Tier Data
Larger sample validation
Lower alert thresholds
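Tiering can be expressed as a small policy table that drives both the sample size and the alert threshold. The sketch below is illustrative only: the site and category names, the numbers, and the names TierPolicy, tier_of and should_alert are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    sample_size: int  # records to validate manually per run
    max_drop: float   # alert when a metric falls by more than this fraction


# Hypothetical values: the top tier gets a larger sample and a lower threshold.
POLICIES = {
    "top": TierPolicy(sample_size=500, max_drop=0.02),
    "default": TierPolicy(sample_size=50, max_drop=0.10),
}
TOP_SITES = {"store-a.example"}   # hypothetical
TOP_CATEGORIES = {"electronics"}  # hypothetical


def tier_of(site, category):
    """Top tier if either the site or the category is in a top list."""
    return "top" if site in TOP_SITES or category in TOP_CATEGORIES else "default"


def should_alert(site, category, baseline, current):
    """Alert when the relative drop from baseline exceeds the tier's threshold."""
    if baseline <= 0:
        return False
    policy = POLICIES[tier_of(site, category)]
    return (baseline - current) / baseline > policy.max_drop
```

With these numbers, the same drop from 0.90 to 0.85 triggers an alert for a top-tier site but not for a default-tier one.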
In Summary
Big Data Testing
is hard
gives great dividends
builds on top of traditional testing techniques
needs good data understanding
Very nascent field
ripe for thought leadership
Questions
Thanks
Lambda Architecture
All data is immutable
Three Layers
Batch Layer
Serving Layer
Speed Layer
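The three layers can be illustrated with a toy example: a view is a pure function of immutable events, the batch layer builds it from the full dataset, the speed layer builds it from recent events only, and the serving side merges both at query time. This is a simplified sketch; PriceEvent, build_view and query are hypothetical names, and a real system would use distributed storage and processing.

```python
from typing import NamedTuple


class PriceEvent(NamedTuple):
    """An immutable fact: a price observed for a product at a point in time."""
    product_id: str
    price: float
    ts: int


def build_view(events):
    """Latest event per product, computed as a pure function of the events."""
    view = {}
    for event in events:
        current = view.get(event.product_id)
        if current is None or event.ts >= current.ts:
            view[event.product_id] = event
    return view


def query(product_id, batch_view, speed_view):
    """Merge the batch and speed views, preferring the most recent event."""
    candidates = [v[product_id] for v in (batch_view, speed_view) if product_id in v]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e.ts).price
```

Because build_view never mutates its input, a bug in the view logic can be fixed and the view rebuilt from the same events.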
Lambda Architecture
Human Fault Tolerance
Embrace Human Errors
Reduces Complexity
Rebuild everything on errors