Runaway complexity in Big Data… and a plan to stop it
Lack of human fault tolerance
worst things ever: data loss and data corruption; any other failure can be fixed by restoring the system to a previous version
- An event plus the particular time it happened => always true; facts are immutable
- Immutability restricts the range of possible errors, making the system more human-fault-proof
- Only C and R from CRUD are needed, and create and read are the easy operations to implement (see the sketch below)
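A minimal sketch in Java (hypothetical names, not the talk's code) of an append-only fact store: each record pairs an event with the time it occurred, so it stays true forever, and the store exposes only create and read.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FactStore {
  // An event plus the particular time it happened: always true.
  public record Fact(String event, Instant occurredAt) {}

  private final List<Fact> facts = new ArrayList<>();

  // C from CRUD: append a new fact; existing facts are never modified.
  public void append(String event, Instant occurredAt) {
    facts.add(new Fact(event, occurredAt));
  }

  // R from CRUD: read the full history; there is no U or D to get wrong.
  public List<Fact> readAll() {
    return Collections.unmodifiableList(facts);
  }
}
```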
Conflation of data
- when you store the same data redundantly in several tables, it's up to you to keep every copy in sync
Schemas done wrong
schema = function(data unit) => is it valid?
- checking every unit against the schema prevents corruption
Apache Thrift works well as a schema tool (sketch below)
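A hedged Java sketch of the schema-as-function idea, using a hypothetical Pageview unit; Thrift's generated code enforces roughly this kind of required-field and type check from an IDL definition.

```java
import java.util.function.Predicate;

public class SchemaCheck {
  // Hypothetical data unit for illustration.
  public record Pageview(String url, String userId, long timestampMillis) {}

  // schema = function(data unit) => is it valid?
  // Reject corrupt units before they reach the master dataset.
  static final Predicate<Pageview> PAGEVIEW_SCHEMA = pv ->
      pv.url() != null && pv.url().startsWith("http")
      && pv.userId() != null && !pv.userId().isEmpty()
      && pv.timestampMillis() > 0;

  public static void main(String[] args) {
    Pageview ok = new Pageview("http://example.com", "u1", 1700000000000L);
    Pageview bad = new Pageview(null, "u1", -1);
    System.out.println(PAGEVIEW_SCHEMA.test(ok));   // true
    System.out.println(PAGEVIEW_SCHEMA.test(bad));  // false
  }
}
```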
Most NoSQL databases keep mutability, and that's not the right direction
What does a data system do?
query = function(all data)
All data --> Precomputed view --> Query
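A toy Java sketch of this pipeline (names are hypothetical): the ideal model is query = function(all data), but scanning everything per query is too slow, so the view is precomputed once and queries become cheap reads against it.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ViewPipeline {
  // Precompute: run once over ALL data (here, raw pageview URLs).
  static Map<String, Long> precompute(List<String> allPageviews) {
    return allPageviews.stream()
        .collect(Collectors.groupingBy(url -> url, Collectors.counting()));
  }

  // Query: a cheap lookup against the precomputed view.
  static long query(Map<String, Long> view, String url) {
    return view.getOrDefault(url, 0L);
  }

  public static void main(String[] args) {
    Map<String, Long> view = precompute(List.of("/a", "/b", "/a"));
    System.out.println(query(view, "/a")); // 2
  }
}
```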
How to compute the views?
run a function over all the data -> MapReduce (arbitrary functions on arbitrary data); see the sketch below
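A sketch of computing such a batch view with Hadoop MapReduce: pageview counts per URL. The input layout (one "url<TAB>timestamp" line per event) and class names are assumptions, not the talk's code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageviewCounts {
  public static class PageviewMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Assumed layout: "url<TAB>timestamp" per pageview event.
      String url = line.toString().split("\t")[0];
      ctx.write(new Text(url), ONE);
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text url, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(url, new LongWritable(sum)); // one row per URL in the view
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pageview-counts");
    job.setJarByClass(PageviewCounts.class);
    job.setMapperClass(PageviewMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // all data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // batch view
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```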
Should write the views to a DB that:
- supports batch writes
- supports fast random reads
- e.g. ElephantDB or Voldemort
All data must be normalized!
Batch view can be denormalized!
Eventual consistency: the batch views lag behind, by a couple of hours for example.
How to compute the queries?
Precompute a realtime view from the data that arrived since the last batch run; a query merges the batch view with the realtime view (see the sketch below)
Performance and accuracy can coexist: the realtime view favors speed and may approximate, while the batch view eventually recomputes the exact answer
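A sketch with hypothetical in-memory stores of how a query merges the two views: the batch view supplies the exact historical count, the realtime view the delta since the last batch run.

```java
import java.util.Map;

public class MergedQuery {
  private final Map<String, Long> batchView;    // exact, hours old (e.g. served from ElephantDB)
  private final Map<String, Long> realtimeView; // recent, possibly approximate

  public MergedQuery(Map<String, Long> batchView, Map<String, Long> realtimeView) {
    this.batchView = batchView;
    this.realtimeView = realtimeView;
  }

  // query = merge(batch view, realtime view)
  public long pageviews(String url) {
    return batchView.getOrDefault(url, 0L) + realtimeView.getOrDefault(url, 0L);
  }
}
```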