Interesting discussion led by Michael Brown, CTO of comScore, at our CTO Roundtable breakfast meeting.
Michael and his team manage and analyze one of the largest datasets processed by a local company. It is fascinating how thinking has to evolve when volumes reach billions of rows of data per day. Some of the key takeaways for me were:
- Sort data before processing: a sorted input can be aggregated in a single streaming pass (see the first sketch after this list)
- Shard the dataset into smaller, more manageable files
- Processing shards in parallel improves turnaround time and allows for scalability (see the second sketch below)
- Open-source tools have reached a level of maturity that makes them serious options to consider
- Most of the processing is now accomplished on smaller commodity machines
- Machine memory is more of a limiting resource than hard disk
- Solid-state disks can be used selectively to improve performance for I/O-intensive processes
- Security is best managed by separating out sensitive data and focusing protection on it (see the third sketch below)
- SQL is still king: it is easily grasped even by non-technical folks, which makes it easier to find and develop talent with SQL skills (a toy example is in the fourth sketch below)
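
To make the sort takeaway concrete, here is a minimal Python sketch (the file name `events_sorted.csv` and the key-in-first-column layout are my assumptions, not a description of comScore's pipeline): once rows are sorted by key, aggregation becomes one streaming pass with near-constant memory, which matters when memory, not disk, is the scarce resource.

```python
import csv
import itertools

def count_rows_per_key(path):
    """Stream a file already sorted by its first column and count
    rows per key, without holding the dataset in memory."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        for key, group in itertools.groupby(reader, key=lambda row: row[0]):
            yield key, sum(1 for _ in group)

for key, count in count_rows_per_key("events_sorted.csv"):
    print(key, count)
```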
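
Sharding and parallelism go together: once the data is split into files, each shard can be handed to its own worker process. A minimal sketch, assuming a hypothetical `shards/` directory layout:

```python
import glob
import multiprocessing

def count_rows(shard_path):
    # Per-shard work; in practice this would be whatever
    # aggregation fits comfortably in one worker's memory.
    with open(shard_path) as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    shards = sorted(glob.glob("shards/events-*.csv"))  # hypothetical layout
    with multiprocessing.Pool() as pool:
        per_shard = pool.map(count_rows, shards)
    print("total rows:", sum(per_shard))
```

Scaling out then mostly means adding shards and workers, which is where the scalability point comes from.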
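
On the security point, one common way to selectively separate sensitive data is tokenization: strip the sensitive column out into its own tightly secured store and leave only an opaque token in the bulk data. A sketch under an assumed column layout of (email, page, timestamp); the file names and schema are invented for illustration:

```python
import csv
import secrets

def split_sensitive(in_path, events_out, pii_out):
    # Replace the email column with a random token, and write the
    # token-to-email mapping to a separate file that can be secured
    # independently of the bulk event data.
    tokens = {}
    with open(in_path, newline="") as src, \
         open(events_out, "w", newline="") as ev, \
         open(pii_out, "w", newline="") as pii:
        ev_writer, pii_writer = csv.writer(ev), csv.writer(pii)
        for email, page, ts in csv.reader(src):
            if email not in tokens:
                tokens[email] = secrets.token_hex(8)
                pii_writer.writerow([tokens[email], email])
            ev_writer.writerow([tokens[email], page, ts])

split_sensitive("raw_events.csv", "events.csv", "pii_map.csv")
```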
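
And on SQL's readability: a grouped count reads almost like the question it answers, which is why non-technical folks pick it up quickly. A toy, self-contained example using an in-memory SQLite database (the schema is made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, page TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("u1", "/home"), ("u1", "/about"), ("u2", "/home")])

# "How many pages did each user view?" reads straight off the query.
for user_id, pages in con.execute(
        "SELECT user_id, COUNT(*) AS pages FROM events GROUP BY user_id"):
    print(user_id, pages)
```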