Notable Papers for Big Data Systems
Papers every big data / distributed systems practitioner should know (with apologies):
- MapReduce: Simplified Data Processing on Large Clusters
- The Google File System
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- Bigtable: A Distributed Storage System for Structured Data
- Dremel: A Decade of Interactive SQL Analysis at Web Scale
- Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
- In Search of an Understandable Consensus Algorithm
- A Relational Model of Data for Large Shared Data Banks
- Dask: Parallel Computation with Blocked algorithms and Task Scheduling
- Hidden Technical Debt in Machine Learning Systems
- Borg, Omega, and Kubernetes
Other notable links:
- Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google
- Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes
- Why you should pick strong consistency, whenever possible
- Towards a One Size Fits All Database Architecture
- How Apache Kafka Inspired Our Platform Events Architecture
- SageDB: A Learned Database System
- Keeping CALM: When Distributed Consistency Is Easy
- Dremel made simple with Parquet
- Predicate Pushdown in Parquet and Apache Spark
- On Byzantine Fault Tolerance in Multi-Master Kubernetes Clusters
- The Beginner’s Guide to Distributed Computing
- Testing Distributed Systems
- Lessons on ML Platforms — from Netflix, DoorDash, Spotify, and more
- The Secrets of Productive Developer Tools
- Deployment Archetypes for Cloud Applications
- Distributed Systems Lectures