Mar 28, 2022

Notable Papers for Big Data Systems

Papers every big data / distributed systems practitioner should know (with apologies):

MapReduce: Simplified Data Processing on Large Clusters
The Google File System
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Bigtable: A Distributed Storage System for Structured Data
Dremel: A Decade of Interactive SQL Analysis at Web Scale
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
In Search of an Understandable Consensus Algorithm
A Relational Model of Data for Large Shared Data Banks
Dask: Parallel Computation with Blocked algorithms and Task Scheduling
Hidden Technical Debt in Machine Learning Systems
Borg, Omega, and Kubernetes

Other notable links:

Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google
Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes
Why you should pick strong consistency, whenever possible
Towards a One Size Fits All Database Architecture
How Apache Kafka Inspired Our Platform Events Architecture
SageDB: A Learned Database System
Keeping CALM: When Distributed Consistency Is Easy
Dremel made simple with Parquet
Predicate Pushdown in Parquet and Apache Spark
On Byzantine Fault Tolerance in Multi-Master Kubernetes Clusters
The Beginner’s Guide to Distributed Computing
Testing Distributed Systems
Lessons on ML Platforms — from Netflix, DoorDash, Spotify, and more
The Secrets of Productive Developer Tools
Deployment Archetypes for Cloud Applications
Distributed Systems Lectures

eraya's blog
eraya@a21an.org

erayaslan
zxdkg

On software development, and perhaps rants on management, hype, and silliness.