Introduction to efficient data systems

Xiangpeng Hao
Efficient = efficient storage + efficient compute

Question 1

What's the average grade?

Data

What's the average grade?

Option 1: Store as JSON
[
	{
		"id": 1001,
		"name": "Alice Johnson",
		"grade": 85,
		"subject": "Computer Science",
		"enrollmentDate": "2024-09-01"
	},
	{
		"id": 1002,
		"name": "Bob Smith",
		"grade": 92,
		"subject": "Computer Science",
		"enrollmentDate": "2024-09-01"
	},
	{
		"id": 1003,
		"name": "Carol Davis",
		"grade": 78,
		"subject": "Computer Science",
		"enrollmentDate": "2024-09-02"
	},
	{
		"id": 1004,
		"name": "David Wilson",
		"grade": 88,
		"subject": "Mathematics",
		"enrollmentDate": "2024-09-02"
	},
	{
		"id": 1005,
		"name": "Emma Brown",
		"grade": 95,
		"subject": "Mathematics",
		"enrollmentDate": "2024-09-03"
	}
]
How many bytes of data do we have?
~200 bytes
How many bytes do we store?
~436 bytes
Repetitive column names!

What's the average grade?

Option 2: Store as CSV
ID,Name,Grade,Subject,Enrollment Date
1001,Alice Johnson,85,Computer Science,2024-09-01
1002,Bob Smith,92,Computer Science,2024-09-01
1003,Carol Davis,78,Computer Science,2024-09-02
1004,David Wilson,88,Mathematics,2024-09-02
1005,Emma Brown,95,Mathematics,2024-09-03
How many bytes do we read?
~238 bytes
(entire CSV file)
How many bytes do we use?
20 bytes
(only grade column)
Read unused data!

What's the average grade?

Option 3: YAFF-V0
💡 Yet Another File Format, version 0
ID: 1001,1002,1003,1004,1005
Name: Alice Johnson,Bob Smith,Carol Davis,David Wilson,Emma Brown
Grade: 85,92,78,88,95
Subject: Computer Science,Computer Science,Computer Science,Mathematics,Mathematics
Enrollment Date: 2024-09-01,2024-09-01,2024-09-02,2024-09-02,2024-09-03
How many bytes do we read?
118 bytes
(ID to Grade column)
How to only read grade column?
Needs metadata!

What's the average grade?

YAFF-V1

1001,1002,1003,1004,1005
Alice Johnson,Bob Smith,Carol Davis,David Wilson,Emma Brown
85,92,78,88,95
Computer Science,Computer Science,Computer Science,Mathematics,Mathematics
2024-09-01,2024-09-01,2024-09-02,2024-09-02,2024-09-03
id:1; name:2; grade:3; subject:4; enrollmentDate:5
  1. Read metadata (usually small!)
  2. Read selected range
Achievement: read only we need!
Checkpoint 1: row store vs column store
Row store
ID,Name,Grade,Subject,Enrollment Date
1001,Alice Johnson,85,Computer Science,2024-09-01
1002,Bob Smith,92,Computer Science,2024-09-01
1003,Carol Davis,78,Computer Science,2024-09-02
1004,David Wilson,88,Mathematics,2024-09-02
1005,Emma Brown,95,Mathematics,2024-09-03
Column store
1001,1002,1003,1004,1005
Alice Johnson,Bob Smith,Carol Davis,David Wilson,Emma Brown
85,92,78,88,95
Computer Science,Computer Science,Computer Science,Mathematics,Mathematics
2024-09-01,2024-09-01,2024-09-02,2024-09-02,2024-09-03
id:1; name:2; grade:3; subject:4; enrollmentDate:5
CSV
JSON
MySQL
PostgreSQL
SQL server
Parquet
ORC
Arrow
DuckDB
DataFusion

What are the pros and cons?
hint: how to add a row?

Question 2

What's the average grade that are larger than 85?

Data

What's the average grade that are larger than 85?

Option 1: YAFF-V1
2001,2002,2003,2004,2005,2006,2007,2008,2009,2010
Jake Miller,Sophie Chen,Marcus Rodriguez,Luna Park,Ethan Thompson,Maya Patel,Oliver Kim,Isabella Garcia,Noah Williams,Zara Ahmed
63,67,72,75,79,82,84,89,93,97
Physics,Chemistry,Biology,Computer Science,Mathematics,Physics,Chemistry,Biology,Computer Science,Mathematics
2024-09-05,2024-09-06,2024-09-07,2024-09-08,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14
id:1; name:2; grade:3; subject:4; enrollmentDate:5
  1. Read metadata (usually small!)
  2. Read selected range

How to do better?
Data is (roughly) sorted/clustered:
63,67,72,75,79,82,84,89,93,97

What's the average grade that are larger than 85?

YAFF-V2
63,67,72,75,79 | min: 63, max: 79
82,84,89,93,97 | min: 82, max: 97
  1. Chunk the data, each chunk (e.g., 5 rows) maintains its min and max values.
  2. If a chunk's max value is smaller than 85, skip the chunk entirely.

What's the average grade that are larger than 85?

YAFF-V2 (example)

What's the average grade that are larger than 85?

YAFF-V2: chunks + statistics
Overall view of YAFF-V2
Checkpoint 2: Static pruning
Key idea: prune data using pre-computed statistics
Zone map
(min/max/null count)
Filter
(Bloom filter, range filter, etc.)
Index
(B-tree, Hashtable, LSM-tree, etc.)

Question 3

Read all Subjects!

Data

Read all Subjects!

YAFF-V2
Only read subjects Only read subjects
How many bytes do we read?
Can you do better?
Hint: how many strings are unique?

Read all Subjects!

YAFF-V3

Only read subjects
  1. Key idea: replace full strings with symbols.
  2. Store the symbols (dictionary) in the metadata.
Less data to read -> faster

Read all Subjects!

YAFF-V3 (+compression)

Overall view of YAFF-V2
→
Only read subjects
  1. Each column is "dictionary-encoded"
  2. Dictionary is stored in the metadata
Checkpoint 3: Big data needs compression
Key idea: most data has low entropy
General purpose compression
(Bytes in, bytes out)
Snappy, LZ4, ZSTD, etc.
Type-specific compression
(Datatype-dependent compression)
Delta encoding, Run-length encoding, etc.

Question 4

List all names whose grade are larger than 85!

Data

List all names whose grade are larger than 85!

Option 1: read then filter
Step 1: Read both columns
Jake Miller,Sophie Chen,Marcus Rodriguez,Luna Park,Ethan Thompson,Maya Patel,Oliver Kim,Isabella Garcia,Noah Williams,Zara Ahmed
63,67,72,75,79,82,84,89,93,97

Step 2: Filter grade > 85
Isabella Garcia,Noah Williams,Zara Ahmed
Can you do better?

List all names whose grade are larger than 85!

Option 2:
read -> filter -> read
Step 1: Read Grade
63,67,72,75,79,82,84,89,93,97

Step 2: Filter grade > 85
00000_10011

Step 3: Read Name
Isabella Garcia,Noah Williams,Zara Ahmed
Checkpoint 4: dynamic pruning
Key idea: prune data based on what we already read
Filter pushdown
Aggregation pushdown
Dynamic filtering
Side-way information passing
Summary
  1. Row store vs Column store
    → Write-optimized vs Read-optimized
  2. Static pruning
    → Prune data using pre-computed statistics
  3. Compression
    → Most data has low entropy
  4. Dynamic pruning
    → Prune data based on what we already read
Summary
Efficient data system: a beautiful new world!
* Red boxed topics are briefly covered in this lecture.