Sitemap

Understanding Database Storage and B-Trees: Where and How Data is Stored

4 min readNov 19, 2024
Press enter or click to view image in full size

When you want to understand database, you should ask these two fundamental questions:

  1. Where is the data stored?
  2. How does the database write or retrieve it?

To answer these, let’s break it down into two parts: Where and How.

Part 1: Where is the data stored?

Imagine this scenario: you insert a new record into a database. The data eventually needs to be written to the operating system’s disk. But how does it get there? Let’s trace it step by step:

1. Disk and Data Files

Databases store their data in structured data files on the disk. These files are not human-readable because:

  • The data is often encrypted at rest for security purposes.
  • It’s typically compressed to optimize storage space.

This is why, if you inspect a database file directly (e.g., using cat), it looks like gibberish. Only the database system can decode this, much like how Nepali calligraphy is understandable only to Nepali readers.

2. Data Files Contain Pages

Data files are divided into fixed-size units called pages (or blocks). In most relational databases (e.g., PostgreSQL), each page is 8KB in size. These pages are the smallest unit of I/O in the database system.

3. Pages Contain Tuples

Inside these pages are tuples, which represent the rows of a table.

4. Tuples Contain Data

Finally, within the tuples, you find the actual data values — the fields, columns, and their respective values.

This explains the WHERE part: The data resides in pages within data files on the disk, and the database system organizes this structure to allow efficient access.

Part 2: How does the database write or retrieve data?

Now, let’s address the HOW part. Suppose you execute a query to read or write data — like a SELECT or INSERT. How does the database know where to look or where to write?

Efficient data handling is key here. The database doesn’t simply search through files linearly — it uses advanced mechanisms to search, sort, and retrieve data efficiently. Let’s break this process down:

1. The Role of Indexes

Think of a dictionary. If you need to update the definition of a word starting with “M,” you wouldn’t start searching from “A.” Instead, you’d go to the index.

Databases use a similar concept:

  • They create indexes (commonly a B-Tree) to organize and quickly locate data.

A B-Tree consists of:

  • Root nodes: The top-level entry point.
  • Internal nodes: Branches that further refine the search.
  • Leaf nodes: These point to the actual pages containing data.

Indexes act as guides, helping the database efficiently locate where the data resides.

2. Using the Index

Here’s how a database processes your query:

  • It uses the B-Tree to navigate to the leaf node that contains a pointer to the page where the data resides.
  • Once the page is identified, it is loaded from the disk into memory for further processing. (Why load into memory? Homework for you 😉)

3. Filtering Data

If your query includes a condition (e.g., a WHERE clause), the database applies filters at this stage to match your criteria.

4. Assembling Results

After filtering, the relevant tuples (rows) are retrieved. These rows contain the fields, columns, and data you requested, and the database presents the result in tabular form.

Summarizing the Process

To summarize, when you execute a read query:

  1. The database uses an index (e.g., a B-Tree) to find the leaf node pointing to the correct page.
  2. The leaf node directs the database to the physical page on disk.
  3. The page is loaded into memory.
  4. Filters (e.g., WHERE clauses) are applied to extract relevant tuples.
  5. The result is assembled and returned.

And when you execute a write query, a similar process occurs, but in reverse:

  1. The database determines where to store the new data (based on table structure and indexes).
  2. The data is written to the appropriate page in memory.
  3. The page is then marked as “dirty” and later flushed to disk (via a process like checkpointing).
  4. Indexes are updated to reflect the new data.
  5. The database identifies where the data should go, updates the index, and writes the data to the appropriate page.

Understanding WHERE and HOW helps you know how a databases handle your data, making you a more informed engineer. IMHO, databases are a very fascinating piece of technology. They are simple yet very sophisticated, which we often take for granted.

There are other interesting topics, beyond the essentials that databases offer like Write-Ahead Logging (WAL), logical replication, and shared libraries/extensions to enhance performance and scalability. Each of these features opens up a whole new world of possibilities for optimizing database systems. I highly recommend exploring these advanced topics when you have the time.

--

--

Prabesh
Prabesh

Written by Prabesh

Senior Site Reliability Engineer & Backend Engineer | Docker Captain 🐳 | https://www.linkedin.com/in/prabeshthapa

Responses (1)