8.2. Buffer Manager Structure
The PostgreSQL buffer manager comprises three layers: the ‘buffer table’, ‘buffer descriptors’ and ‘buffer pool’ (Figure 8.3):
-
Buffer pool: An array that stores data file pages. Each slot in the array is referred to as buffer_ids.
-
Buffer descriptors: An array of buffer descriptors. Each descriptor has a one-to-one correspondence to a buffer pool slot and holds the metadata of the stored page in the corresponding slot.
Note that the term ‘buffer descriptors layer’ has been adopted for convenience and is only used in this document. -
Buffer table: A hash table that stores the relations between the buffer_tags of stored pages and the buffer_ids of the descriptors that hold the stored pages’ respective metadata.
These layers are described in detail in the following subsections.
8.2.1. Buffer Table
A buffer table can be logically divided into three parts: a hash function, hash bucket slots, and data entries (Figure 8.4).
The built-in hash function maps buffer_tags to the hash bucket slots. Even though the number of hash bucket slots is greater than the number of buffer pool slots, collisions may occur. Therefore, the buffer table uses a separate chaining with linked lists method to resolve collisions. When data entries are mapped to the same bucket slot, this method stores the entries in the same linked list, as shown in Figure 8.4.
A data entry comprises two values: the buffer_tag of a page, and the buffer_id of the descriptor that holds the page’s metadata. For example, a data entry ‘Tag_A, id=1’ means that the buffer descriptor with buffer_id ‘1’ stores metadata of the page tagged with Tag_A.
The hash function is a composite function of calc_bucket() and hash().
The following is its representation as a pseudo-function.
uint32 bucket_slot = calc_bucket(unsigned hash(BufferTag buffer_tag), uint32 bucket_size)
Note: Basic operations (lookup, insertion, and deletion of data entries) are not explained here. These are very common operations and are explained in the following sections.
8.2.2. Buffer Descriptor
The structure of buffer descriptors was improved in version 9.6 This section first explains buffer descriptors in versions 9.5 or earlier, followed by a discussion of the changes introduced in versions 9.6 or later.
The buffer descriptors layer is described in the next subsection.
8.2.2.1. Versions 9.5 or earlier
The buffer descriptor structure in versions 9.5 and earlier holds the metadata of the stored page in the corresponding buffer pool slot.
The buffer descriptor structure is defined by the BufferDesc
structure.
The following are some of the main fields:
-
tag holds the buffer_tag of the stored page in the corresponding buffer pool slot. (buffer tag is defined in Section 8.1.2.)
-
buf_id identifies the descriptor. It is equivalent to the buffer_id of the corresponding buffer pool slot.
-
refcount (also referred to as pin count) holds the number of PostgreSQL processes currently accessing the associated stored page.
When a PostgreSQL process accesses the stored page, the refcount must be incremented by 1 (refcount++). After accessing the page, the refcount must be decreased by 1 (refcount–).
When the refcount is zero, the associated stored page is unpinned, meaning it is not currently being accessed. Otherwise, it is pinned. -
usage_count holds the number of times the associated stored page has been accessed since it was loaded into the corresponding buffer pool slot. It is used in the page replacement algorithm (Section 8.4.4).
-
content_lock and io_in_progress_lock are light-weight locks that are used to control access to the associated stored page. These fields are described in Section 8.3.2.
-
flags can hold several states of the associated stored page. The main states are as follows:
-
dirty bit indicates that the stored page is dirty.
-
valid bit indicates whether the stored page is valid, meaning it can be read or written.
If this bit is valid, then the corresponding buffer pool slot stores a page and the descriptor holds the page metadata, and the stored page can be read or written.
If this bit is invalid, then the descriptor does not hold any metadata and the stored page cannot be read or written. -
io_in_progress bit indicates whether the buffer manager is reading or writing the associated page from or to storage.
-
-
buf_hdr_lock is a spin lock that protects the fields: flags, usage_count, refcount.
-
freeNext is a pointer to the next descriptor to generate a freelist, which is described in the next subsection.
8.2.2.2. Versions 9.6 or later
The buffer descriptor structure is defined by the BufferDesc
structure.
-
tag holds the buffer_tag of the stored page in the corresponding buffer pool slot.
-
buf_id identifies the descriptor.
-
content_lock is a light-weight lock that is used to control access to the associated stored page.
-
freeNext is a pointer to the next descriptor to generate a freelist.
-
states can hold several states and variables of the associated stored page, such as refcount and usage_count.
The flags, usage_count, and refcount fields have been combined into a single 32-bit data (states) to use the CPU atomic operations. Therefore, the io_in_progress_lock and spin lock (buf_hdr_lock) have been removed since there is no longer a need to protect these values.
Instead of exclusive control of data by software, CPU atomic operations use hardware-enforced exclusive control and atomic read-modify-write operations to perform efficiently.
The structure BufferDesc is defined in src/include/storage/buf_internals.h.
8.2.2.3. Descriptor States
To simplify the following descriptions, we define three descriptor states as follows:
-
Empty: When the corresponding buffer pool slot does not store a page (i.e. refcount and usage_count are 0), the state of this descriptor is empty.
-
Pinned: When the corresponding buffer pool slot stores a page and any PostgreSQL processes are accessing the page (i.e. refcount and usage_count are greater than or equal to 1), the state of this buffer descriptor is pinned.
-
Unpinned: When the corresponding buffer pool slot stores a page but no PostgreSQL processes are accessing the page (i.e. usage_count is greater than or equal to 1, but refcount is 0), the state of this buffer descriptor is unpinned.
Each descriptor will have one of the above states. The descriptor state changes depending on certain conditions, which are described in the next subsection.
In the following figures, buffer descriptors’ states are represented by coloured boxes.
- (white) Empty
- (blue) Pinned
- (aqua blue) Unpinned
In addition, a dirty page is denoted as ‘X’. For example, an unpinned dirty descriptor is represented by X .
8.2.3. Buffer Descriptors Layer
A collection of buffer descriptors forms an array, which is referred to as the buffer descriptors layer in this document.
When the PostgreSQL server starts, the state of all buffer descriptors is ’empty’. In PostgreSQL, those descriptors comprise a linked list called freelist (Figure 8.5).
It is important to note that the freelist in PostgreSQL is completely different concept from the freelists in Oracle. The freelist in PostgreSQL is simply linked list of empty buffer descriptors. In PostgreSQL, freespace maps (FSM), which are described in Section 5.3.4, serve as the same purpose as the freelists in Oracle.
Figure 8.6 shows that how the first page is loaded.
-
(1) Retrieve an empty descriptor from the top of the freelist, and pin it (i.e. increase its refcount and usage_count by 1).
-
(2) Insert a new entry into the buffer table that maps the tag of the first page to the buffer_id of the retrieved descriptor.
-
(3) Load the new page from storage into the corresponding buffer pool slot.
-
(4) Save the metadata of the new page to the retrieved descriptor.
The second and subsequent pages are loaded in a similar manner. Additional details are provided in Section 8.4.2.
Descriptors that have been retrieved from the freelist always hold page’s metadata. In other words, non-empty descriptors do not return to the freelist once they have been used. However, the corresponding descriptors are added to the freelist again and the descriptor state is set to ’empty’ when one of the following occurs:
- Tables or indexes are dropped.
- Databases are dropped.
- Tables or indexes are cleaned up using the VACUUM FULL command.
The freelist is created to allow for the immediate retrieval of the first descriptor. This is a usual practice for dynamic memory resource allocation. For more information, refer to this description.
The buffer descriptors layer contains an unsigned 32-bit integer variable, i.e. nextVictimBuffer. This variable is used in the page replacement algorithm described in Section 8.4.4.
8.2.4. Buffer Pool
The buffer pool is a simple array that stores data file pages, such as tables and indexes. The indices of the buffer pool array are called buffer_ids.
The buffer pool slot size is 8 KB, which is equal to the size of a page. Therefore, each slot can store an entire page.