8.2. Buffer Manager Structure

The PostgreSQL buffer manager comprises three layers: the ‘buffer table’, ‘buffer descriptors’ and ‘buffer pool’ (Figure 8.3):

Figure 8.3. Buffer manager's three-layer structure.

Buffer pool: An array that stores data file pages. Each slot in the array is referred to as buffer_ids.
Buffer descriptors: An array of buffer descriptors. Each descriptor has a one-to-one correspondence to a buffer pool slot and holds the metadata of the stored page in the corresponding slot.
Note that the term ‘buffer descriptors layer’ has been adopted for convenience and is only used in this document.
Buffer table: A hash table that stores the relations between the buffer_tags of stored pages and the buffer_ids of the descriptors that hold the stored pages’ respective metadata.

These layers are described in detail in the following subsections.

8.2.1. Buffer Table

A buffer table can be logically divided into three parts: a hash function, hash bucket slots, and data entries (Figure 8.4).

The built-in hash function maps buffer_tags to the hash bucket slots. Even though the number of hash bucket slots is greater than the number of buffer pool slots, collisions may occur. Therefore, the buffer table uses a separate chaining with linked lists method to resolve collisions. When data entries are mapped to the same bucket slot, this method stores the entries in the same linked list, as shown in Figure 8.4.

A data entry comprises two values: the buffer_tag of a page, and the buffer_id of the descriptor that holds the page’s metadata. For example, a data entry ‘Tag_A, id=1’ means that the buffer descriptor with buffer_id ‘1’ stores metadata of the page tagged with Tag_A.

Hash Function

The hash function is a composite function of calc_bucket() and hash().

The following is its representation as a pseudo-function.

uint32 bucket_slot = calc_bucket(unsigned hash(BufferTag buffer_tag), uint32 bucket_size)

Note: Basic operations (lookup, insertion, and deletion of data entries) are not explained here. These are very common operations and are explained in the following sections.

8.2.2. Buffer Descriptor

The structure of buffer descriptors was improved in version 9.6 This section first explains buffer descriptors in versions 9.5 or earlier, followed by a discussion of the changes introduced in versions 9.6 or later.

The buffer descriptors layer is described in the next subsection.

8.2.2.1. Versions 9.5 or earlier

The buffer descriptor structure in versions 9.5 and earlier holds the metadata of the stored page in the corresponding buffer pool slot. The buffer descriptor structure is defined by the BufferDesc structure. The following are some of the main fields:

BufferDesc

/*
 * Flags for buffer descriptors
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_DIRTY                (1 << 0)    /* data needs writing */
#define BM_VALID                (1 << 1)    /* data is valid */
#define BM_TAG_VALID            (1 << 2)    /* tag is assigned */
#define BM_IO_IN_PROGRESS       (1 << 3)    /* read or write in progress */
#define BM_IO_ERROR             (1 << 4)    /* previous I/O failed */
#define BM_JUST_DIRTIED         (1 << 5)    /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1 << 6)    /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1 << 7)    /* must write for checkpoint */
#define BM_PERMANENT            (1 << 8)    /* permanent relation (not unlogged) */

src/include/storage/buf_internals.h
typedef struct sbufdesc
{
   BufferTag    tag;                 /* ID of page contained in buffer */
   BufFlags     flags;               /* see bit definitions above */
   uint16       usage_count;         /* usage counter for clock sweep code */
   unsigned     refcount;            /* # of backends holding pins on buffer */
   int          wait_backend_pid;    /* backend PID of pin-count waiter */
   slock_t      buf_hdr_lock;        /* protects the above fields */
   int          buf_id;              /* buffer's index number (from 0) */
   int          freeNext;            /* link in freelist chain */

   LWLockId     io_in_progress_lock; /* to wait for I/O to complete */
   LWLockId     content_lock;        /* to lock access to buffer contents */
} BufferDesc;

tag holds the buffer_tag of the stored page in the corresponding buffer pool slot. (buffer tag is defined in Section 8.1.2.)
buf_id identifies the descriptor. It is equivalent to the buffer_id of the corresponding buffer pool slot.
refcount (also referred to as pin count) holds the number of PostgreSQL processes currently accessing the associated stored page.
When a PostgreSQL process accesses the stored page, the refcount must be incremented by 1 (refcount++). After accessing the page, the refcount must be decreased by 1 (refcount–).
When the refcount is zero, the associated stored page is unpinned, meaning it is not currently being accessed. Otherwise, it is pinned.
usage_count holds the number of times the associated stored page has been accessed since it was loaded into the corresponding buffer pool slot. It is used in the page replacement algorithm (Section 8.4.4).
content_lock and io_in_progress_lock are light-weight locks that are used to control access to the associated stored page. These fields are described in Section 8.3.2.
flags can hold several states of the associated stored page. The main states are as follows:
- dirty bit indicates that the stored page is dirty.
- valid bit indicates whether the stored page is valid, meaning it can be read or written.
  If this bit is valid, then the corresponding buffer pool slot stores a page and the descriptor holds the page metadata, and the stored page can be read or written.
  If this bit is invalid, then the descriptor does not hold any metadata and the stored page cannot be read or written.
- io_in_progress bit indicates whether the buffer manager is reading or writing the associated page from or to storage.
buf_hdr_lock is a spin lock that protects the fields: flags, usage_count, refcount.
freeNext is a pointer to the next descriptor to generate a freelist, which is described in the next subsection.

8.2.2.2. Versions 9.6 or later

The buffer descriptor structure is defined by the BufferDesc structure.

BufferDesc

/*
 * Flags for buffer descriptors
 *
 * Note: BM_TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED		(1U << 22)	/* buffer header is locked */
#define BM_DIRTY		(1U << 23)	/* data needs writing */
#define BM_VALID		(1U << 24)	/* data is valid */
#define BM_TAG_VALID		(1U << 25)	/* tag is assigned */
#define BM_IO_IN_PROGRESS	(1U << 26)	/* read or write in progress */
#define BM_IO_ERROR		(1U << 27)	/* previous I/O failed */
#define BM_JUST_DIRTIED		(1U << 28)	/* dirtied since write started */
#define BM_PIN_COUNT_WAITER	(1U << 29)	/* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED	(1U << 30)	/* must write for checkpoint */
#define BM_PERMANENT		(1U << 31)	/* permanent buffer (not unlogged,
						 * or init fork) */

#define PG_HAVE_ATOMIC_U32_SUPPORT
typedef struct pg_atomic_uint32
{
	volatile uint32 value;
} pg_atomic_uint32;


typedef struct BufferDesc
{
	BufferTag	tag;			/* ID of page contained in buffer */
	int		buf_id;			/* buffer's index number (from 0) */

	/* state of the tag, containing flags, refcount and usagecount */
	pg_atomic_uint32 state;

	int		wait_backend_pgprocno;	/* backend of pin-count waiter */
	int		freeNext;		/* link in freelist chain */
	LWLock		content_lock;	/* to lock access to buffer contents */
} BufferDesc;

tag holds the buffer_tag of the stored page in the corresponding buffer pool slot.
buf_id identifies the descriptor.
content_lock is a light-weight lock that is used to control access to the associated stored page.
freeNext is a pointer to the next descriptor to generate a freelist.
states can hold several states and variables of the associated stored page, such as refcount and usage_count.

The flags, usage_count, and refcount fields have been combined into a single 32-bit data (states) to use the CPU atomic operations. Therefore, the io_in_progress_lock and spin lock (buf_hdr_lock) have been removed since there is no longer a need to protect these values.

Atomic Operations

An atomic operation is one that executes as a single, indivisible unit without being interrupted by context switches or other processes. In operating systems, this ensures that even if multiple processes or threads attempt to access the same value concurrently, the operation maintains consistency without requiring explicit synchronization.

These operations are commonly used for tasks such as updating counters without locks or implementing lock-free data structures.

Info

The structure BufferDesc is defined in src/include/storage/buf_internals.h.

8.2.2.3. Descriptor States

To simplify the following descriptions, we define three descriptor states as follows:

Empty: When the corresponding buffer pool slot does not store a page (i.e. refcount and usage_count are 0), the state of this descriptor is empty.
Pinned: When the corresponding buffer pool slot stores a page and any PostgreSQL processes are accessing the page (i.e. refcount and usage_count are greater than or equal to 1), the state of this buffer descriptor is pinned.
Unpinned: When the corresponding buffer pool slot stores a page but no PostgreSQL processes are accessing the page (i.e. usage_count is greater than or equal to 1, but refcount is 0), the state of this buffer descriptor is unpinned.

Each descriptor will have one of the above states. The descriptor state changes depending on certain conditions, which are described in the next subsection.

In the following figures, buffer descriptors’ states are represented by coloured boxes.

(white) Empty
(blue) Pinned
(aqua blue) Unpinned

In addition, a dirty page is denoted as ‘X’. For example, an unpinned dirty descriptor is represented by X .

8.2.3. Buffer Descriptors Layer

A collection of buffer descriptors forms an array, which is referred to as the buffer descriptors layer in this document.

When the PostgreSQL server starts, the state of all buffer descriptors is ’empty’. In PostgreSQL, those descriptors comprise a linked list called freelist (Figure 8.5).

Figure 8.5. Buffer manager initial state.

Note

It is important to note that the freelist in PostgreSQL is completely different concept from the freelists in Oracle. The freelist in PostgreSQL is simply linked list of empty buffer descriptors. In PostgreSQL, freespace maps (FSM), which are described in Section 5.3.4, serve as the same purpose as the freelists in Oracle.

Figure 8.6 shows that how the first page is loaded.

(1) Retrieve an empty descriptor from the top of the freelist, and pin it (i.e. increase its refcount and usage_count by 1).
(2) Insert a new entry into the buffer table that maps the tag of the first page to the buffer_id of the retrieved descriptor.
(3) Load the new page from storage into the corresponding buffer pool slot.
(4) Save the metadata of the new page to the retrieved descriptor.

The second and subsequent pages are loaded in a similar manner. Additional details are provided in Section 8.4.2.

Descriptors that have been retrieved from the freelist always hold page’s metadata. In other words, non-empty descriptors do not return to the freelist once they have been used. However, the corresponding descriptors are added to the freelist again and the descriptor state is set to ’empty’ when one of the following occurs:

Tables or indexes are dropped.
Databases are dropped.
Tables or indexes are cleaned up using the VACUUM FULL command.

Why empty descriptors comprise the freelist?

The freelist is created to allow for the immediate retrieval of the first descriptor. This is a usual practice for dynamic memory resource allocation. For more information, refer to this description.

The buffer descriptors layer contains an unsigned 32-bit integer variable, i.e. nextVictimBuffer. This variable is used in the page replacement algorithm described in Section 8.4.4.

8.2.4. Buffer Pool

The buffer pool is a simple array that stores data file pages, such as tables and indexes. The indices of the buffer pool array are called buffer_ids.

The buffer pool slot size is 8 KB, which is equal to the size of a page. Therefore, each slot can store an entire page.