8.2. Buffer Manager Structure

The PostgreSQL buffer manager comprises three layers: the buffer table, buffer descriptors, and buffer pool (Figure 8.3).

Figure 8.3. Buffer manager's three-layer structure.
  • Buffer pool: An array that stores data file pages. Each slot in the array is identified by a buffer_id.

  • Buffer descriptors: An array of buffer descriptors. Each descriptor has a one-to-one correspondence to a buffer pool slot and holds the metadata of the page stored in that slot.

  • Buffer table: A hash table that stores the relationship between the buffer_tags of stored pages and the buffer_ids of the descriptors holding their metadata.

Note: The term “buffer descriptors layer” is adopted for convenience in this documentation.

8.2.1. Buffer Table

A buffer table logically divides into three parts: a hash function, hash bucket slots, and data entries (Figure 8.4).

The built-in hash function maps buffer_tags to hash bucket slots. Even though hash bucket slots outnumber buffer pool slots, collisions may occur. Therefore, the buffer table uses separate chaining with linked lists to resolve collisions. When data entries map to the same bucket slot, this method stores them in the same linked list (Figure 8.4).

Figure 8.4. Buffer table.

A data entry comprises two values: the buffer_tag of a page and the buffer_id of the descriptor holding that page’s metadata. For example, the entry ‘Tag_A, id=1’ indicates that the descriptor with buffer_id ‘1’ stores metadata for the page tagged with Tag_A.

Hash Function

The hash function is a composite of calc_bucket() and hash().

The following is its representation as a pseudo-function:

uint32 bucket_slot = calc_bucket(unsigned hash(BufferTag buffer_tag), uint32 bucket_size)

8.2.2. Buffer Descriptor

A buffer descriptor holds the metadata of the page stored in the corresponding buffer pool slot. The BufferDesc structure defines this descriptor (see buf_internals.h).

/*
 * Flags for buffer descriptors
 */
#define BM_LOCKED		(1U << 22)	/* buffer header is locked */
#define BM_DIRTY		(1U << 23)	/* data needs writing */
#define BM_VALID		(1U << 24)	/* data is valid */
#define BM_TAG_VALID		(1U << 25)	/* tag is assigned */
#define BM_IO_IN_PROGRESS	(1U << 26)	/* read or write in progress */
#define BM_IO_ERROR		(1U << 27)	/* previous I/O failed */
#define BM_JUST_DIRTIED		(1U << 28)	/* dirtied since write started */
#define BM_PIN_COUNT_WAITER	(1U << 29)	/* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED	(1U << 30)	/* must write for checkpoint */
#define BM_PERMANENT		(1U << 31)	/* permanent buffer (not unlogged,
						 * or init fork) */

#define PG_HAVE_ATOMIC_U32_SUPPORT
typedef struct pg_atomic_uint32
{
	volatile uint32 value;
} pg_atomic_uint32;

typedef struct BufferDesc
{
	BufferTag	tag;			/* ID of page contained in buffer */
	int		buf_id;			/* buffer's index number (from 0) */

	/* state of the tag, containing flags, refcount and usagecount */
	pg_atomic_uint32 state;

	int		wait_backend_pgprocno;	/* backend of pin-count waiter */
	int		freeNext;		/* link in freelist chain */
	PgAioWaitRef	io_wref;		/* set iff AIO is in progress */
	LWLock		content_lock;	/* to lock access to buffer contents */
} BufferDesc;

The main fields include:

  • tag: Holds the buffer_tag of the stored page.
  • buf_id: Identifies the descriptor.
  • content_lock: A light-weight lock for page access. Refer to Section 8.3.2 for details.
  • freeNext: A pointer for the freelist.
  • state: A single 32-bit field that holds three variables: flags, usage_count, and refcount.

Figure 8.5 illustrates the components of the state field: flags (10 bits), usage_count (4 bits), and refcount (18 bits).

Figure 8.5. Structure of the state field in BufferDesc.
8.2.2.1. Details of the state field

The variables in the state field are as follows:

  • flags: Stores several states of the page. The primary states are as follows:
    • dirty bit (BM_DIRTY): Indicates the page is dirty.
    • valid bit (BM_VALID): Indicates the page is valid and can be read or written.
    • io_in_progress bit (BM_IO_IN_PROGRESS): Indicates whether the buffer manager is currently reading or writing the page to storage.
  • usage_count: Tracks the number of times the page has been accessed since being loaded. The page replacement algorithm, described in Section 8.4.4, uses this value.
    (This value is capped at $15$ (= $2^{4} - 1$) because the page replacement algorithm functions optimally within this range.)

  • refcount (also called pin count): Holds the number of PostgreSQL processes currently accessing the page.
    A process must increment this value (refcount++) when accessing the page and decrement it (refcount--) afterward.
    The page is unpinned if the refcount is zero; otherwise, it is pinned.

8.2.2.2. Operations on the state field

CPU atomic operations update these variables to ensure efficiency.

Atomic Operations

An atomic operation executes as a single, indivisible unit without interruption. This ensures consistency when multiple processes access the same value concurrently without explicit locks.

For example, to change the usage_count, the system masks the relevant bits and updates the entire 32-bit state at once using an atomic operation. This approach improves performance by avoiding spinlock acquisition for these updates.

Henceforth, this documentation expresses operations on these variables in simplified form. For example, incrementing refcount is expressed as “refcount++”, similar to a regular C variable.

Historical Information

Until version 9.5, flags, usage_count, and refcount were separate variables. Operations used a spinlock to avoid conflicts with other processes.

For example, each time usage_count was incremented, the system acquired a spinlock as follows:

LockBufHdr(bufferdesc); /* Acquire a spinlock */
bufferdesc->usage_count++;
UnlockBufHdr(bufferdesc); /* Release the spinlock */
8.2.2.3. Descriptor States

For simplicity, this documentation defines three descriptor states:

  • Empty: The corresponding buffer pool slot does not store a page (refcount and usage_count are 0).
  • Pinned: The slot stores a page and at least one process is accessing it (refcount $\ge$ 1).
  • Unpinned: The slot stores a page but no processes are currently accessing it (usage_count $\ge$ 1, but refcount is 0).

The following colored boxes represent these states in subsequent figures:

In the following figures, buffer descriptors’ states are represented by coloured boxes.

  •      (white) Empty
  •      (blue) Pinned
  •      (aqua blue) Unpinned

In addition, a dirty page is denoted as ‘X’. For example, an unpinned dirty descriptor is represented by  X .

8.2.3. Buffer Descriptors Layer

An array of buffer descriptors forms the buffer descriptors layer.

When the PostgreSQL server starts, all buffer descriptors are ’empty’. These descriptors form a linked list called the freelist (Figure 8.6).

Figure 8.6. Buffer manager initial state.
Note

The freelist in PostgreSQL is a different concept from the freelists in Oracle.

The PostgreSQL freelist is simply a linked list of empty buffer descriptors. In PostgreSQL, freespace maps (FSM), described in Section 5.3.4, serve the same purpose as the Oracle freelists.

Figure 8.7 shows how the first page is loaded.

  • (1) Retrieve an empty descriptor from the top of the freelist and pin it (i.e., increase its refcount and usage_count by 1).
  • (2) Insert a new entry into the buffer table to map the tag of the first page to the buffer_id of the retrieved descriptor.
  • (3) Load the new page from storage into the corresponding buffer pool slot.
  • (4) Save the metadata of the new page to the retrieved descriptor.

The second and subsequent pages are loaded in a similar manner. Section 8.4.2 provides additional details.

Figure 8.7. Loading the first page.

Descriptors retrieved from the freelist always hold page metadata. These non-empty descriptors do not return to the freelist once used. However, the corresponding descriptors return to the freelist and their state is set to ’empty’ when one of the following occurs:

  1. Tables or indexes are dropped.
  2. Databases are dropped.
  3. Tables or indexes are cleaned up using the VACUUM FULL command.
Why empty descriptors comprise the freelist?

The freelist allows for the immediate retrieval of the first descriptor. This is a common practice for dynamic memory resource allocation. For more information, refer to this description.

The buffer descriptors layer contains an unsigned 32-bit integer variable, nextVictimBuffer. This variable is used the page replacement algorithm, described in Section 8.4.4.

8.2.4. Buffer Pool

The buffer pool is a simple array that stores data file pages, such as tables and indexes. The indices of the buffer pool array are called buffer_ids.

The buffer pool slot size is 8 KB, which is equal to the page size. Therefore, each slot can store an entire page.