11.4. Replication Slots

As discussed in Section 11.1.1, replication slots, introduced in version 9.4, were designed primarily to ensure that WAL segments and old tuple versions are retained for as long as replication still needs them.

We will explore replication slots in this section.

Note that although replication slots are fundamental to logical replication, this section does not cover them in that context.

11.4.1. Advantages of Replication Slots in Streaming Replication

In streaming replication, although replication slots are not mandatory, they offer the following advantages compared to using wal_keep_size:

  1. Ensure Streaming Replication Works Without Losing Required WAL Segments:
    Replication slots track which WAL segments are needed and prevent their removal. In contrast, when using only wal_keep_size, necessary segments may be removed if standbys do not read them for an extended period.

  2. Maintain Only the Minimum Necessary WAL Segments:
    With replication slots, only the required WAL segments are kept in the pg_wal directory, while unnecessary segments are removed. Conversely, wal_keep_size always retains a fixed amount of WAL (it is specified as a size, in megabytes), regardless of whether it is needed or not; a configuration sketch of this approach follows this list.
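For comparison, the following sketch shows how the wal_keep_size approach is configured; the 1GB value is purely illustrative, not a recommendation:

-- Always retain at least 1GB of recent WAL in pg_wal,
-- whether or not any standby still needs it (illustrative value).
ALTER SYSTEM SET wal_keep_size = '1GB';
SELECT pg_reload_conf();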

max_slot_wal_keep_size

Since replication slots can retain WAL segments indefinitely, there is a risk that, in the worst-case scenario, the retained segments fill up the storage area, potentially causing the PostgreSQL server to shut down with a PANIC when it can no longer write WAL.

To address this issue, the configuration parameter max_slot_wal_keep_size was introduced in version 13. It limits the size of WAL files that replication slots are allowed to retain in the pg_wal directory; the limit is enforced at checkpoint time.

The key difference between using max_slot_wal_keep_size with replication slots and wal_keep_size is in how they manage WAL segments:

  • max_slot_wal_keep_size sets an upper bound on the WAL that replication slots may retain, while the slots themselves still keep only the minimum required amount (see the sketch after this list).
  • wal_keep_size specifies a fixed amount of WAL to be retained, regardless of whether it is needed or not.
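As a rough sketch of the former approach, and assuming a 10GB budget is acceptable on the primary (the value is purely illustrative), the limit can be set and applied with a reload:

-- Cap the amount of WAL that replication slots may force pg_wal to retain.
-- Slots that fall further behind than this are invalidated at checkpoint time.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();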

11.4.2. Replication Slots and Related Processes and Files

Replication slots are stored in a memory area allocated within shared memory.

Figure 11.8 illustrates replication slots and the related processes and files:

Figure 11.8. Replication Slots and Related Processes and Files.

The processes related to replication slots are shown below:

  • Walsender: Continuously updates the corresponding replication slot to reflect the replication progress reported by the standby server.

  • Checkpointer background process: Reads the replication slots during checkpointing to determine whether WAL segments can be removed.

  • Postgres backend: Displays the slot information using the system view pg_replication_slots.

The files related to replication slots are shown below:

  • State files under the pg_replslot directory: Walsenders regularly save detailed information about their replication slots to state files located in this directory. When the server is restarted, it loads this saved information back into memory to restore the status of its replication slots.
    The state is defined by the ReplicationSlotPersistentData structure, described in the next section.

  • WAL segment files under the pg_wal directory: The number of WAL segment files is managed by the checkpointer background process. When removing old segments, the checkpointer gives priority to replication slots and wal_keep_size over max_wal_size, so that essential WAL data is not removed.
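The effect of these decisions can be observed from a client session: since version 13, the wal_status and safe_wal_size columns of pg_replication_slots report whether the WAL required by each slot is still available and how much WAL can still be written before max_slot_wal_keep_size invalidates the slot. A minimal query sketch:

-- wal_status is 'reserved' in the normal case; 'lost' means required WAL
-- has already been removed and the slot is no longer usable.
SELECT slot_name, wal_status, safe_wal_size
  FROM pg_replication_slots;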

11.4.3. Data Structure

Replication slots are defined by the ReplicationSlot structure in slot.h:

/*
 * Shared memory state of a single replication slot.
 *
 * The in-memory data of replication slots follows a locking model based
 * on two linked concepts:
 * - A replication slot's in_use flag is switched when added or discarded using
 * the LWLock ReplicationSlotControlLock, which needs to be hold in exclusive
 * mode when updating the flag by the backend owning the slot and doing the
 * operation, while readers (concurrent backends not owning the slot) need
 * to hold it in shared mode when looking at replication slot data.
 * - Individual fields are protected by mutex where only the backend owning
 * the slot is authorized to update the fields from its own slot.  The
 * backend owning the slot does not need to take this lock when reading its
 * own fields, while concurrent backends not owning this slot should take the
 * lock when reading this slot's data.
 */
typedef struct ReplicationSlot
{
	/* lock, on same cacheline as effective_xmin */
	slock_t		mutex;

	/* is this slot defined */
	bool		in_use;

	/* Who is streaming out changes for this slot? 0 in unused slots. */
	pid_t		active_pid;

	/* any outstanding modifications? */
	bool		just_dirtied;
	bool		dirty;

	/*
	 * For logical decoding, it's extremely important that we never remove any
	 * data that's still needed for decoding purposes, even after a crash;
	 * otherwise, decoding will produce wrong answers.  Ordinary streaming
	 * replication also needs to prevent old row versions from being removed
	 * too soon, but the worst consequence we might encounter there is
	 * unwanted query cancellations on the standby.  Thus, for logical
	 * decoding, this value represents the latest xmin that has actually been
	 * written to disk, whereas for streaming replication, it's just the same
	 * as the persistent value (data.xmin).
	 */
	TransactionId effective_xmin;
	TransactionId effective_catalog_xmin;

	/* data surviving shutdowns and crashes */
	ReplicationSlotPersistentData data;

	/* is somebody performing io on this slot? */
	LWLock		io_in_progress_lock;

	/* Condition variable signaled when active_pid changes */
	ConditionVariable active_cv;

	/* all the remaining data is only used for logical slots */

	/*
	 * When the client has confirmed flushes >= candidate_xmin_lsn we can
	 * advance the catalog xmin.  When restart_valid has been passed,
	 * restart_lsn can be increased.
	 */
	TransactionId candidate_catalog_xmin;
	XLogRecPtr	candidate_xmin_lsn;
	XLogRecPtr	candidate_restart_valid;
	XLogRecPtr	candidate_restart_lsn;

	/*
	 * This value tracks the last confirmed_flush LSN flushed which is used
	 * during a shutdown checkpoint to decide if logical's slot data should be
	 * forcibly flushed or not.
	 */
	XLogRecPtr	last_saved_confirmed_flush;

	/* The time since the slot has become inactive */
	TimestampTz inactive_since;
} ReplicationSlot;

#define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
#define SlotIsLogical(slot) ((slot)->data.database != InvalidOid)

/*
 * Shared memory control area for all of replication slots.
 */
typedef struct ReplicationSlotCtlData
{
	/*
	 * This array should be declared [FLEXIBLE_ARRAY_MEMBER], but for some
	 * reason you can't do that in an otherwise-empty struct.
	 */
	ReplicationSlot replication_slots[1];
} ReplicationSlotCtlData;

The structure contains many fields because it is shared by both streaming and logical replication; the main fields relevant to streaming replication are as follows:

  • pid_t active_pid: The PID of the walsender process that manages this slot.
  • ReplicationSlotPersistentData data: Fields defined by the ReplicationSlotPersistentData structure, shown below. The main fields include:
    • NameData name: The name of the slot.
    • XLogRecPtr restart_lsn: The oldest LSN that might be required by this replication slot. The checkpointer reads the minimum restart_lsn value across all slots to determine whether WAL segments can be removed.
/*
 * On-Disk data of a replication slot, preserved across restarts.
 */
typedef struct ReplicationSlotPersistentData
{

	NameData	name;

	/* database the slot is active on */
	Oid			database;

	/*
	 * The slot's behaviour when being dropped (or restored after a crash).
	 */
	ReplicationSlotPersistency persistency;

	TransactionId xmin;

	/*
	 * xmin horizon for catalog tuples
	 *
	 * NB: This may represent a value that hasn't been written to disk yet;
	 * see notes for effective_xmin, below.
	 */
	TransactionId catalog_xmin;

	/* oldest LSN that might be required by this replication slot */
	XLogRecPtr	restart_lsn;

	/* RS_INVAL_NONE if valid, or the reason for having been invalidated */
	ReplicationSlotInvalidationCause invalidated;

	/*
	 * Oldest LSN that the client has acked receipt for.  This is used as the
	 * start_lsn point in case the client doesn't specify one, and also as a
	 * safety measure to jump forwards in case the client specifies a
	 * start_lsn that's further in the past than this value.
	 */
	XLogRecPtr	confirmed_flush;

	/*
	 * LSN at which we enabled two_phase commit for this slot or LSN at which
	 * we found a consistent point at the time of slot creation.
	 */
	XLogRecPtr	two_phase_at;

	/*
	 * Allow decoding of prepared transactions?
	 */
	bool		two_phase;

	/* plugin name */
	NameData	plugin;

	/*
	 * Was this slot synchronized from the primary server?
	 */
	bool		synced;

	/*
	 * Is this a failover slot (sync candidate for standbys)? Only relevant
	 * for logical slots on the primary server.
	 */
	bool		failover;
} ReplicationSlotPersistentData;

The data stored in the ReplicationSlotPersistentData structure is regularly saved to the state files in the pg_replslot directory.
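Because the checkpointer keeps every WAL segment from the oldest restart_lsn onward, the amount of WAL currently pinned by each slot can be estimated from a client session with the standard pg_current_wal_lsn() and pg_wal_lsn_diff() functions; a rough sketch, run on the primary:

-- Approximate amount of WAL each slot prevents from being removed.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
           AS retained_wal
  FROM pg_replication_slots;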

11.4.4. Starting a Replication Slot

Figure 11.9 illustrates the starting sequence of a replication slot:

Figure 11.9. Starting Sequence of a Replication Slot.
  1. Creating a (physical) replication slot using the pg_create_physical_replication_slot() function.
    Except for the slot name, the data written to the replication slot is set to its default value.

    testdb=# SELECT * FROM pg_create_physical_replication_slot('standby_slot');
       slot_name   | lsn
    ---------------+-----
     standby_slot  |
    (1 row)

  2. Writing a portion of the slot data, defined by the ReplicationSlotPersistentData structure, in the pg_replslot directory.
    A file named state is created under the subdirectory corresponding to the slot name, as shown below:

    $ ls -1 pg_replslot/
    standby_slot
    $ find pg_replslot/
    pg_replslot/
    pg_replslot/standby_slot
    pg_replslot/standby_slot/state

  3. (Re)Connecting standby server to the primary server.
    To (re)connect the standby server to the primary server, set the primary_slot_name configuration parameter to the name of the replication slot.

    # standby's postgresql.conf
    primary_slot_name = 'standby_slot'
    Then, issue the pg_ctl command with the reload option:
    $ pg_ctl -D $PGDATA_STANDBY reload

  4. Updating the replication slot fields, such as active_pid and restart_lsn.

  5. Writing a portion of the updated slot data in the pg_replslot directory.
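After the standby has reconnected, the result of steps 4 and 5 can be confirmed on the primary with a quick check (the full view output is shown in the next section):

-- The slot should now be active and owned by a walsender process.
SELECT slot_name, active, active_pid, restart_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'standby_slot';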

11.4.5. Managing Replication Slots

After replication slots are set in shared memory, walsender processes continuously update the slots to reflect the current states of the corresponding standby servers.

Below is an example of the states of the replication slots:

testdb=# \x
Expanded display is on.
testdb=# SELECT * FROM pg_replication_slots;
-[ RECORD 1 ]-------+--------------
slot_name           | standby_slot
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | t
active_pid          | 236772
xmin                | 754
catalog_xmin        |
restart_lsn         | 0/303B968
confirmed_flush_lsn |
wal_status          | reserved
safe_wal_size       |
two_phase           | f
inactive_since      |
conflicting         |
invalidation_reason |
failover            | f
synced              | f

The primary PostgreSQL server regularly saves detailed information about its replication slots to state files in the pg_replslot directory. (When the primary server restarts, it loads this saved information back into memory to restore the status of its replication slots.)
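Since an unused slot keeps retaining WAL indefinitely, a slot that is no longer needed should be removed. A minimal sketch using pg_drop_replication_slot(), which requires the slot to be inactive:

-- Drop the slot once no standby uses it, releasing the retained WAL.
SELECT pg_drop_replication_slot('standby_slot');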