9.5. Writing of XLOG Records

Having finished the warm-up exercises, we are now ready to understand the writing of XLOG records. This section will provide a comprehensive overview of the process.

First, issue the following statement to explore the PostgreSQL internals:

testdb=# INSERT INTO tbl VALUES ('A');

By issuing the above statement, the internal function exec_simple_query() is invoked. The pseudocode of exec_simple_query() is shown below:

exec_simple_query() @postgres.c

(1) ExtendCLOG() @clog.c                  /* Write the state of this transaction
                                           * "IN_PROGRESS" to the CLOG.
                                           */
(2) heap_insert()@heapam.c                /* Insert a tuple, creates a XLOG record,
                                           * and invoke the function XLogInsert.
                                           */
(3)   XLogInsert() @xloginsert.c (9.4 or earlier, xlog.c)
                                          /* Write the XLOG record of the inserted tuple
                                           *  to the WAL buffer, and update page's pd_lsn.
                                           */
(4) finish_xact_command() @postgres.c     /* Invoke commit action.*/
      XLogInsert() @xloginsert.c (9.4 or earlier, xlog.c)
                                          /* Write a XLOG record of this commit action
                                           * to the WAL buffer.
                                           */
(5)   XLogWrite() @xloginsert.c (9.4 or earlier, xlog.c)
                                          /* Write and flush all XLOG records on
                                           * the WAL buffer to WAL segment.
                                           */
(6) TransactionIdCommitTree() @transam.c  /* Change the state of this transaction
                                           * from "IN_PROGRESS" to "COMMITTED"
                                           * on the CLOG.
                                           */

The following paragraphs explain each line of the pseudocode to illustrate XLOG record writing. For a visual representation, refer to Figures 9.11 and 9.12.

The function ExtendCLOG() writes the state of this transaction ‘IN_PROGRESS’ in the (in-memory) CLOG.
The function heap_insert() inserts a heap tuple into the target page in the shared buffer pool, creates the XLOG record for that page, and invokes the function XLogInsert().
The function XLogInsert() writes the XLOG record created by the heap_insert() to the WAL buffer at LSN_1, and then updates the modified page’s pd_lsn from LSN_0 to LSN_1.
The function finish_xact_command(), which invoked to commit this transaction, creates the XLOG record for the commit action, and then the function XLogInsert() writes this record to the WAL buffer at LSN_2.

Figure 9.11. Write-sequence of XLOG records.
*The format of these XLOG records is version 9.4.*

The function XLogWrite() writes and flushes all XLOG records on the WAL buffer to the WAL segment file.
If the parameter wal_sync_method is set to ‘open_sync’ or ‘open_datasync’, the records are synchronously written because the function writes all records with the open() system call specified the flag ‘O_SYNC’ or ‘O_DSYNC’.
If the parameter is set to ‘fsync’, ‘fsync_writethrough’ or ‘fdatasync’, the respective system call – fsync(), fcntl() with F_FULLFSYNC option, or fdatasync() – will be executed. In any case, all XLOG records are ensured to be written into the storage.
The function TransactionIdCommitTree() changes the state of this transaction from ‘IN_PROGRESS’ to ‘COMMITTED’ on the CLOG.

Figure 9.12. Write-sequence of XLOG records. (continued from Figure 9.11)

In the above example, the commit action caused the writing of XLOG records to the WAL segment, but such writing may be caused by any of the following:

One running transaction has committed or aborted.
The WAL buffer has been filled up with many tuples. (The WAL buffer size is set to the parameter wal_buffers.)
A WAL writer process writes periodically. (See the next section.)

If any of the above occurs, all WAL records on the WAL buffer are written into a WAL segment file regardless of whether their transactions have been committed or not.

Direct I/O

PostgreSQL versions 15 or earlier do not support direct I/O, although it has been discussed. Refer to this discussion on the pgsql-ML and this article.

In version 16, the debug-io-direct option has been added. This option is for developers to improve the use of direct I/O in PostgreSQL. If development goes well, direct I/O will be officially supported in the near future.

9.5.1. Remark on Writing XLOG records

It is well understood that DML (Data Manipulation Language) operations generate XLOG records. However, non-DML operations can also create XLOG records. For instance:

A commit action writes an XLOG record containing the ID of the committed transaction.
A checkpoint action writes an XLOG record containing general information about the checkpoint.

In special cases, even SELECT statements can generate XLOG records, despite typically not doing so. For example:

A SELECT FOR UPDATE statement generates XLOG records for all target tuple locks (ROW SHARE LOCK) ¹.
During Heap-Only Tuple (HOT) operations, where unnecessary tuples are deleted and necessary tuples are defragmented within pages, XLOG records for the modified pages are written to the WAL buffer.

Additionally, if the wal_level parameter is set to replica or higher, ACCESS EXCLUSIVE LOCKS are also recorded as XLOG records. This occurs not only when an ACCESS EXCLUSIVE LOCK is explicitly acquired using the LOCK command but also when commands like DROP TABLE and TRUNCATE (which acquire EXCLUSIVE LOCKS) are executed, as described in the official document.

Occasionally, users report unexpected increases in WAL segment consumption despite executing only SELECT commands. In such cases, check whether SELECT FOR UPDATE commands are being run. ↩︎

9.5. Writing of XLOG Records

Figure 9.11. Write-sequence of XLOG records.

Figure 9.12. Write-sequence of XLOG records. (continued from Figure 9.11)

9.5.1. Remark on Writing XLOG records