From 957ef431a6a14afe1b387b47fa1e60facda537f5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 25 May 2018 09:43:16 +1200
Subject: [PATCH 3/6] Add developer documentation for the undo log storage
 subsystem.

This document provides an overview of the design.

Author: Thomas Munro
Discussion:
---
 src/backend/access/undo/README  | 169 ++++++++++++++++++++++++++++++++
 src/backend/storage/smgr/README |  23 +++--
 2 files changed, 184 insertions(+), 8 deletions(-)
 create mode 100644 src/backend/access/undo/README
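
Note on the addressing scheme: the new README describes an UndoRecPtr as
a 64 bit value whose first 24 bits identify the undo log number and whose
remaining 40 bits address space within that log.  A minimal sketch of
that split follows.  It assumes "first" means most significant, and all
macro and function names here are hypothetical, chosen for illustration
rather than taken from the patch.

```c
#include <stdint.h>

/* Hypothetical names; only the 24/40 bit split comes from the README. */
typedef uint64_t UndoRecPtr;

#define UNDO_LOG_NUMBER_BITS 24
#define UNDO_LOG_OFFSET_BITS 40
#define UNDO_LOG_NUMBER_MASK ((UINT64_C(1) << UNDO_LOG_NUMBER_BITS) - 1)
#define UNDO_LOG_OFFSET_MASK ((UINT64_C(1) << UNDO_LOG_OFFSET_BITS) - 1)

/* Pack a log number and an offset within that log into one pointer. */
static inline UndoRecPtr
make_undo_rec_ptr(uint32_t logno, uint64_t offset)
{
    return (((uint64_t) logno & UNDO_LOG_NUMBER_MASK) << UNDO_LOG_OFFSET_BITS)
        | (offset & UNDO_LOG_OFFSET_MASK);
}

/* Extract the undo log number (assumed to be the high 24 bits). */
static inline uint32_t
undo_rec_ptr_logno(UndoRecPtr ptr)
{
    return (uint32_t) (ptr >> UNDO_LOG_OFFSET_BITS);
}

/* Extract the byte offset within the undo log (the low 40 bits). */
static inline uint64_t
undo_rec_ptr_offset(UndoRecPtr ptr)
{
    return ptr & UNDO_LOG_OFFSET_MASK;
}
```

Under these assumptions, make_undo_rec_ptr(5, 8192) round-trips to log
number 5 and offset 8192, and each undo log can address 2^40 bytes (1TB).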

diff --git a/src/backend/access/undo/README b/src/backend/access/undo/README
new file mode 100644
index 00000000000..9ba81f960de
--- /dev/null
+++ b/src/backend/access/undo/README
@@ -0,0 +1,169 @@
+src/backend/access/undo/README
+
+Undo Logs
+=========
+
+The undo log subsystem provides a way to store data that is needed for
+a limited time.  Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes.  See src/backend/access/zheap/README for
+more information on zheap.  The undo log subsystem is concerned with
+raw storage optimized for efficient recycling and buffered random
+access.
+
+Like redo data (the WAL), undo data consists of records identified by
+their location within a 64 bit address space.  Unlike redo data, the
+address space is internally divided up into multiple numbered logs.
+The first 24 bits of an UndoRecPtr identify the undo log number, and
+the remaining 40 bits address the space within that undo log.  Higher
+level code (zheap) is largely oblivious to this internal structure and
+deals mostly in opaque UndoRecPtr values.
+
+Using multiple undo logs instead of a single uniform space avoids the
+contention that would result from a single insertion point, since each
+session can be given sole access to write data into a given undo log.
+It also allows for parallelized space reclamation.
+
+Like redo data, undo data is stored on disk in numbered segment files
+that are recycled as required.  Unlike redo data, undo data is
+accessed through the buffer pool.  In this respect it is similar to
+regular relation data.  Buffer content is written out to disk during
+checkpoints and whenever it is evicted to make space for another page.
+However, unlike regular relation data, undo data has a chance of never
+being written to disk at all: if a page is allocated and then
+later discarded without an intervening checkpoint and without an
+eviction provoked by memory pressure, then no disk IO is generated.
+
+Keeping the undo data physically separate from redo data and accessing
+it through the existing shared buffers mechanism allows it to be
+accessed efficiently for MVCC purposes.
+
+Meta-Data
+=========
+
+At any given time the set of undo logs that exists is tracked in
+shared memory and can be inspected in the pg_stat_undo_logs view.  For
+each undo log, a set of properties called the undo log's meta-data is
+tracked:
+
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged, temporary)
+* the "discard" pointer: data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point
+
+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted.  At all times discard <=
+insert <= end.  When discard == insert, the undo log is empty
+(everything that has ever been inserted has since been discarded).
+The insert pointer advances when regular backends allocate new space,
+and the discard pointer usually advances when an undo worker process
+determines that no session could need the data either for rollback or
+for finding old versions of tuples to satisfy a snapshot.  In some
+special cases including single-user mode and temporary undo logs the
+discard pointer might also be advanced synchronously by a foreground
+session.
+
+In order to provide constant time access to undo log meta-data given
+an UndoRecPtr, there is conceptually an array of UndoLogControl
+objects indexed by undo log number.  Since that array would be too
+large and since we expect the set of active undo log numbers to be
+small and clustered, we only keep small ranges of that logical array
+in memory at a time.  We use the higher order bits of the undo log
+number to identify a 'bank' (array fragment), and then the lower order
+bits to identify a slot within the bank.  Each bank is backed by a DSM
+segment.  We expect to need just 1 or 2 such DSM segments to exist at
+any time.
+
+The meta-data for all undo logs is written to disk at every
+checkpoint.  It is stored in files under PGDATA/pg_undo/, using the
+checkpoint's redo point (a WAL LSN) as its filename.  At startup time,
+the redo point's file can be used to restore all undo logs' meta-data
+as of the moment of the redo point into shared memory.  Changes to the
+discard pointer and end pointer are WAL-logged by undolog.c and will
+bring the in-memory meta-data up to date in the event of recovery
+after a crash.  Changes to insert pointers are included in other WAL
+records (see below).
+
+Responsibility for creating, deleting and recycling undo log segment
+files and WAL logging the associated meta-data changes lies with
+src/backend/access/undo/undolog.c.
+
+Persistence Levels and Tablespaces
+==================================
+
+When new undo log space is requested by client code, the persistence
+level of the relation being modified and the current value of the GUC
+"undo_tablespaces" control which undo log is selected.  If the
+session is already attached to a suitable undo log and it hasn't run
+out of address space, it can be used immediately.  Otherwise a
+suitable undo log must be either found or created.  The system should
+stabilize on one undo log per active writing backend (or more if
+different tablespaces or persistence levels are used).
+
+When an unlogged relation is modified, undo data generated by the
+operation must be stored in an unlogged undo log.  This causes the
+undo data to be deleted along with all unlogged relations during
+recovery from a non-shutdown checkpoint.  Likewise, temporary
+relations require special treatment: their buffers are backend-local
+and they cannot be accessed by other backends, including undo workers.
+
+Non-empty undo logs in a tablespace prevent the tablespace from being
+dropped.
+
+Undo Log Contents
+=================
+
+Undo log contents are written into 1MB segment files under
+PGDATA/base/undo/ or PGDATA/pg_tblspc/VERSION/undo/ using filenames
+that encode the address (UndoRecPtr) of their first byte.  A period
+'.' separates the undo log number part from the offset part, for the
+benefit of human administrators.
+
+Undo logs are page-oriented and use regular PostgreSQL page headers
+including checksums (if enabled) and LSNs.  An UndoRecPtr can be used
+to obtain a buffer and an offset within the buffer, and then regular
+buffer locking and page LSN rules apply.  While space is allocated by
+asking for a given number of usable bytes (not including page
+headers), client code is responsible for stepping over the page
+headers and advancing to the next page.
+
+Responsibility for WAL-logging the contents of the undo log lies with
+client code (i.e. zheap).  While undolog.c WAL-logs all meta-data
+changes except insert points, and checkpoints all meta-data including
+insert points, client code is responsible for allocating undo log
+space in the same sequence at recovery time.  This avoids having to
+WAL-log insertion points explicitly and separately for every insertion
+into an undo log, greatly reducing WAL traffic.  (WAL is still
+generated by undolog.c whenever a 1MB segment boundary is crossed,
+since that also advances the end pointer.)
+
+One complication of this scheme for implicit insert pointer movement
+is that recovery doesn't naturally have access to the association
+between transactions and undo logs.  That is, while 'do' sessions have
+a currently attached undo log from which they allocate new space,
+recovery is performed by a single startup process which has no concept
+of the sessions that generated the WAL it is replaying.  For that
+reason, an xid->undo log number map is maintained at recovery time.
+At 'do' time, a WAL record is emitted the first time any permanent
+undo log is used in a given transaction, so that the mapping can be
+recovered at redo time.  That allows a stream of allocations to be
+directed to the appropriate undo logs so that the same resulting
+stream of undo log pointers can be produced.  (Unlogged and temporary
+undo logs don't have this problem since they aren't used at recovery
+time.)
+
+Another complication is that the checkpoint files written under pg_undo
+may contain inconsistent data during recovery from an online checkpoint
+(after a crash or base backup).  To compensate for this, client code
+must arrange to log an undo log meta-data record when inserting the
+first WAL record that might cause undo log access during recovery.
+This is conceptually similar to full page images after checkpoints,
+but limited to one meta-data WAL record per undo log per checkpoint.
+
+src/backend/storage/buffer/bufmgr.c is unaware of the existence of
+undo logs as a separate category of buffered data.  Reading and writing
+of buffered undo log pages is handled by a new storage manager in
+src/backend/storage/smgr/undo_file.c.  See
+src/backend/storage/smgr/README for more details.
diff --git a/src/backend/storage/smgr/README b/src/backend/storage/smgr/README
index 37ed40b6450..641926f8769 100644
--- a/src/backend/storage/smgr/README
+++ b/src/backend/storage/smgr/README
@@ -10,16 +10,14 @@ memory, but these were never supported in any externally released Postgres,
 nor in any version of PostgreSQL.)  The "magnetic disk" manager is itself
 seriously misnamed, because actually it supports any kind of device for
 which the operating system provides standard filesystem operations; which
-these days is pretty much everything of interest.  However, we retain the
-notion of a storage manager switch in case anyone ever wants to reintroduce
-other kinds of storage managers.  Removing the switch layer would save
-nothing noticeable anyway, since storage-access operations are surely far
-more expensive than one extra layer of C function calls.
+these days is pretty much everything of interest.  However, we retained the
+notion of a storage manager switch and it turned out to be useful for plugging
+in a new storage manager to support buffered undo logs.
 
 In Berkeley Postgres each relation was tagged with the ID of the storage
-manager to use for it.  This is gone.  It would be probably more reasonable
-to associate storage managers with tablespaces, should we ever re-introduce
-multiple storage managers into the system catalogs.
+manager to use for it.  This is gone.  While earlier PostgreSQL releases were
+hard-coded to use md.c unconditionally, PostgreSQL 12 routes IO for the undo
+pseudo-database to undo_file.c.
 
 The files in this directory, and their contents, are
 
@@ -31,6 +29,12 @@ The files in this directory, and their contents, are
     md.c	The "magnetic disk" storage manager, which is really just
 		an interface to the kernel's filesystem operations.
 
+    undo_file.c	The undo log storage manager.  This supports
+		buffer-pool based access to the contents of undo log
+		segment files.  It supports a limited subset of the
+		smgr interface: it can only read and write blocks of
+		existing files.
+
     smgrtype.c	Storage manager type -- maps string names to storage manager
 		IDs and provides simple comparison operators.  This is the
 		regproc support for type "smgr" in the system catalogs.
@@ -38,6 +42,9 @@ The files in this directory, and their contents, are
 		in the catalogs anymore.)
 
 Note that md.c in turn relies on src/backend/storage/file/fd.c.
+undo_file.c also uses fd.c to read and write blocks, but it expects
+src/backend/access/undo/undolog.c to manage the files holding those
+blocks.
 
 
 Relation Forks
-- 
2.17.0