From a7bbf41909d7d968f143b781fd26ae54817778ce Mon Sep 17 00:00:00 2001
From: Bruce Momjian <bruce@momjian.us>
Date: Tue, 6 Apr 2021 14:23:35 -0400
Subject: [PATCH] cfe-02-internaldoc_over_cfe-01-doc squash commit

---
 src/backend/crypto/README (new) | 231 ++++++++++++++++++++++++++++++++
 1 file changed, 231 insertions(+)

diff --git a/src/backend/crypto/README b/src/backend/crypto/README
new file mode 100644
index 0000000000..be5e5557ba
--- /dev/null
+++ b/src/backend/crypto/README
@@ -0,0 +1,231 @@
+Cluster File Encryption
+=======================
+
+This directory contains support functions and sample scripts to be used
+for cluster file encryption.
+
+Architecture
+------------
+
+Fundamentally, cluster file encryption must store data in a file system
+in such a way that the keys required to decrypt the file system data can
+only be accessed using something outside of the file system itself.  The
+external requirement can be someone typing in a passphrase, getting a
+key from a key management server (KMS), or decrypting a key stored in
+the file system using a hardware security module (HSM).  The current
+architecture supports all of these methods, and includes sample scripts
+for them.
+
+The simplest method for accessing data keys using some external
+requirement would be to retrieve all data encryption keys from a KMS.
+However, retrieved keys would still need to be verified as valid.  This
+method also introduces unacceptable complexity for simpler use-cases,
+like user-supplied passphrases or HSM usage.  External key rotation
+would also be very hard since it would require re-encrypting all the
+file system data with the new externally-stored keys.
+
+For these reasons, a two-tiered architecture is used, which uses two
+types of encryption keys: a key encryption key (KEK) and data encryption
+keys (DEK). The KEK should not be present unencrypted in the file system
+--- it should be supplied by the user, stored externally (e.g., in a KMS)
+or stored in the file system encrypted with an HSM (e.g., PIV device).
+The DEK is used to encrypt database files and is stored in the same file
+system as the database but is encrypted using the KEK.  Because the DEK
+is encrypted, its storage in the file system is no more of a security
+weakness than the storage of the encrypted database files in the same
+file system.
+
+Implementation
+--------------
+
+To enable cluster file encryption, the initdb option
+--cluster-key-command must be used, which specifies a command to
+retrieve the KEK.  initdb records the cluster_key_command in
+postgresql.conf.  Every time the KEK is needed, the command is run and
+must return 64 hex characters which are decoded into the KEK.  The
+command is called twice during initdb, and every time the server starts.
+initdb also sets the encryption method in controldata during server
+bootstrap.
+
+initdb runs "postgres --boot", which calls function
+kmgr.c::BootStrapKmgr(), which calls the cluster key command.  The
+cluster key command returns a KEK, which is used to encrypt random bytes
+for each DEK; the results are written to the file system by
+kmgr.c::KmgrWriteCryptoKeys() (unless --copy-encryption-keys is used).
+Currently the DEK files are 0 and 1 and are stored in
+$PGDATA/pg_cryptokeys/live.  The wrapped DEK files use Key Wrapping with
+Padding which verifies the validity of the KEK.
+
+initdb also does a non-boot backend start which calls
+kmgr.c::InitializeKmgr(), which calls the cluster key command a second
+time.  This decrypts/unwraps the DEKs and stores them in the shared
+memory structure KmgrShmem. This step also happens every time the server
+starts. Later patches will use the keys stored in KmgrShmem to
+encrypt/decrypt database files.  KmgrShmem is erased via
+explicit_bzero() on server shutdown.
+
+Limitations
+-----------
+
+There doesn't seem to be a reasonable way to detect all malicious data
+modification or key extraction if a user has write permission on the
+files in PGDATA. It might be possible to limit the key extraction risk
+if postgresql.auto.conf were able to be moved to a directory outside of
+PGDATA, and if postmaster.opts could be moved or ignored when cluster
+file encryption is used. (postmaster.opts is used by pg_ctl restart.)
+
+It doesn't appear possible to detect all malicious writes --- even if
+you add message authentication code (MAC) checks to encrypted files,
+modifying non-encrypted files could still affect encrypted ones, e.g.,
+modifying files in pg_xact could affect how heap rows are interpreted.
+Basically you would need to encrypt all files, and at that point you
+might as well just use an encrypted file system. There also doesn't seem
+to be a way to prevent key extraction if someone has read permission on
+postgres process memory.
+
+Initialization Vector
+---------------------
+
+Nonce means "number used once". An Initialization Vector (IV) is a
+specific type of nonce: it must be unique, but not necessarily random
+or secret, as specified by NIST
+(https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf).
+To generate unique IVs, NIST recommends two methods:
+
+	The first method is to apply the forward cipher function, under
+	the same key that is used for the encryption of the plaintext,
+	to a nonce. The nonce must be a data block that is unique to
+	each execution of the encryption operation. For example, the
+	nonce may be a counter, as described in Appendix B, or a message
+	number. The second method is to generate a random data block
+	using a FIPS-approved random number generator.
+
+We will use the first method to generate IVs. That is, we select the
+nonce carefully and apply the cipher with the key to it to produce a
+value unique enough to use as an IV. The nonce selection for buffer
+encryption and WAL encryption is described below.
+
+If the IV were used more than once with the same key (and we only use one
+data encryption key), changes in the unencrypted data would be visible
+in the encrypted data.
+
+IV for Heap/Index Encryption
+- - - - - - - - - - - - - -
+
+To create the 16-byte IV needed by AES for each page version, we will
+use the page LSN (8 bytes) and page number (4 bytes).  In the remaining
+four bytes, one bit will be used to indicate if the LSN is WAL (real) or
+fake (see below). The LSN is ideal for use in the IV because it is
+always increasing, and is changed every time a page is updated.  The
+same LSN is never used for two relations with different page contents.
+
+However, the same LSN can be used in multiple pages in the same relation
+--- this can happen when a heap update expires an old tuple and adds a
+new tuple to another page.  By adding the page number to the IV, we keep
+the IV unique.
+
+By not using the database id in the IV, CREATE DATABASE can copy the
+heap/index files from the old database to a new one without
+decryption/encryption.  Both page copies are valid.  Once a database
+changes its pages, it gets new LSNs, and hence new IVs.  Using only the
+LSN and page number also avoids requiring pg_upgrade to preserve
+database oids, tablespace oids, and relfilenodes.
+
+As part of WAL logging, every change of a WAL-logged page gets a new
+LSN, and therefore a new IV automatically.
+
+However, the LSN must then be visible on encrypted pages, so we will not
+encrypt the LSN on the page. We will also not encrypt the CRC so
+pg_checksums can still check pages offline without access to the keys.
+
+Non-Permanent Relations
+- - - - - - - - - - - -
+
+To avoid the overhead of generating WAL for non-permanent (unlogged and
+temporary) relations, we assign fake LSNs that are derived from a
+counter via xlog.c::GetFakeLSNForUnloggedRel().  (GiST also uses this
+counter for LSNs.)  We also set a bit in the IV so the use of the same
+value for WAL (real) and fake LSNs will still generate unique IVs.  Only
+main forks are encrypted, not init, vm, or fsm files.
+
+In the code, we need to identify if a page uses WAL or fake LSNs in
+four places, when:
+
+1.  Reading a page from the file system and decrypting
+2.  Setting the WAL or fake LSN on a page
+3.  Handling hint bit changes that require new LSNs for the encryption IV
+4.  Encrypting and writing a page to the file system
+
+For all these cases, we have access to the fork number and either the
+relation's persistence state or the buffer state.  If it is a "main"
+fork and the relation persistence state is RELPERSISTENCE_PERMANENT, or
+if it is an "init" fork, we use a real LSN.  If it is a main fork and
+the state is not RELPERSISTENCE_PERMANENT, we use a fake LSN.  The buffer
+state BM_PERMANENT is true if the relation is PERMANENT or is an init fork.
+
+Init Forks
+- - - - -
+
+Init forks for unlogged relations get permanent LSNs because unlogged
+relation creation is WAL logged/crash safe, even though the relation's
+contents are not.  When the init fork is copied to represent an empty
+relation during crash recovery, it becomes a non-permanent page and must
+be successfully decrypted as such.  Therefore, when it is copied, its
+LSN is changed to a fake LSN and then encrypted.  This prevents a real
+LSN from being encrypted with the fake nonce bit.
+
+LSN Assignment, GiST, & Non-Permanent Relations
+- - - - - - - - - - - - - - - - - - - - - - - -
+
+LSN assignment has to be slightly modified for encryption.  In normal,
+non-encryption mode, LSNs are assigned to pages following these rules:
+
+1.  During GiST builds, some pages are assigned fixed LSNs (GistBuildLSN).
+
+2.  During GiST builds, non-permanent pages not assigned fixed LSNs in
+#1 are assigned fake LSNs, via gistutil.c::gistGetFakeLSN().
+
+3.  All other permanent pages are assigned WAL-based LSNs based on the
+WAL position of their WAL records.
+
+4.  All other non-permanent pages have LSNs of zero.
+
+When encryption is enabled:
+
+1.  During GiST builds, permanent pages are assigned WAL-based LSNs
+generated by xloginsert.c::LSNForEncryption().
+
+2.  During GiST builds, non-permanent pages are assigned fake LSNs.
+(No constant LSNs are used in #1 or #2.)
+
+3.  Same as #3 above.
+
+4.  All other non-permanent pages are assigned fake LSNs before page
+encryption.
+
+When switching to an encrypted replica from a non-encrypted primary,
+GiST indexes will be using fixed LSNs for permanent tables, so it is
+recommended to rebuild GiST indexes.  Non-permanent relations are not
+replicated, so they are not an issue.
+
+Hint Bits
+- - - - -
+
+For hint bit changes, the LSN normally doesn't change, which is a
+problem.  By enabling wal_log_hints, you get full page writes to the WAL
+after the first hint bit change of the checkpoint.  This is useful for
+two reasons.  First, it generates a new LSN, which is needed for the IV
+to be secure.  Second, full page images protect against torn pages,
+which is an even bigger requirement for encryption because the new LSN
+forces the entire page to be re-encrypted, not just the hint bits.  You
+can safely lose the hint bit changes, but you need to use the same LSN
+to decrypt the entire page, so a torn page with an LSN change cannot be
+decrypted.  To prevent this, wal_log_hints guarantees that the
+pre-hint-bit version (and previous LSN version) of the page is restored.
+
+However, if a hint-bit-modified page is written to the file system
+during a checkpoint, and there is a later hint bit change switching the
+same page from clean to dirty during the same checkpoint, we need a new
+LSN, and wal_log_hints doesn't give us a new LSN here.  The fix for this
+is to update the page LSN by writing a dummy WAL record via
+xloginsert.c::LSNForEncryption() in such cases.
-- 
2.20.1

